JSON Parser

This StAX-style parser reads JSON directly into application data structures, without building a JSON DOM and while skipping all irrelevant data in the JSON stream.

Why StAX

Almost all real-world applications read JSON into their own data structures.

Parsers usually offer one of two approaches to do so:

  • Most parsers first read the file into Document Object Model (DOM) data structures and then let the application convert these DOM nodes into application objects. This leads to substantial memory and CPU overhead.
  • Other parsers provide the application with a streaming push interface (SAX). With this approach the JSON library becomes just a lexer, and all actual parsing is delegated to the application, which has to implement an ingenious hand-written state machine that:
    • maps keys to fields,
    • switches contexts and mappings on each object and array start/end,
    • skips all unneeded structures,
    • handles different variants of mappings,
    • converts, checks and normalizes data.

So these approaches are either slow and resource-consuming or painful to use. There is a third way, free from these downsides: a streaming pull interface (StAX), which is implemented in the Argentum json_Parser class.

  • Like SAX, it parses data on the fly without building an intermediate DOM.
  • Unlike SAX, the Argentum Parser doesn't feed the application a stream of tokens. Instead, it lets the application query for the data it expects.
  • If some parts of the incoming JSON are left unclaimed, they simply get skipped.
  • This combines the simplicity of DOM-based parsers with the high efficiency of SAX parsers.

Usage Examples

Create and initialize a Parser instance

using json { Parser }

p = Parser.init(jsonText);

The parser p is now ready to parse the text from jsonText.

An existing parser can be reset to parse another JSON text (or the same JSON data from the beginning) by calling init on the existing parser object. You don't need to create a new parser every time.
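
As a small sketch of such reuse (it relies only on init, getArr and getNum as they are described in this article; getArr and getNum are explained in the sections that follow, and the input strings are made up for illustration):

p = Parser.init("[1, 2]");
p.getArr\log("first: {p.getNum(0.0)} ");
// the same Parser object, restarted on a new input
p.init("[3, 4]");
p.getArr\log("second: {p.getNum(0.0)} ");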

Every standard JSON file contains a single root node that can be:

  • a number, boolean value, string, null,
  • an array of nodes,
  • an object, which is a key-value collection, where key is always a string, and value is a node.

The Argentum json_Parser initially stays at the root node, from which you can get a node of the type your application expects. Usually it's either an object or an array.

Read Arrays from JSON

To read an array from the current position, use getArr, which takes a ()void lambda as a parameter. This lambda is called for each array item encountered:

p.getArr{
   log("an array item!")
};

Of course, this lambda is intended not just to log the fact that an array item has been seen, but to extract the array item the way the application expects. So let's extract primitive nodes.

Read primitive data

Numbers

To extract a numeric node, call:

  • getNum(defaultValue double) double - returns either the extracted value or defaultValue if the current item in the stream is not a number,
  • or tryNum() ?double - returns a ?double that tells both whether the current node is actually a number and, if it is, its value.

You can call getNum/tryNum:

  • right after the Parser.init to check/extract a numeric root node (it's weird but legit),
  • or you can call getNum/tryNum from getArr lambdas to fetch array items,
  • or use them inside getObj lambdas (explained later) to extract object fields.

For example:

p.init("[1, 2, 3]");
p.getArr {
   p.tryNum() ? log("item {_} ");
};
// it prints: item 1 item 2 item 3

In this example we:

  • initialize a parser with the text "[1, 2, 3]", which is an array of numbers,
  • parse the root item as an array,
  • try to extract a number from each array item,
  • and if it's a number, print it.

// Or the same as above but using getNum and inline lambda:
p.getArr\log("item {p.getNum(0.0)} ");
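
The same calls can presumably be nested to read arrays of arrays. A minimal sketch, assuming getArr may be called from inside another getArr lambda in the same way it is called from inside getObj lambdas in the complex example at the end of this article (the input is invented for illustration):

p.init("[[1, 2], [3]]");
p.getArr {
   // each item of the outer array is read as an array itself
   p.getArr {
      log("item {p.getNum(0.0)} ");
   };
};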

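By the way, the Argentum JSON Parser always treats numbers as doubles. The JSON standard (RFC 8259) allows implementations to limit the range and precision of numbers and notes that good interoperability is achieved when values fit IEEE 754 double precision (a 53-bit significand). Software that expects more than that from numeric items is not portable or interoperable.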

Booleans

To extract boolean values, call:

  • getBool(defaultValue bool) bool - returns either the extracted value or defaultValue,
  • or tryBool() ?bool - returns a ?bool that tells both whether the current value in the stream is a bool and, if it is, holds its value.

p.init("[true, 5, false]");
p.getArr{
    p.tryBool()
      ? log("item {_?"true":"false"} ")
      : log("not a bool!");
};
// it prints: item true not a bool! item false

Strings

To extract string values, use tryStr() or getStr(defaultValue str) methods:

p.init("
  [
    "Baba", " ",
    "yetu"
  ]
");
p.getArr{
    p.tryStr() ? log(_);
};
// it prints: Baba yetu

The Parser processes and checks UTF-8 runes and handles all single-character escape sequences, such as \n and \t. It processes Unicode code points encoded as \uXXXX, and validates and combines surrogate pairs into valid UTF-8 runes.

Sometimes the application knows in advance that a text is limited to some sane number of characters. In this case, instead of tryStr/getStr it can call tryStrWithLimit/getStrWithLimit, which extract no more than the given number of UTF-8 runes and skip the rest of the string.
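
A minimal sketch of the limited variant, assuming tryStrWithLimit takes the maximum number of runes as its only parameter (this is how it is called in the getBoolMyWay example in this article); the input text is invented for illustration:

p.init("
  ["a very long comment that we do not need in full"]
");
p.getArr{
    // keep at most 6 runes of each string; the rest of the string is skipped
    p.tryStrWithLimit(6) ? log(_);
};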

Null

To extract a null value, use the tryNull method, which simply returns a bool indicating whether there was a null in the stream.
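
A short sketch in the style of the tryBool example above, assuming tryNull can be called inside a getArr lambda like the other try* methods (the input is invented for illustration):

p.init("[1, null, 3]");
p.getArr{
    // check for null first, otherwise read the item as a number
    p.tryNull()
      ? log("null ")
      : log("item {p.getNum(0.0)} ");
};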

When to use try* and get*

If a try* method doesn't see its corresponding data type in the stream, it just returns optional-nothing and leaves the stream intact. This allows multiple attempts to extract data in different ways. For example, if different versions of our JSON return certain flags sometimes as a bool, sometimes as a number 0/1, and sometimes as a string "yes"/"no", we can handle that.

Let's extend the json_Parser with a new method:

class Parser {
  getBoolMyWay(def bool) bool {
    tryNum() ? _ != 0.0 : 
    tryStrWithLimit(5) ? _ == "true" || _ == "yes" || _ == "1" :
    getBool(def)
  }
}

log(Parser.init("1").getBoolMyWay(false) ? "aye" : "nope");
// this prints aye

So the rule of thumb is:

  • use try* methods:
    • if you need to check for multiple primitive types in one field/array item
    • or you need some special handling if input data is of unexpected type
  • use get* methods:
    • if you have only one type expected
    • and you are ok with default value.

Read JSON objects

Objects are handled with the getObj method. This method takes a lambda that is called for each object field. The lambda has one parameter, the field name, and it can use any of the Parser's try* or get* methods to extract the field value. Example:

p.init("
  {
    "name": "Andrew",
    "unexpected data": false,
    "year": 1972
  }
");
p.getObj {
    _=="year" ? log("field year with number {p.getNum(0.0)}"):
    _=="name" ? log("field name with string {p.getStr("unknown")}");
};
// This prints
//   field name with string Andrew
//   field year with number 1972

Create Arrays/Objects on demand

In most cases the object whose fields we want to fill from JSON exists regardless of whether the current JSON position contains an object. But sometimes we want to create objects and arrays only if the current element is an actual array or object. For this scenario the Parser has two additional predicates:

  • isArr() bool - tells if the current element is an array
  • isObj() bool - tells if it's an object

They are intended to be used this way:

x = json.isArr() ? DoubleArray.{
   json.getArr\_.append(json.getNum(0.0))
}

This code checks whether the current element is an array and, only if it is, creates a DoubleArray instance and fills it with numbers from the JSON. The variable x here is of type optional DoubleArray.
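
By analogy, isObj can guard the creation of an object. A sketch, assuming the Point class from the complex example later in this article and the same json parser variable; it only recombines constructs that appear elsewhere in this article:

pt = json.isObj() ? Point.{
    // inside the {}-block "_" is the new Point, filled field by field
    json.getObj `f (
        f=="x" ? _.x := float(json.getNum(0.0)) :
        f=="y" ? _.y := float(json.getNum(0.0)))
};

Here pt would be an optional Point that holds a value only if the current element is an object.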

Error Handling

The Parser has four methods for error handling:

  • getErrorMessage() ?str - tells whether the parser is in an error state and, if so, returns its error message.
  • getErrorPos() ?Cursor - returns the rest of the unparsed text after the error.
  • setError(text str) - switches the Parser into an error state (if it's not already in one) and sets an error message.
  • success() - checks whether the parser successfully parsed all its text from start to end.

In the error state the Parser returns false/optional-none/defaultVal from all calls and ends all iterations over array items and object fields. Once the Parser has entered the error state, the state can be cleared only with the init method, which starts a new parsing.
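
A short sketch of how these methods might be combined, assuming setError may be called from inside a getObj lambda to reject a value the application considers invalid (the "year" rule, the message and the input are invented for illustration):

p.init("
  { "year": -5 }
");
p.getObj {
    // application-level validation: a negative year is treated as an error
    _=="year" && p.getNum(0.0) < 0.0 ? p.setError("negative year");
};
p.success() : log("parsing error {p.getErrorMessage()}");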

Complex Example

Let's assume that our application has two classes, a Point and a Polygon:

class Polygon {
    name = "";
    points = Array(Point);
    isActive = false;
}
class Point{
    x = 0f;
    y = 0f;
}

Our application expects the JSON to contain an array of polygons, something like this:

[
    {
        "active": false,
        "name": "p1",
        "points": [
            {"x": 11, "y": 32},
            {"y": 23, "x": 12},
            {"x": -1, "y": 4}
        ]
    },
    {
        "points": [
            {"x": 10, "y": 0},
            {"x": 0, "y": 10},
            {"y": 0, "x": 0}
        ],
        "active": true,
        "name": "Corner"
    }
]

This structure can be parsed in a straightforward way:

fn readPolygonsFromJson(data str) Array(Polygon) {
    Array(Polygon).{                                                           // 1
       json = Parser.init(data);
       json.getArr\_.append(Polygon)-> json.getObj `f (                        // 2
          f=="active" ? _.isActive := json.getBool(false) :
          f=="name"   ? _.name := json.getStr("") :
          f=="points" ? json.getArr\_.points.append(Point)-> json.getObj `f (  // 3
              f=="x" ? _.x := float(json.getNum(0.0)) :
              f=="y" ? _.y := float(json.getNum(0.0))));
       json.success() : log("parsing error {json.getErrorMessage()}{json.getErrorPos()?" at {_.offset}"}");
    }
 }

In this code:

  • In line #1 we create and return an array of Polygons, but before returning it, we process it with the "colombo" operator: result_expression.{ actions }. Inside the {}-block, the "_" name denotes the resulting array.
  • In line #2 we create a Polygon instance for each array item, append it to the array and pass it with the -> operator to the expression json.getObj, so inside the getObj lambda the "_" name refers to this newly created polygon.
  • In line #3 we do the same trick with a new Point instance that is inserted into the current polygon's points array. So inside the second getObj lambda, the "_" name refers to the newly created point.

These 12 lines of code handle all the edge cases:

  • If input JSON has an unexpected field, this field is skipped with all its subtree.
  • This code is tolerant to any order of fields in objects.
  • If some field is absent from the JSON, the corresponding object field keeps its default value.
  • If some field, root element or array item in the JSON is of an unexpected type, it will be skipped and replaced with the default value.
    For example, if json.getBool(true) is called on an array of objects, this array gets skipped and the default result (true) is stored.
  • Since all parsing is performed in plain Argentum code, we can easily add validating/transforming/versioning logic without inventing template-driven or string-encoded DSLs.
  • The Parser is strict and resilient: it validates its input against the JSON standard and detects and reports all errors.

Bottom line

  • This is the first module entirely written in Argentum.
  • It is implemented in 50% fewer lines of code than the C++ version. So Argentum is pretty expressive.

Readiness

The JSON module is available in Argentum when it is built from source.

It is not yet included in the playground or the binary demo.
