Module Content
- A StAX (reverse SAX) parser
- A direct writer (in compact and pretty-print forms)
- A JSON DOM for special cases
It's available in the playground, so go and try it yourself 🙂
The Argentum JSON module provides three separate ways to handle JSONs in your application:
- read them in document object model (DOM) mode,
- or read them directly into your application data structures, process them naturally, and write them back,
- or process them in streaming mode, parsing and retaining only the small parts of data that need to be accumulated.
Use cases
To be specific, we need a common task done in three different ways, allowing us to compare code complexity, allocations, processing speed, and other parameters.
In this post we will parse, modify, and write back a JSON file containing an array of polygons with arrays of points. Something like this:
const xInputJson = "
[
    {
        "active": false,
        "name": "p1",
        "points": [
            {"x": 11, "y": 32},
            {"y": 23, "x": 12},
            {"x": -1, "y": 4}
        ]
    },
    {
        "points": [
            {"x": 10, "y": 0},
            {"x": 0, "y": 10},
            {"y": 0, "x": 0}
        ],
        "active": true,
        "name": "Corner"
    }
]
";
Our mission, should we accept it, is to modify the `x` and `y` fields of `points` this way:
x := x + y
y := y * 100
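For example, the first point `{"x": 11, "y": 32}` becomes `{"x": 43, "y": 3200}`.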
Using JSON DOM
Our first candidate is the DOM approach, mostly because it is the main, and sometimes the only, approach available in other programming languages and JSON libraries.
using sys{ Array, log }
using json{ Parser, Writer, JArr, JObj, JNum, read } // <<- additional imports
using array;
// Read. The `root` variable is of type `json_JNode`.
// Note that we pass a Parser object to the `read` function, which allows us
// to build DOM data structures out of parts of the actual JSON
// by calling it in the middle of other kinds of JSON parsing.
root = read(Parser.init(xInputJson));
// Write it back. Again, since we pass a Writer instance to the `JNode.write` method,
// we can serialize our DOM data as part of another serialization process.
// We can also fine-tune the Writer to produce different JSON formatting.
log(root.write(Writer.useSpaces(2)).toStr());
Reading and writing with DOM is the easiest among all approaches, but let's try to modify this DOM:
root~JArr ? _.each {
    _~JObj && _["points"] && _~JArr ? _.each {
        _~JObj ? `pt {
            pt["x"] && _~JNum ? `x
            pt["y"] && _~JNum ? `y {
                x.n += y.n;
                y.n *= 100.0
            }
        }
    }
};
- In line 1 we check whether `root` is an actual array, and if it is, we iterate over it.
- In line 2 we check:
  - if the array item is an object, and if it is,
  - if it has a "points" field, and if it has,
  - if this field is an array, and only if all three checks pass, we iterate over this array.
- In line 3 we check if the current array item is an object, and if it is, we store it in a temporary variable `pt`.
- In lines 4 and 5 we check if the object node `pt` has fields "x" and "y" and if they are numeric nodes; on success we store these nodes in the temporary variables `x` and `y`.
- Lines 6 and 7 perform the actual modification of the fields `x` and `y`.
Skip one check, and the code won't compile:
- without `~JArr` in line 1, you can't iterate, because `JNode` is not an array and has no `each` method,
- without `?` in the same line, you can't call the method, because the typecast operator `~` returns `optional<pointer>`, and you need to unwrap it with `?` to extract the actual pointer in order to call the method. This applies to every statement: your code won't compile until you check all the potentially bad corner cases.
You may ask: "Why so many checks? In JavaScript I can just write:"
root.forEach(polygon => {
    polygon.points.forEach(point => {
        point.x += point.y;
        point.y *= 100;
    });
});
Yes and no. This code could crash if input data contains unexpected node types. The safe and resilient JavaScript code looks like this:
if (Array.isArray(root)) {
    root.forEach(polygon => {
        if (polygon &&
            typeof polygon === 'object' &&
            Array.isArray(polygon.points))
        {
            polygon.points.forEach(point => {
                if (point &&
                    typeof point === 'object' &&
                    typeof point.x === 'number' &&
                    typeof point.y === 'number')
                {
                    point.x += point.y;
                    point.y *= 100;
                }
            });
        }
    });
}
With all these added checks, safety, and resilience, the JavaScript code becomes larger and more redundant than the Argentum one (for example, it repeatedly accesses the same object fields, and each of these field accesses is a text-key lookup in a hash map).
Other languages for reference:
Rust example
use serde_json::{json, Value};

fn process_dom(root: &mut Value) {
    if let Value::Array(polygons) = root {
        for polygon in polygons.iter_mut() {
            if let Value::Object(polygon_obj) = polygon {
                if let Some(Value::Array(points)) = polygon_obj.get_mut("points") {
                    for point in points.iter_mut() {
                        if let Value::Object(point_obj) = point {
                            // Read both numbers first: two simultaneous get_mut()
                            // calls would be two mutable borrows of the same map.
                            let x = point_obj.get("x").and_then(Value::as_f64);
                            let y = point_obj.get("y").and_then(Value::as_f64);
                            if let (Some(x), Some(y)) = (x, y) {
                                point_obj.insert("x".into(), json!(x + y));
                                point_obj.insert("y".into(), json!(y * 100.0));
                            }
                        }
                    }
                }
            }
        }
    }
}
Swift example
func processDom(_ root: inout Any) {
    if var rootArray = root as? [[String: Any]] {
        for i in 0..<rootArray.count {
            var polygon = rootArray[i]
            if var points = polygon["points"] as? [[String: Any]] {
                for j in 0..<points.count {
                    if var point = points[j] as? [String: Any],
                       let x = point["x"] as? Double,
                       let y = point["y"] as? Double {
                        point["x"] = x + y
                        point["y"] = y * 100
                        points[j] = point // This COW-fighting is a Swift-specific feature
                    }
                }
                polygon["points"] = points // And here
                rootArray[i] = polygon // And here
            }
        }
        root = rootArray // And here
    }
}
// The above example's complexity grows exponentially with the nesting depth
// (O(N^2) in this case of 2 nesting levels), because Swift arrays and
// dictionaries have value semantics.
C++ example
#include <nlohmann/json.hpp>

void processDom(nlohmann::json& root) {
    if (root.is_array()) {
        for (auto& polygon : root) {
            if (polygon.is_object() &&
                polygon.contains("points") &&
                polygon["points"].is_array())
            {
                for (auto& point : polygon["points"]) {
                    if (point.is_object() &&
                        point.contains("x") &&
                        point.contains("y") &&
                        point["x"].is_number() &&
                        point["y"].is_number())
                    {
                        double x = point["x"];
                        double y = point["y"];
                        point["x"] = x + y;
                        point["y"] = y * 100;
                    }
                }
            }
        }
    }
}
// Please notice that this example searches the hash map four times with the
// same string key "x" (contains, is_number, read, write), and likewise for "y".
This is a good illustration of the distinction between Argentum and other languages. In other languages you can easily write unsafe, non-resilient code, while making it safe and robust takes a visible amount of effort. In contrast, Argentum makes it relatively easy to write safe and resilient code, while unsafe code is ruled out at the syntax and type-check levels.
Anyway, this DOM approach has a number of disadvantages:
- It allocates memory for all nodes even if the application doesn't need them. In fact, the application can't even predict the number of allocations, and an attacker can send a huge JSON to deplete memory.
- The application converts these DOM nodes into its own data structures anyway, and all these allocations, deallocations, checks, and traversals quickly become a burden on CPU and memory.
- Sometimes the input JSON contains unexpected data, or data in a format slightly different from what the application expects. The DOM approach makes it harder to validate data against a strict application schema.
That's why it is usually better to read JSON documents directly into application data structures.
Reading JSONs into and writing them back from application data structures
This approach has already been described in the posts about the StAX parser and the streaming Writer. In those posts we wrote monolithic functions to read and write these data structures. Let's do it another way here:
// First we define the application data formats and their JSON-handling methods:
class Point {
    x = 0.0;
    y = 0.0;
    readField(f str, json Parser) this { // This function handles a single field from JSON
        f=="x" ? x := json.getNum(0.0) :
        f=="y" ? y := json.getNum(0.0)
    }
    writeFields(j(str)Writer) { // This function writes all fields to JSON
        j("x").num(x);
        j("y").num(y)
    }
}
class Polygon {
    name = "";
    points = Array(Point);
    isActive = false;
    readField(f str, json Parser) this {
        f=="active" ? isActive := json.getBool(false) :
        f=="name" ? name := json.getStr("") :
        f=="points" ? json.getArr\points.append(Point)-> json.getObj`f _.readField(f, json);
    }
    writeFields(j(str)Writer) {
        j("name").str(name);
        j("active").bool(isActive);
        j("points").arr\points.each`pt _.obj\pt.writeFields(_);
    }
}
// Second, add handling of arrays of Polygons:
fn readPolygonsFromJson(data str) Array(Polygon) {
    Array(Polygon).{
        json = Parser.init(data);
        json.getArr\_.append(Polygon)-> json.getObj `f _.readField(f, json);
        json.success() : log("parsing error {json.getErrorMessage()}")
    }
}
fn writePolygonsToJson(data Array(Polygon)) str {
    Writer.useSpaces(1).arr {
        data.each `poly _.obj\poly.writeFields(_)
    }.toStr()
}
Having these readers and writers for the application data structures, we can make our task as simple as:
xInputJson->readPolygonsFromJson(_).{
    _.each\_.points.each {
        _.x += _.y;
        _.y *= 100.0
    }
}->writePolygonsToJson(_)->log(_)
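For reference, other languages achieve a similar "read directly into application data structures" flow with data-binding libraries. Here is a minimal Rust sketch using serde derive (assuming the serde and serde_json crates; the struct and field names simply mirror the example above):
// Application data structures annotated for serde data binding.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Point {
    x: f64,
    y: f64,
}

#[derive(Serialize, Deserialize)]
struct Polygon {
    name: String,
    active: bool,
    points: Vec<Point>,
}

fn transform(input: &str) -> serde_json::Result<String> {
    // Parse straight into application data structures...
    let mut polygons: Vec<Polygon> = serde_json::from_str(input)?;
    // ...apply the x/y transformation...
    for polygon in &mut polygons {
        for point in &mut polygon.points {
            point.x += point.y;
            point.y *= 100.0;
        }
    }
    // ...and serialize back with pretty-printing.
    serde_json::to_string_pretty(&polygons)
}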
This approach has multiple advantages:
- It strictly checks the input data against the input schema defined in the `readField` methods, and we can encapsulate all input versions and variations in these methods.
- It produces JSON in the exact schema defined in the `writeFields` methods.
- There is no "garbage-in, garbage-out" as in the DOM approach. All data beyond the application schema is filtered out: it doesn't occupy memory, it doesn't consume CPU, and it doesn't poison the output.
- All data structures in this approach are native data structures, not string-to-node hash maps. It's much more efficient.
At the same time this approach has two disadvantages:
- Much more code.
- We still, as in the DOM approach, allocate all data structures in memory at once.
Processing JSON in a streaming manner
There is a third way. Our Parser and Writer are combinable, so we can create a streaming processing function:
fn process(inText str) str {
    in = Parser.init(inText);
    out = Writer.useSpaces(2).arr\in.getArr\_.obj\in.getObj`f (
        f=="name" ? _(f).str(in.getStr("")) :
        f=="active" ? _(f).bool(in.getBool(false)) :
        f=="points" ? _(f).arr\in.getArr\_.obj {
            x=0.0;
            y=0.0;
            in.getObj`f (
                f=="x" ? x:=in.getNum(0.0):
                f=="y" ? y:=in.getNum(0.0));
            _("x").num(x + y);
            _("y").num(y * 100.0)
        });
    out.toStr()
}
log(process(xInputJson))
// or a fancier way:
xInputJson->process(_)->log(_)
- In line 3 we:
  - Create and tune up a `Writer` instance.
  - Then we write an array (`.arr`).
  - And fill it by iterating over the array items fetched from the input JSON parser (`in.getArr`).
  - And for each fetched array item we write an object (`_.obj`).
  - And fill it with fields fetched from the input JSON (`in.getObj`).
- Lines 4 and 5 replicate the `name` and `active` scalar fields, forcing their data to be `string` and `bool` respectively.
- Line 6 replicates the `points` field containing an array; like in line 3, we create an array and fill it with the content of the input array, but this time we don't replicate the fields 1-to-1. Instead we:
  - accumulate the fields in local variables (lines 9, 10, 11),
  - and fill the output object with the transformed fields.
This approach has a number of advantages:
- No garbage, all data gets filtered and checked against schema.
- Absolutely no build-up of allocations in memory. All data gets processed in streaming mode.
Unfortunately, this method has a very narrow area of applicability.
SAX
I have no idea why, but most existing JSON libraries (in other languages) support only DOM and SAX parsing. In my humble opinion, SAX is the weirdest and most difficult style of API, but it is also supported in the Argentum JSON module, with a small addition:
interface ISaxReader {
    onArrayStart();
    onArrayEnd();
    onObjectStart();
    onObjectEnd();
    onField(name str);
    onNull();
    onBool(v bool);
    onNum(v double);
    onString(v str);
}
fn parseWithSax(in Parser, r ISaxReader) {
    in.tryNum() ? r.onNum(_) :
    in.tryStr() ? r.onString(_) :
    in.tryBool() ? r.onBool(_) :
    in.tryNull() ? r.onNull() :
    in.isArr() ? {
        r.onArrayStart();
        in.getArr\parseWithSax(in, r);
        r.onArrayEnd()
    } :
    in.isObj() ? {
        r.onObjectStart();
        in.getObj`f {
            r.onField(f);
            parseWithSax(in, r)
        };
        r.onObjectEnd()
    }
}
This function converts the input JSON into a sequence of calls to the `ISaxReader` interface. Use it at your discretion.
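For comparison, a SAX consumer in other languages looks much the same: an interface of callbacks plus a driver that fires them in document order. Here is a hedged Rust sketch of an equivalent callback trait with a trivial consumer (the trait and names only mirror ISaxReader for illustration; they are not an existing library API):
// Callback interface mirroring ISaxReader; all methods default to no-ops.
trait SaxReader {
    fn on_array_start(&mut self) {}
    fn on_array_end(&mut self) {}
    fn on_object_start(&mut self) {}
    fn on_object_end(&mut self) {}
    fn on_field(&mut self, _name: &str) {}
    fn on_null(&mut self) {}
    fn on_bool(&mut self, _v: bool) {}
    fn on_num(&mut self, _v: f64) {}
    fn on_string(&mut self, _v: &str) {}
}

// Example consumer: counts how many numeric values the document contains.
struct NumberCounter {
    count: usize,
}

impl SaxReader for NumberCounter {
    fn on_num(&mut self, _v: f64) {
        self.count += 1;
    }
}
A parser such as parseWithSax above would simply drive these callbacks as it walks the input.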
String-to-string
Sometimes, in the middle of stream processing or StAX parsing, you need to parse some subtree (an array item or a specific field) in a pass-through manner, producing a text string of this sub-JSON. This code can help:
fn scan(in Parser, out Writer) {
    in.tryNum() ? out.num(_) :
    in.tryStr() ? out.str(_) :
    in.tryBool() ? out.bool(_) :
    in.tryNull() ? out.null() :
    in.isArr() ? out.arr\in.getArr\scan(in, _) :
    in.isObj() ? out.obj\in.getObj`f scan(in, _(f));
}
Give this function:
- an existing parser positioned at an array or field value,
- and a newly created `Writer`,
and it produces a text with a filtered, normalized, and formatted JSON representing this subtree.
This function can also be applied to a full JSON document. It is useful to compactify/tabify/indent/unindent various JSONs:
fn compactify(inJson str) str {
    Writer.{scan(Parser.init(inJson), _)}.toStr()
}
fn tabify(inJson str) str {
    Writer.useTabs().{scan(Parser.init(inJson), _)}.toStr()
}
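For example, `xInputJson->compactify(_)->log(_)` should print the test document on a single line, while `tabify` should produce a tab-indented version.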
Combining methods
- You can start parsing JSON in streaming mode.
- Then, at a specific field of a specific object, create an application object and read it with `getObj\readField`.
- Then, inside this object, parse some JSON subtree as a DOM tree and store it in an object field.
- Or use the `scan` function from the previous topic to extract a subtree as text.
- Or, while parsing data as an application object, read some fields into local variables and transform them into a set of fields in streaming mode.
Combine these methods depending on your goal.
Bottom line
The Argentum JSON module allows you to process data in multiple ways: streaming, DOM, SAX, StAX, direct copy, and all combinations of the above.