zyla.neutrino.re - How much power is there in `json-schema`?

I was reading the json-schema spec, and it caught my attention that it supports recursion. Recursion frequently gives rise to non-trivial computational power. So the question is: where does json-schema lie on the automata hierarchy?

Referencing subschema definitions reminds me of state transitions in a finite state machine. In the next paragraph we’ll explore how we can express any FSM using json-schema.

With that we can build a little tool to compile a regular expression to an equivalent JSON Schema, which can then be interpreted by any json-schema validator.

(Note: of course JSON Schema has a built-in regex matching feature, but that would be boring, so let’s ignore this and focus on the “core” language of matching JSON structures)

The code for this post is in the regex-to-json-schema repository.

FSM to json-schema

A finite automaton recognizes sequences of symbols from some alphabet. We’ll consider the set of ASCII characters as our alphabet.

json-schema takes JSON documents as input, but finite automata process sequences. We need a way to encode our input seuquence as a JSON document. In our scheme we’ll use a linked list of symbols, where a cons cell is a two-element array, and nil is encoded as an empty array. For example, the string abbc would be encoded as ["a",["b",["b",["c",[]]]]].

States are encoded as subschemas. The root schema references the initial state.

Consuming a character and transitioning to another state is done by matching a two-element array. The first element is the consumed symbol, and the second element is the remainder of the sequence, matched against the state we transition to. For example, consuming b and transitioning to state S2 looks like:

{
  "type": "array",
  "items": [
    { "const": "b" },
    { "$ref": "#/definitions/S2" }
  ],
  "additionalItems": false
}

If there are multiple transitions from a given state, they are composed using oneOf. Note: there can be multiple transitions consuming the same character; that is, we can encode a NFA, not just a DFA.

The terminal state is a schema that matches an empty array:

{
  "const": []
}

Regex to FSM

To complete the regex compiler, we need to convert the regular expression to a NFA. This can be done in several ways, e.g. Thompson’s construction, and won’t be covered here.

Example

Consider the regular expression ab*c. It corresponds to the following automaton:

which, converted to JSON Schema, would be:

{
  "$ref": "#/definitions/S1",
  "definitions": {
    "S1": {
      "type": "array",
      "items": [
        { "const": "a" },
        { "$ref": "#/definitions/S2" }
      ]
    },
    "S2": {
      "oneOf": [
        {
          "type": "array",
          "items": [
            { "const": "b" },
            { "$ref": "#/definitions/S2" }
          ]
        },
        {
          "type": "array",
          "items": [
            { "const": "c" },
            { "$ref": "#/definitions/S3" }
          ]
        }
      ]
    },
    "S3": {
      "const": []
    }
  }
}

We can test it using any JSON Schema validator, e.g. the Python package jsonschema:

$ cat ac.input.json 
["a",["c",[]]]
$ jsonschema regex1.schema.json -i ac.input.json && echo "Matches!"
Matches!

$ cat abbbbbbc.input.json 
["a",["b",["b",["b",["b",["b",["b",["c",[]]]]]]]]]
$ jsonschema regex1.schema.json -i abbbbbbc.input.json && echo "Matches!"
Matches!

$ cat ab.input.json 
["a",["b",[]]]
$ jsonschema regex1.schema.json -i ab.input.json && echo "Matches!"
['b', []]: ['b', []] is not valid under any of the given schemas

$ cat invalid.input.json 
["i",["n",["v",["a",["l",["i",["d",[]]]]]]]]
$ jsonschema regex1.schema.json -i invalid.input.json && echo "Matches!"
i: 'a' was expected
['n', ['v', ['a', ['l', ['i', ['d', []]]]]]]: ['n', ['v', ['a', ['l', ['i', ['d', []]]]]]] is not valid under any of the given schemas

Automating it

The above transformations are encapsulated in a tool - regex-to-json-schema.

$ regex-to-json-schema --help
regex-to-json-schema 

USAGE:
    regex-to-json-schema <SUBCOMMAND>

OPTIONS:
    -h, --help    Print help information

SUBCOMMANDS:
    encode-input    Encode input as a JSON document suitable for validating against a schema
                        generated by the `json-schema` command
    help            Print this message or the help of the given subcommand(s)
    json-schema     Convert the regular expression to JSON Schema. To use the resulting schema,
                        convert input string using the `convert-input` command
    nfa             Convert the regular expression to NFA, and output it in DOT format

It can be combined with a validator like this:

$ jsonschema <(regex-to-json-schema json-schema 'ab*c') -i <(regex-to-json-schema encode-input abbc) && echo "Matches!"
Matches!

Conclusion

JSON Schema can at least recognize regular languages.

Can it do more? In terms of the automata hierarchy, I don’t think so - the next level is pushdown automata, which require a stack. JSON Schema doens’t seem to support any kind of memory other than remembering the current schema. Demonstrating this is left as an exercise for the reader.

Thanks for reading, and hope you enjoyed this!