Validating Path Expressions In Step Functions

I created asl-path-validator to define the grammar for JSONPath expressions supported by AWS Step Functions. The library validates Path, Reference Path, and Payload Template expressions in Step Functions.

The Amazon States Language (ASL) uses JSONPath for its data mapping expressions and flow control but doesn’t provide a grammar for this language. The spec references a Java library for the syntax, but that library provides additional functions and operators not supported by Step Functions.

The grammar codifies the rules from the spec in a format we can leverage for a validating parser.

Expression Types

| Type | Description | Rules |
| --- | --- | --- |
| Path | Selects one or more nodes. | Illustrated with examples below |
| Reference Path | A valid Path expression that MUST select a single node. | Operators selecting multiple nodes are not permitted |
| Payload Template | A JSON object or array where all keys ending with .$ are evaluated as Path expressions or an Intrinsic Function | See table below for a list of Intrinsic Functions |

Expression Features

| Feature | Path | Reference Path | Payload Template |
| --- | --- | --- | --- |
| Simple dot notation or single predicate notation, e.g. $.library.movies | ✅ | ✅ | ✅ |
| Operators that select multiple nodes via descent, wildcard, or a filter: .. @ , : ? * | ✅ | ❌ | ✅ |
| Intrinsic Functions, e.g. States.JsonToString($.foo) (see below for the supported functions) | ❌ | ❌ | ✅ |

Examples

The spec contains examples for Reference Paths but not many useful examples for other types. Part of the exercise of writing the grammar was figuring out what works in the AWS Data flow simulator and deployed Step Functions.

| Expression | Path | Reference Path | Payload Template |
| --- | --- | --- | --- |
| $.store.book | ✅ | ✅ | ✅ |
| $.store\.book | ✅ | ✅ | ✅ |
| $.\stor\e.boo\k | ✅ | ✅ | ✅ |
| $.store.book.title | ✅ | ✅ | ✅ |
| $.foo.\.bar | ✅ | ✅ | ✅ |
| $.foo\@bar.baz\[\[.\?pretty | ✅ | ✅ | ✅ |
| $.&Ж中.\uD800\uDF46 | ✅ | ✅ | ✅ |
| $.ledgers.branch[0].pending.count | ✅ | ✅ | ✅ |
| $.ledgers.branch[0] | ✅ | ✅ | ✅ |
| $.ledgers[0][22][315 ].foo | ✅ | ✅ | ✅ |
| $['store']['book'] | ✅ | ✅ | ✅ |
| $['store'][0]['book'] | ✅ | ✅ | ✅ |
| States.Format('Welcome to {} {}s playlist.', $$, $.lastName) | ❌ | ❌ | ✅ |
| $[(@.length-1)].bar | ✅ | ❌ | ✅ |
| $.library.movies[?(@.genre)] | ✅ | ❌ | ✅ |
| $.library.movies[?(@.year == 1992)] | ✅ | ❌ | ✅ |
| $.library.movies[0:2] | ✅ | ❌ | ✅ |
| $.library.movies[0,1,2,3] | ✅ | ❌ | ✅ |
| $..director | ✅ | ❌ | ✅ |
| $.fooList[1:] | ✅ | ❌ | ✅ |
| $.store.book[*].author | ✅ | ❌ | ✅ |
| $.store.* | ✅ | ❌ | ✅ |
| $..* | ✅ | ❌ | ✅ |
| $.book[-2] | ✅ | ❌ | ✅ |
| $.book[-2:] | ✅ | ❌ | ✅ |
| $.book[?(@.price <= $['expensive'])] | ❌ | ❌ | ❌ |
| $.book[?(@.author =~ /.*REES/i)] | ❌ | ❌ | ❌ |
| ..book.length() | ❌ | ❌ | ❌ |
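
The Reference Path column can be approximated with a character scan. The sketch below is a heuristic written for illustration, not the library's grammar: `usesMultiNodeOperator` is a hypothetical name, and the list of flagged operators (including the negative index) is read off the tables above. It skips escaped characters and quoted member names so that rows such as $.foo\@bar.baz\[\[.\?pretty are not flagged.

```javascript
// Heuristic only: NOT the asl-path-validator grammar.
// Returns true when an unescaped, unquoted operator that can select
// multiple nodes appears in the expression.
function usesMultiNodeOperator(expr) {
  let inQuote = false;
  for (let i = 0; i < expr.length; i++) {
    const ch = expr[i];
    if (ch === "\\") { i++; continue; }                 // skip the escaped character
    if (ch === "'") { inQuote = !inQuote; continue; }   // enter or leave ['name']
    if (inQuote) continue;                              // ignore quoted member names
    if ("@,:?*".includes(ch)) return true;              // filter, union, slice, wildcard
    if (ch === "." && expr[i + 1] === ".") return true; // recursive descent
    if (ch === "[" && expr[i + 1] === "-") return true; // negative index, e.g. $.book[-2]
  }
  return false;
}

console.log(usesMultiNodeOperator("$['store'][0]['book']"));  // false
console.log(usesMultiNodeOperator("$.store.book[*].author")); // true
```

A scan like this cannot reject malformed input, which is one reason the library uses a generated parser instead.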

Expressions that don’t work in any context

Note that the table above contains a number of examples that don’t work in any context. This testing was done with the AWS Data flow simulator and ad hoc step functions. These expressions come from the Java library referenced by both the Amazon States Language and AWS Step Functions documentation. If an expression syntax isn’t supported in any context, I opted not to support it in the grammar. For example, the relational operators > >= == < <= only work with numeric values, so the parser emits an error for a non-numeric operand.

Context Expressions

A Context Expression is a Path expression that starts with $$. This uses the process’s Context Object as the document to evaluate against as opposed to the state’s data.

Reference from the spec:

When a Path begins with “$$”, two dollar signs, this signals that it is intended to identify content within the Context Object. The first dollar sign is stripped, and the remaining text, which begins with a dollar sign, is interpreted as the JSONPath applying to the Context Object.
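
The stripping rule quoted above is small enough to sketch directly. `resolveTarget`, `input`, and `context` are names made up for this illustration; the function only decides which document the remaining JSONPath applies to and does not evaluate it.

```javascript
// Decide which document a Path is evaluated against.
// `input` is the state's data, `context` is the Context Object (both supplied by the caller).
function resolveTarget(path, input, context) {
  if (path.startsWith("$$")) {
    // Strip the first dollar sign; what remains is an ordinary JSONPath.
    return { document: context, jsonPath: path.slice(1) };
  }
  return { document: input, jsonPath: path };
}

const target = resolveTarget("$$.Execution.Id", { order: 1 }, { Execution: { Id: "abc" } });
console.log(target.jsonPath); // "$.Execution.Id"
```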

Intrinsic Functions

| Name | Arguments | Comments |
| --- | --- | --- |
| States.Array | 0+ arguments | MAY contain one or more Path values |
| States.Format | 1+ arguments | MAY contain one or more Path values |
| States.JsonToString | 1 argument | MUST be a Path |
| States.StringToJson | 1 argument | MAY be a Path |

The grammar currently limits the Intrinsic Functions to those listed above. The ASL spec allows for extension functions but doesn’t describe how they are made known to the system.
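
As a rough illustration of the table above, the argument counts can be held in a lookup and checked once a call has been parsed. `INTRINSIC_ARITY` and `checkIntrinsicCall` are hypothetical names, and the sketch assumes the caller already knows the function name and the number of arguments; the library itself enforces these rules through its grammar.

```javascript
// Arity limits copied from the table above; Infinity means "no upper bound".
const INTRINSIC_ARITY = new Map([
  ["States.Array", { min: 0, max: Infinity }],
  ["States.Format", { min: 1, max: Infinity }],
  ["States.JsonToString", { min: 1, max: 1 }],
  ["States.StringToJson", { min: 1, max: 1 }],
]);

// Returns null when the call is acceptable, otherwise an error message.
function checkIntrinsicCall(name, argCount) {
  const arity = INTRINSIC_ARITY.get(name);
  if (!arity) return `unknown intrinsic function: ${name}`;
  if (argCount < arity.min || argCount > arity.max) {
    return `${name} does not accept ${argCount} argument(s)`;
  }
  return null;
}
```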

How this is done

  1. Update schema definitions to use format for expression fields.
  2. Use patternProperties to validate fields ending in .$
  3. Generate a parser from a PEG grammar to parse the expression
  4. Include additional validation for rules not encoded in the schema

Adding AJV Formats to the schemas

JSON Schema provides a format field for string types to provide additional semantics for validation. There are a few built-in formats like date-time or uuid, and AJV permits registering new names and validator functions. The asl-path-validator registers three new formats with the following names:

| Format | Description |
| --- | --- |
| asl_path | Field must be a Path expression |
| asl_ref_path | Field must be a Reference Path expression |
| asl_payload_template | Field must be a Payload Template expression |

The above formats are added to the relevant types in the schema. For example:

Before

{
  "OutputPath": {
    "type": "string"
  }
}

After

{
  "OutputPath": {
    "type": "string",
    "format": "asl_path"
  }
}
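
Registration could look roughly like the sketch below. AJV instances expose `addFormat(name, validatorFn)`; to keep the example dependency-free, `fakeAjv` is a stand-in object with the same method, and `validateExpression` is a hypothetical callback standing in for the generated parser. The placeholder check at the end only looks for a leading "$" and is not what the library does.

```javascript
// The three format names come from the table above.
const ASL_FORMATS = ["asl_path", "asl_ref_path", "asl_payload_template"];

// `ajv` must expose addFormat(name, fn), as AJV instances do.
// `validateExpression(kind, text)` is a hypothetical stand-in for the real parser.
function registerAslFormats(ajv, validateExpression) {
  for (const name of ASL_FORMATS) {
    ajv.addFormat(name, (text) => validateExpression(name, text));
  }
  return ajv;
}

// Tiny stand-in for an AJV instance so the sketch runs without dependencies.
const fakeAjv = {
  formats: new Map(),
  addFormat(name, fn) {
    this.formats.set(name, fn);
    return this;
  },
};

// Placeholder check only: the real validator parses the whole expression.
registerAslFormats(fakeAjv, (kind, text) => text.startsWith("$"));
```

With a real AJV instance in place of `fakeAjv`, any string field whose schema carries one of these format names would be passed to the matching function during validation.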

Recursive type for Payload Template

There are three rules for a Payload Template:

  1. if the JSON field ends in .$ then it MUST be a Path or Intrinsic Function
  2. if the JSON field doesn’t end in .$ then it MUST be a scalar type OR another Payload Template
  3. if the field is an array, all items MUST be valid Payload Templates
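
Written by hand, the three rules amount to a small recursive walk. The sketch below follows the rules as worded (so a scalar placed directly inside an array is reported); `validateTemplate` and the `isExpression` callback are hypothetical names, with `isExpression` standing in for the Path and Intrinsic Function parser.

```javascript
// Returns the locations of fields that break one of the three rules.
function validateTemplate(value, isExpression, where = "$") {
  const errors = [];
  if (Array.isArray(value)) {
    // Rule 3: every item must itself be a Payload Template.
    value.forEach((item, i) => {
      errors.push(...validateTemplate(item, isExpression, `${where}[${i}]`));
    });
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      const at = `${where}.${key}`;
      if (key.endsWith(".$")) {
        // Rule 1: the value must be a Path or Intrinsic Function.
        if (typeof child !== "string" || !isExpression(child)) errors.push(at);
      } else if (child !== null && typeof child === "object") {
        // Rule 2: a nested object or array is validated recursively.
        errors.push(...validateTemplate(child, isExpression, at));
      }
      // Rule 2: scalars (string, number, boolean, null) are accepted as-is.
    }
  } else {
    errors.push(where); // a bare scalar is not a Payload Template
  }
  return errors;
}
```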

Encoding these rules in JSON Schema delegates the traversal logic to AJV and avoids having to make multiple calls to look for nodes to validate.

The three rules above are implemented using oneOf and patternProperties.

{
  "asl_payload_template": {
    "oneOf": [
      {
        "type": "object",
        "patternProperties": {
          "^.+\\.\\$$": {
            "$comment": "matches fields ending in .$",
            "type": "string",
            "nullable": true,
            "format": "asl_payload_template"
          },
          "^.+(([^.][^$])|([^.][$]))$": {
            "$comment": "matches fields NOT ending in .$",
            "oneOf": [
              {
                "type": [
                  "number",
                  "boolean",
                  "string",
                  "null"
                ]
              },
              {
                "type": "array",
                "items": {
                  "$ref": "#/definitions/asl_payload_template"
                }
              },
              {
                "$ref": "#/definitions/asl_payload_template"
              }
            ]
          }
        }
      },
      {
        "type": "array",
        "items": {
          "$ref": "#/definitions/asl_payload_template"
        }
      }
    ]
  }
}
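
To see which keys each pattern claims, the two regular expressions can be tried on sample key names. The patterns below are copied from the schema above with the JSON string escaping removed; `classifyKey` is a hypothetical helper written for this check.

```javascript
// Patterns copied from the schema above (JSON string escapes removed).
const endsWithDotDollar = /^.+\.\$$/;
const otherKeys = /^.+(([^.][^$])|([^.][$]))$/;

function classifyKey(key) {
  if (endsWithDotDollar.test(key)) return "expression";
  if (otherKeys.test(key)) return "plain";
  return "unmatched";
}

console.log(classifyKey("Payload.$")); // "expression"
console.log(classifyKey("Payload"));   // "plain"
console.log(classifyKey("id"));        // "unmatched"
```

With these exact patterns, keys shorter than three characters (such as id) and keys whose second-to-last character is a dot (such as a.b) match neither pattern, so patternProperties places no constraint on them. That may be worth checking against the schema the library actually ships.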

Writing the grammar and generating a parser

Peggy provides a simple language and a nice online sandbox to test the grammar against inputs. A single grammar is defined to parse all valid expression types. The parser emits an Abstract Syntax Tree (AST) if the input is valid. Additional traversals of the AST are performed to enforce any rules specific to the context (e.g. Reference Path operator limits or Intrinsic Function use).

Since we’re not evaluating the expressions, the AST only needs to capture a small amount of information about the expression. We only need to record use of specific operators or functions.

For example, the CURRENT_VALUE operator @, which is used for filtering nodes, is not allowed in Reference Paths.

In the example below, the parser matches on the CURRENT_VALUE token followed by an optional subscript. The AST for this match records the node as @ and whatever the subscript value was.

jsonpath_
   = CONTEXT_ROOT_VALUE sub:subscript? {return {node: "$$", sub}}
   / ROOT_VALUE sub:subscript? {return {node: "$", sub}}
   / CURRENT_VALUE sub:subscript? {return {node: "@", sub}}
   / intrinsic_function
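
The first three alternatives of the rule above can be mimicked by hand to show why their order matters. This is a stand-in for illustration, not the generated parser: `parseHead` is a hypothetical name, and because the subscript rule is not shown here, whatever follows the token is kept as raw text.

```javascript
// Hand-written stand-in for the first three alternatives of jsonpath_.
function parseHead(input) {
  // Order matters, as in a PEG ordered choice: "$$" must be tried before "$",
  // otherwise a Context Expression would be read as "$" followed by "$...".
  for (const token of ["$$", "$", "@"]) {
    if (input.startsWith(token)) {
      const rest = input.slice(token.length);
      // An unmatched optional yields null, mirroring `subscript?`.
      return { node: token, sub: rest === "" ? null : rest };
    }
  }
  return null; // the intrinsic_function alternative is out of scope here
}
```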

Additional validation

If the parser is able to produce an AST without any errors, then the AST is checked to see if it contains any invalid operators or functions.

The AST nodes include hints to describe the nature of the operation. All the checks could be done in a single expression, but they’re split for better error reporting.
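
A traversal of this kind might look like the sketch below. The AST shape is an assumption made for illustration: plain objects whose `node` field names the token ("$", "@", "*", "..") and whose other fields may hold further nodes. `collectNodes`, `checkReferencePath`, and the error messages are hypothetical, and the library's real hints may differ.

```javascript
// Gather every `node` value found anywhere in the (assumed) AST.
function collectNodes(ast, found = new Set()) {
  if (Array.isArray(ast)) {
    ast.forEach((child) => collectNodes(child, found));
  } else if (ast !== null && typeof ast === "object") {
    if (typeof ast.node === "string") found.add(ast.node);
    Object.values(ast).forEach((child) => collectNodes(child, found));
  }
  return found;
}

// Separate checks, so that each violation gets its own message.
function checkReferencePath(ast) {
  const found = collectNodes(ast);
  const errors = [];
  if (found.has("@")) errors.push("filter expressions are not allowed in a Reference Path");
  if (found.has("*")) errors.push("wildcards are not allowed in a Reference Path");
  if (found.has("..")) errors.push("recursive descent is not allowed in a Reference Path");
  return errors;
}
```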

Conclusion

  • The grammar describes the syntax for the subset of JSONPath expressions allowed in Step Functions.
  • The JSON schemas in asl-validator use a custom format to identify the type of expression required.
  • The parser produces an AST if the expression is valid.
  • Regular schema validation through AJV calls our validator on each string with one of our known format values and reports back if the string is invalid.
 Date: July 11, 2022