Parsing complex search queries

banner

Parsing complex search queries

An important factor in a system like Seams-CMS is the fact we need to be able to let users search for data. Such search queries could be simple:

query all entries that start with "foo".

But they could also become very complex:

query all entries that start with "foo" and are tagged with either "news" or "featured" or have at least 10 comments but no more than 25, and is published at most 1 year ago.

Creating a system that can deal with such complex questions sounds like a daunting task, but is, in fact, a relatively simple problem: we treat our search queries as a custom programming language.

Lexing and parsing

Creating a custom programming language just for search queries sounds like overkill but it's not. Most querying systems, like SQL or ElasticSearch Query language or any system, deal with this the same way. The magic solution is called lexing and parsing.

To make things work, we need to do these four steps:

  • Step 1: Identify tokens from our search query (lexing)
  • Step 2: Analyze the tokens and make sure they adhere to our ruleset (parsing)
  • Step 3: Create an Abstract Syntax Tree (AST) from these tokens.
  • Step 4: Use the AST for building your search.

This, in fact, is exactly what pretty much every programming language does too: from your source code, it tries to decipher different tokens (keywords, strings, variables, etc) and makes sure these tokens follows the rules of the programming language. Next, it will convert these tokens into an Abstract Syntax Tree, which is a representation of your program that can be easily processed or executed.

Since our search queries are handled by an API written in PHP, we need to find a way to lex and parse in PHP. Fortunately, there is already a common library available for this: the Doctrine Lexer. It is mostly used for parsing and generating the DQL language (doctrine query language) but can be easily used for lexing and parsing your custom language. Sweet!

Backus normal form

Before we can start with these steps, we must create a blueprint for our query language. This is completely up to you and pretty much the sky is the limit. Do however make sure your rules are easy to follow otherwise you not only confuse yourself and must create a complex parser, but it will also confuse your users.

One way of creating such a blueprint is with the help of the Backus Normal Form (BNF). This is a standardized way of writing your rules. Some lexer/parser systems can generate parsers directly from this, but in our case, we need to create a parser manually. Again a reason to make the language as simple as possible.

(note that the following is not completely in BNF notation, but readable enough)

 query := simple_query_expression |
          query AND simple_query_expression |
          query OR simple_query_expression
 
 simpleQueryExpression := identifier operator operand | '(' identifier operator operand ')'
 
 identifier := [a-zA-Z0-9_.]+
 
 operator := eq|ne|gt|le|lt|in|ni|contains
 
 operand := scalar | array
 
 array := '[' array_value ']'
 array_value := scalar | scalar ',' array_value
 
 scalar := qstring | string | int | date
 qstring := '"' [^"]+ '"'
 string := [^\s]+
 int := [0-9]+
 date := ISO8601 datetime

Our language consists of a query which can be either a simple_query_expression OR a query followed by the keyword AND followed by another simplequeryexpression, OR a query followed by the keyword OR followed by another simplequeryexpression. This allows us to create complex AND/OR queries.

A simplequeryexpression is nothing more than an identifier followed by an operator followed by an operand, OR the same thing but withing parenthesis. This means that ("foo" eq "bar") and "foo" eq "bar" are the same thing.

An identifier is nothing more than a word that matches the [a-zA-Z0-9_]+ regex. Some languages prefer to have identifiers starting with a letter first for instance, but in our case, it doesn't matter.

Operators are words: eq, ne, gte, gt etc.. We could have opted for operators like == !=, > etc but since we are sending these search queries mostly through URLs, it would be easier to read them as things like == would be URL-escaped.

Scalars, strings, qstrings, int, and dates are also relatively simple tokens.

Arrays are anything between brackets: [ "foo", 4 ], [ "foo", "bar", "baz" ] etc.

Lexing

Now that we have created our language, it's time to start with the lexer.

<?php

namespace SeamsCMS\Query\Filter;

use Doctrine\Common\Lexer\AbstractLexer;

class Lexer extends AbstractLexer
{
    // All tokens that are not valid identifiers must be < 100
    public const T_NONE              = 1;
    public const T_NUMBER            = 2;
    public const T_STRING            = 3;
    public const T_OPEN_PARENTHESIS  = 4;
    public const T_CLOSE_PARENTHESIS = 5;
    public const T_OPEN_BLOCK        = 6;
    public const T_CLOSE_BLOCK       = 7;
    public const T_COMMA             = 8;
    public const T_DATE              = 9;

    // All keyword tokens should be >= 200
    public const T_AND      = 200;
    public const T_OR       = 201;
    public const T_EQ       = 202;
    public const T_NE       = 203;
    public const T_LT       = 204;
    public const T_LTE      = 205;
    public const T_GT       = 206;
    public const T_GTE      = 207;
    public const T_IN       = 208;
    public const T_NI       = 209;
    public const T_CONTAINS = 210;

    /**
     * {@inheritdoc}
     */
    protected function getCatchablePatterns()
    {
        return [
            '[a-z0-9_.]*',    // Identifiers
            '(?:[0-9]+)?',    // numbers
            '(?:"[^"]+")',    // Double quoted strings
        ];
    }

    /**
     * {@inheritdoc}
     */
    protected function getType(&$value)
    {
        $type = self::T_NONE;

        switch (true) {
            // Recognize numeric values
            case (is_numeric($value)):
                return self::T_NUMBER;

            // Recognize quoted strings
            case ($value[0] === "\""):
                $value = str_replace("\"\"", "\"", substr($value, 1, strlen($value) - 2));

                if (is_numeric($value)) {
                    // Quoted numbers are still numbers 
                    return self::T_NUMBER;
                }

                // See if we are a quoted date
                try {
                    $ret = new \DateTime($value, new \DateTimeZone("UTC"));
                    if ($ret instanceof \DateTime) {
                        return self::T_DATE;
                    }
                } catch (\Exception $e) {
                    // It's ok when it's not a date.
                }

                return self::T_STRING;

            // Recognize identifiers, aliased or qualified names
            case (ctype_alpha($value[0])):
                $name = 'SeamsCMS\\Query\\Filter\\Lexer::T_' . strtoupper($value);

                if (defined($name)) {
                    $type = constant($name);

                    if ($type > 100) {
                        return $type;
                    }
                }

                return self::T_STRING;

            // Recognize symbols
            case ($value === '('):
                return self::T_OPEN_PARENTHESIS;

            case ($value === ')'):
                return self::T_CLOSE_PARENTHESIS;

            case ($value === '['):
                return self::T_OPEN_BLOCK;

            case ($value === ']'):
                return self::T_CLOSE_BLOCK;

            case ($value === ','):
                return self::T_COMMA;

            // Default
            default:
                // Do nothing
        }

        return $type;
    }
}

We define symbols (constants) for each token we want to identify. This means that whenever we detect a , in our language, the lexer will return a T_COMMA (unless the comma was inside a quoted string, in which case it would return a T_STRING.

There is a little bit of magic going on detecting the operators: when we discover the string "eq", the lexer checks if there is a constant called T_EQ, and if so, it will use that token, otherwise the string is just a string. This is here so we don't have to define each operator separately.

Another tricky thing is that we check quoted strings for dates. If the string can be parsed as a DateTime, we consider the string as a T_DATE, otherwise, it's just a T_STRING.

Parsing the tokens

Now that we can lex our language into tokens, it's time to make sense of these tokens. According to our BNF, a simple_query_expression must start with a T_IDENTIFIER followed by a T_OPERATOR. The parser is where this syntax checks will occur and where it will return errors identifying the error (for instance: it can return "operator expected" when a T_IDENTIFIER is followed by a T_DATE instead of a T_OPERATOR.

Here is a snippet of our parser:

    /**
     *  simpleQueryFactor:= identifier operator operand | '(' QueryExpression ')'
     *
     * @return Node
     * @throws ParseException
     */
    public function simpleQueryFactor(): Node
    {
        $lookaheadType = $this->lexer->lookahead['type'];
        if ($lookaheadType === Lexer::T_OPEN_PARENTHESIS) {
            $this->lexer->moveNext();

            $expr = $this->QueryExpression();
            $this->mustMatch(Lexer::T_CLOSE_PARENTHESIS);
            return $expr;
        }

        $identifier = $this->identifierExpression();
        $operator = $this->operatorExpression();

        switch ($operator->getOperator()) {
            case 'in':
            case 'ni':
                // In and NI must be arrays
                $this->mustMatch(Lexer::T_OPEN_BLOCK);
                $operand = $this->arrayExpression();
                $this->mustMatch(Lexer::T_CLOSE_BLOCK);
                break;
            case 'gt':
            case 'gte':
            case 'lt':
            case 'lte':
                // These can only be numbers or dates, not strings
                $operand = $this->numberOrDateExpression();
                break;
            case 'contains':
            case 'eq':
            case 'ne':
            default:
                $operand = $this->operandExpression();
                break;
        }

        return new SimpleQueryNode($identifier, $operator, $operand);
    }

    /**
     * identifier := [A-Z0-9_.]+
     *
     * @return IdentifierNode
     * @throws ParseException
     */
    public function identifierExpression(): IdentifierNode
    {
        $this->mustMatch(Lexer::T_STRING, 'identifier');
        $identifier = $this->lexer->token['value'];

        return new IdentifierNode($identifier);
    }

(note that the BNF is a bit different that we discussed). Inside the simpleQueryFactor function we check the syntax for a simple_query_expression and generate an AST node. We check for parenthesis and if found, we parse the expression in between. The "mustMatch" function checks if the next token is actually a T_OPEN_PARENTHESIS or T_CLOSE_PARENTHESIS so it will throw a syntax error when we forget a closing ), or have an uneven number of open and close parenthesis.

The syntax rules for a query depends a bit depending on each operator: the in and ni operator MUST have an array as an operand, while gt, gte etc must have either a number or a date.

If everything follows our rules, we return a SimpleQueryNode which is an AST node with an identifier, an operator and operand.

the identifierExpression is a bit simpler: it will check if the value matches a string (we don't have separate tokens for identifiers), and returns an identifier AST node.

AST nodes

The AST nodes itself are simple data objects without any logic inside. They are present to represent the abstract syntax tree:

class IdentifierNode extends Node
{
    /** @var string */
    protected $value;

    /**
     * IdentifierNode constructor.
     * @param string $value
     */
    public function __construct(string $value)
    {
        $this->value = $value;
    }

    /**
     * @return string
     */
    public function getValue(): string
    {
        return $this->value;
    }
}
<?php
class SimpleQueryNode extends Node
{
    /** @var IdentifierNode */
    protected $identifier;

    /** @var OperatorNode */
    protected $operator;

    /** @var Node */
    protected $operand;

    /**
     * SimpleQuery constructor.
     *
     * @param IdentifierNode $identifier
     * @param OperatorNode $operator
     * @param Node $operand
     */
    public function __construct(IdentifierNode $identifier, OperatorNode $operator, Node $operand)
    {
        $this->identifier = $identifier;
        $this->operator = $operator;
        $this->operand = $operand;
    }

    /**
     * @return IdentifierNode
     */
    public function getIdentifier(): IdentifierNode
    {
        return $this->identifier;
    }

    /**
     * @return OperatorNode
     */
    public function getOperator(): OperatorNode
    {
        return $this->operator;
    }

    /**
     * @return Node
     */
    public function getOperand(): Node
    {
        return $this->operand;
    }
}

Walking the abstract syntax tree

So the parsing process does two steps at once: checking the syntax of our language and generating an abstract syntax tree.

A query like this

content.full_name.value ne "John Doe" and (meta.sizelte 12000 or meta.blaat eq "foobar")

would give us the following AST:

<?php return 'AST\\QueryExpressionNode(
    "terms": array(
        "0" => AST\\SimpleQueryNode(
            "identifier": AST\\IdentifierNode(
                "value": \'content.full_name.value\',
            ),
            "operator": AST\\OperatorNode(
                "operator": \'ne\',
            ),
            "operand": AST\\LiteralNode(
                "value": \'John Doe\',
            ),
        ),
        "1" => AST\\QueryTermNode(
            "factors": array(
                "0" => AST\\SimpleQueryNode(
                    "identifier": AST\\IdentifierNode(
                        "value": \'meta.size\',
                    ),
                    "operator": AST\\OperatorNode(
                        "operator": \'lte\',
                    ),
                    "operand": AST\\LiteralNode(
                        "value": 12000,
                    ),
                ),
                "1" => AST\\SimpleQueryNode(
                    "identifier": AST\\IdentifierNode(
                        "value": \'meta.blaat\',
                    ),
                    "operator": AST\\OperatorNode(
                        "operator": \'eq\',
                    ),
                    "operand": AST\\LiteralNode(
                        "value": \'foobar\',
                    ),
                ),
            ),
        ),
    ),
)';

This is now a system that can be easily traversed over. We know that a QueryExpressionNode consists of one or multiple terms that have to be 'AND'ed with each other. A simpleQueryNode consists of an identifier, operator, and operand, which could be a literal or array, etc.

At this point, we don't care about syntax and rules because that all has been taken care of by the lexer and parser.

In our case, we use this AST to generate filters for our MongoDB data cluster, but we could just as easily use the same AST to output an ElasticSearch query or even an SQL query to fetch data from another data source. The possibilities are endless!

Conclusion

Dealing with complex queries doesn't have to be that hard. It's easy to fall into the trap of creating your own parser, but more often than not you end up with all kind of edge-cases that you may or may not handle, have issues when dealing with the complexities of subqueries, or and/or queries that may or may not be in parenthesis, etc.

Instead of writing your own parser system, use a proven and well-documented system like lexing and parsing: it's for a good reason that it's used in every (query) language out there and although it may look scary at first, you'll get the hang of it quickly.


Tags: