Thursday, September 22, 2005

Extending the expression parser

Earlier, I wrote about a C# parser for simple expressions. To see that I'm not just poormouthing, consider a couple of apparently simple examples that are beyond its reach:

  • f(1)+a+b+m(x,y)
  • ff(d)+a+b+mm(x,y)

If we write down an equivalent EBNF grammar, the sources of the limitations are clear:

  <entire input> ::= <expression> <end of input>
  <expression>   ::= <term> + <expression> | <term>
  <term>         ::= <factor> * <term> | <factor>
  <factor>       ::= IDENTIFIER [ ( <expression> ) ]
                   | ( <expression> )

Note that the grammar doesn't recognize integers as factors, only handles multiplication and addition, and only handles functions of one argument!

The good news is that these problems are easy to remedy. First, we need to teach the parser that commas are operators:

Lexer l = new Lexer(
  expr,
  new Token(Tokens.Terminator, @";\n*|\n+"),
  new Token(Tokens.Integer,    @"\b\d+\b"),
  new Token(Tokens.Identifier, @"[A-Za-z_]\w*"),
  new Token(Tokens.Operator,   @"\*\*|[-=+*/(),]"),
  new Token(Tokens.Whitespace, @"\s+", Lexer.Ignore));

To recognize literal integers as factors and parse functions of arbitrarily many arguments, we adjust factor:

  Parser open  = new LookFor(Tokens.Operator, "(");
  Parser close = new LookFor(Tokens.Operator, ")");
  Parser comma = new LookFor(Tokens.Operator, ",");

  Parser arglist = new Concatenate(
      open,
      exprstub, new Star(new Concatenate(comma, exprstub)),
      close);

  factor = new Alternate(
      new T(
          new Concatenate(
              new LookFor(Tokens.Identifier),
              new Alternate(arglist, new Nothing())),
          new ValuesTransform(TagVarOrFunction)),
      new T(new Concatenate(open, exprstub, close),
            new ValuesTransform(StripParens)),
      new LookFor(Tokens.Integer));

If you don't care about operator precedence, you might do away with the term/factor distinction to make adding subtraction, division, exponentiation, and other operators trivial.

With these changes, the parser can now handle the examples above and relative doozies such as f1(f2(a,b), f3(f4(x,y,z))).

No comments: