Shared Infrustrature for W3 Specification Parsing – Part I

I have defined a new module for MetaWrap which consists of 4 Logical groups of classes. MwParser, MwParserBNF* and MwParserCST and MwParserAST.

All the new W3 Specs seem to be heading towards shared definitions using BNF. (eh XQuery, XPath2.0 and XSLT2.0) . These new modules are a way by which code within MetaWrap that implements the parsing and interpretation these specifications can self generate parsers and Abstract Syntax Trees while sharing a majority of the infrustrature.

By following this path it should be possible to shrink the W3 parsers into a single code base so that parsing code is not replicated as it is at the moment.

My initial approach is to just implement stub classes here so that the data flow can be worked out properly and I can gather the rest of the requirements. My main aim is still to develop XPath as quickly as possible.

MwParserEBNF

MwParserEBNF implements a EBNF Grammar. This consists of a group of classes that can be overridden to specialise the parsing and generation of Abstract Syntax Trees. It is the responsibility of the specialised MwParserEBNF* classes to define what is generated in the final AST. The implementations of the MwParserEBNF* classes contain hints (Grammar Notes) to aid in tokenisation.

A.1.1 Grammar Notes

This section contains general notes on the EBNF productions, which may be helpful in understanding how to create a parser based on this EBNF, how to read the EBNF, and generally call out issues with the syntax. The notes below are referenced from the right side of the production, with the notation: /* gn: <id> */.

grammar-note: parens: A look-ahead of one character is required to distinguish function patterns from a QName followed by a comment. For example: address (: this may be empty :) may be mistaken for a call to a function named “address” unless this lookahead is employed.
grammar-note: lt: Token disambiguation of the overloaded “<” pattern is defined in terms of positional lexical states. The “<” comparison operator can not occur in the same places as a “<” tag open pattern. The “<” comparison operator can only occur in the OPERATOR state and the “<” tag open pattern can only occur in the DEFAULT state. (These states are only a specification tool, and do not mandate an implementation strategy for this same effect.)
grammar-note: leading-lone-slash: The “/” presents an issue because it occurs both in a leading position and an operator position in expressions. Thus, expressions such as “/ * 5” can easily be confused with the path expression “/*”. Therefore, a stand-alone slash, in a leading position, that is followed by an operator, will need to be parenthesized in order to stand alone, as in “(/) * 5”. “5 * /”, on the other hand, is fine.
grammar-note: comments: Expression comments are allowed inside expressions everywhere that ignorable white space is allowed.
grammar-note: xml-version: The general rules for [XML 1.1] vs. [XML 1.0], as described in the A.2 Lexical structure section, should be applied to this production.

Notes from XPath2.0 Specification about grammar hints

Definition : Context Free Grammar

http://www.fact-index.com/c/co/context_free_grammar.html

Good simple eplanation of LL vs LR of parsing here.

MwParser implements a parser and is initialized with a BNF. A MwParser can then take input and build a MwCST (Concrete Syntax Tree) or a MwAST (Abstract Syntax Tree). If The BNF classes are specialised then they can generate a specialized and or decorated AST/CST nodes. Parsing implemented by the walking of state tables. (fig 1) generated from the target grammar.

MwParserCST imlements an Concrete Syntax Tree which is the output of MwParser and its construction is performed by specialised MwBNF* classes.

Definition: Concrete Syntax Tree

A parse tree is a record of the rules (and tokens) used to match some input text.

A parse tree is a labeled tree in which the interior nodes represent nonterminals, leaf nodes represent terminals, and the children of interior nodes represent the replacement of the associated nonterminal

A concrete syntax tree contains all the text that has been parsed, so also all the white space (spaces, tabs, newlines) and code comments. The original text can be completely reproduced from a concrete syntax tree.

MwParserAbstractor translates the Concrete Syntax Tree to an Abstract Syntax Tree

MwParserAST Implements an Abstract Syntax tree.

ASTs are superior because they record the structure of the input not the artifacts of your grammar; consequently they are insensitive to changes in your input grammar, easier to manipulate during translation, and are much smaller.

There are two kinds of parse trees: (1) Abstract Syntax Tree eliminates non-interesting terminals and non-terminals, and (2) Concrete Syntax Tree preserves all symbols and exactly represents the input source. (fig 2)

Definition: Abstract Syntax Tree

A abstract syntax tree records the structure of the input and is insensitive to the grammar that produced it.

It is the role of an AST to provide a clean interface between parser and later phases.

A C++ class based parser that uses a runtime generated BNF tree suffers in speed in that it must search a grammar tree to identify which token is next instead of being able to use a tightly constrained switch operation.

So for the following BNF Grammar

Binary ::= Zero | One;

Zero ::= ‘0’;

One ::= ‘1’;

Lets say the next character is ‘1’

In a grammar tree when calling Binary::parse(char p c) we need to ask both Zero and One if ‘1’ matches them. For small numbers of terms it should be reasonably fast. But we always have the option of emitting custom C++ classes and implementations of Binary::parse(char p c).

Initial Requirements include…

A generic MwParser should be able to take in a MwBNFGrammar and build an MwAST.

A generic MwParser should be able to take in a MwBNFGrammar and emit code for a specialised MwParser that can build the MwAST.

MwBNFGrammar must be able to intelligently handle constants such that we dont build dumb AST. For example

number ::= {dgt} [‘.’ {dgt}] [(‘e’|’E’) [‘-‘] {dgt}];

dgt ::= ‘0’|’1’|’2’|’3’|’4’|’5’|’6’|’7’|’8’|’9′;

For input 1.3e-8

The AST tree for this should be

number(1.3e-8)

and not

number(dgt(1),.,dgt(3),e,-, dgt(8))

So we really have two levels of BNF tokenisation. We somehow gave to mark certain assignments as being atomic constants such that if all they consist of is constants, then we are just building a string. The boundary between these levels defines where Lexical processing and parsing begins.

But my initial requirements are simply what can get me parsing simple XPath ASAP using the minimum part of this framework such that it can grow as I require it.