I have defined a new module for MetaWrap which consists of 4 Logical groups of classes. MwParser, MwParserBNF* and MwParserCST and MwParserAST.
All the new W3 Specs seem to be heading towards shared definitions using BNF. (eh XQuery, XPath2.0 and XSLT2.0) . These new modules are a way by which code within MetaWrap that implements the parsing and interpretation these specifications can self generate parsers and Abstract Syntax Trees while sharing a majority of the infrustrature.
By following this path it should be possible to shrink the W3 parsers into a single code base so that parsing code is not replicated as it is at the moment.
My initial approach is to just implement stub classes here so that the data flow can be worked out properly and I can gather the rest of the requirements. My main aim is still to develop XPath as quickly as possible.
MwParserEBNF
MwParserEBNF implements a EBNF Grammar. This consists of a group of classes that can be overridden to specialise the parsing and generation of Abstract Syntax Trees. It is the responsibility of the specialised MwParserEBNF* classes to define what is generated in the final AST. The implementations of the MwParserEBNF* classes contain hints (Grammar Notes) to aid in tokenisation.
A.1.1 Grammar NotesThis section contains general notes on the EBNF productions, which may be helpful in understanding how to create a parser based on this EBNF, how to read the EBNF, and generally call out issues with the syntax. The notes below are referenced from the right side of the production, with the notation: /* gn: <id> */.
|
Notes from XPath2.0 Specification about grammar hints
Definition : Context Free Grammar |
Good simple eplanation of LL vs LR of parsing here.
MwParser implements a parser and is initialized with a BNF. A MwParser can then take input and build a MwCST (Concrete Syntax Tree) or a MwAST (Abstract Syntax Tree). If The BNF classes are specialised then they can generate a specialized and or decorated AST/CST nodes. Parsing implemented by the walking of state tables. (fig 1) generated from the target grammar.
MwParserCST imlements an Concrete Syntax Tree which is the output of MwParser and its construction is performed by specialised MwBNF* classes.
Definition: Concrete Syntax Tree A parse tree is a record of the rules (and tokens) used to match some input text. A parse tree is a labeled tree in which the interior nodes represent nonterminals, leaf nodes represent terminals, and the children of interior nodes represent the replacement of the associated nonterminal A concrete syntax tree contains all the text that has been parsed, so also all the white space (spaces, tabs, newlines) and code comments. The original text can be completely reproduced from a concrete syntax tree. |
MwParserAbstractor translates the Concrete Syntax Tree to an Abstract Syntax Tree
MwParserAST Implements an Abstract Syntax tree.
ASTs are superior because they record the structure of the input not the artifacts of your grammar; consequently they are insensitive to changes in your input grammar, easier to manipulate during translation, and are much smaller.
There are two kinds of parse trees: (1) Abstract Syntax Tree eliminates non-interesting terminals and non-terminals, and (2) Concrete Syntax Tree preserves all symbols and exactly represents the input source. (fig 2)
Definition: Abstract Syntax Tree
|
A C++ class based parser that uses a runtime generated BNF tree suffers in speed in that it must search a grammar tree to identify which token is next instead of being able to use a tightly constrained switch operation.
So for the following BNF Grammar
Binary ::= Zero | One; Zero ::= ‘0’; One ::= ‘1’; |
Lets say the next character is ‘1’
In a grammar tree when calling Binary::parse(char p c) we need to ask both Zero and One if ‘1’ matches them. For small numbers of terms it should be reasonably fast. But we always have the option of emitting custom C++ classes and implementations of Binary::parse(char p c).
Initial Requirements include…
A generic MwParser should be able to take in a MwBNFGrammar and build an MwAST.
A generic MwParser should be able to take in a MwBNFGrammar and emit code for a specialised MwParser that can build the MwAST.
MwBNFGrammar must be able to intelligently handle constants such that we dont build dumb AST. For example
number ::= {dgt} [‘.’ {dgt}] [(‘e’|’E’) [‘-‘] {dgt}];
|
For input 1.3e-8
The AST tree for this should be
number(1.3e-8)
and not
number(dgt(1),.,dgt(3),e,-, dgt(8))
So we really have two levels of BNF tokenisation. We somehow gave to mark certain assignments as being atomic constants such that if all they consist of is constants, then we are just building a string. The boundary between these levels defines where Lexical processing and parsing begins.
But my initial requirements are simply what can get me parsing simple XPath ASAP using the minimum part of this framework such that it can grow as I require it.