Parsing And Protocols
There are really two types of information packing in protocols and grammars and these boil down to data being length or lexically delimited. A parser implements a certain grammar. The grammar parses a certain pattern of tokens into memory. How do you know when you have completed parsing a token from the grammar? Either the grammar defines an explicit length for the token or it defines a beginning and end signifier (either through a lower order lexical grammar or through a dictionary)
For example
Length Explicit
[5][H][e][l][l][o]
From this we have
Hello
In this grammar we start with the first byte defining the number of bytes to follow. This is my most favorite method of packing because it allows you to allocate the minimum amount of memory on the receiving end.. which in MetaWrap is often a thin client . This runs into problems if you want to pack things in a hierarchy. Eg.
[12][5][H][e][l][l][o][5][W][o][r][l][d]
When preparing this for transmission, you cant send the first byte until you know the length of every element in the hierarchy.
Lexical
[][h][e][l][l][o][]
We have a starting character for a string and a terminating character. We start with the []and keep grabing data until we get the [].We dont know when we will get the [].So we cant prepare the correct buffer length.
This is why I prefer Length Explicit packing and grammars. They scale better. This is why MetaWrap uses this as its primary data packing and transmission format.
One Or Two Passes?
The XPath 1.0 Specification states that
Expressions are parsed by first dividing the character string to be parsed into tokens and then parsing the resulting sequence of tokens. Whitespace can be freely used between tokens. The tokenization process is described in [3.7 Lexical Structure]. |
This seems to suggest that it should be performed in two passes, but from reading the rest of the specification, I see no reason why it can’t be performed in one pass.
Update:
The XPath 2.0 Specification is a bit more explicit
During the static analysis phase, the XPath expression is parsed into an internal representation called the operation tree (step SQ1 in Figure 1). A parse error is raised as a static error.[err:XP0003] The static context is initialized by the implementation (step SQ2). The static context is used to resolve type names, function names, namespace prefixes and variable names. |
XPath 1.0 and 2.0 are very different creatures. Xpath 1.0 is a path based grammar similar to a URI, and can in fact be parsed in a single pass without too much effort. XPath 2.0 however is much more complicated and is a fully fledged language in its own right and as such probably deserves a multiple pass parser.
Some background for anyone who’s interested:
Formal language theory and parsing techniques taught in computer science dictate that parsing should be done in several stages.
The first stage is usually to transform the context free grammar (BNF/EBNF) which represents the language to be parsed so that it minimises lookahead, the ideal outcome being that the grammar becomes LL(1), or single lookahead. An LL(1) grammar can be used to write a recursive descent parser or used in conjunction with an automated tool to produce a table driven parser.
The next stage, called lexical analysis breaks down the input into tokens which can be handled more easily by the later stages of the parser. The lexer often consists of a scanner which scans for symbols, and a recogniser which tracks the state of the symbols found so far and outputs tokens. The programmer uses the tokens to produce an abstract syntax tree (AST) which represents the information that has been parsed.
The AST is then used to perform syntaxctical, and semantic analysis. After each stage the AST may be modified, pruned, or optimised in various ways until eventually it is used to either emit code (perhaps assembler or bytecode) or create an in memory representation (object model) of the information that was parsed.
The formal processes are unfortunately often a very inefficent way of implementing a parser, but there are however huge benefits that can be gained by breaking the process down into multiple passes. A language compiler which implements a several pass process is much easier to maintain since the definition of the language can change without necesitating a complete re-write, often only the tokeniser or scanner need change. Similarly, the AST can be decorated, optimised, or used to emit many different forms of output depending on the requirements of the platform.
The W3C has recently gotten much better at providing good grammers for their standards, infact some (such as XPath 2.0) are LL(1) already and can be immediately translated into code.
http://www-106.ibm.com/developerworks/java/library/x-javacc1.html
http://www.cs.may.ie/~jpower/Courses/parsing/
http://www.cse.unsw.edu.au/~billw/cs9414/notes/nlp/grampars.html
http://www.cs.usfca.edu/~parrt/course/652/lectures/formal.language.html
Cool Thanks for that. My standard approach has always been to parse/lex and build the AST in one pass and then to interpret or emit based on the AST.
I’m thinking that what the world needs now is something that can parse a W3 SPEC – extract all the BNF and build a set of C++ classes for parsing and building the AST for that spec. 🙂