 Monday, December 04, 2006
This is old news, but I'm thinking of getting into the next stage of this project. My "state table only" parser started passing all 70 of its test-cases some time in the middle of this year.
I've just been too busy to blog about it. I'm getting a few things out of the way before some other projects shift into gear.

Or for something more amusing...
http://blog.metawrap.com/blog/content/binary/MetaWrap/Parser/test66.dot.png
I'll be applying this to real time XML schema parsing and validation, GL shader construction, protocol generators and data-flow based computing.
My algorithm builds a near optimal state transition table in one pass - the next step is adding some of the standard modifiers such as the Kleene Closure and other standard EBNF unary operations. The KC will be the hardest because its at the root of one of the hardest greedy grammar problems.

( a | b )* abb
The others will be fairly trivial. My KC solution will not be perfect; I'm limiting myself on purpose. Its not my aim to end up with yet another general solution. I want the fastest for my problem space. I'm limiting myself to grammars that don't contain the ambiguities that lead to greediness. My theory is that they don't end up in real world situations often, if they do they can be worked around by a simple re-factoring, or are sign that the grammar is messy.
More about this in previous posts on the topic.
 Friday, June 24, 2005
[Rescued from the old blog - Thanks to Damian]
I'm trawling through conventional theory at the moment to see if my approach shows up on the radar anywhere - if it does then I can probably get some hints and tips on what could make it better or if I am doing something wrong.
My algorithm skips the whole NFA -> DFA transformation stage and completes the build of what should be an optimal DFA in what I hope will be one pass through the grammar. This is only applicable to context free grammars where each lexeme is unambiguous.
The data structure for I use maps transitions into groups of states. This is evident in my description diagrams where there is a more than one dot notation for the applicable lexeme eg. http://set-top.net/tests/dot/test13.png . The aim has always been to preserve the state structure and lexeme context. More importantly it allows transitions to be reduced by effectively allowing the overloading of transitions, for each state group is effectively polymorphic as one or more potential states.
I have reasons for this approach that will become self evident when I explain the endgame of the implementation.
This is the first of three posts....
Finite State Automata In General
Formally, an automaton is represented by 5tupple
- Q is a finite set of states.
- ∑ is a finite set of symbols, that we will call the alphabet of the language the automaton accepts.
- δ is the transition function, that is
- S0 is the start state, that is, the state in which the automaton is when no input has been processed yet (Obviously, S0∈ Q).
- F is a set of states of Q (i.e. F⊂Q), called accept states.
With all this, we can now say that the language L accepted by an DFA automaton M is:
language.png
Deterministic Finite State Automata (DFA)
The term deterministic can mean two things:
- for recognizers, it means that for each state there cannot be more than one out transitions with the same label, there is one initial state, and there are no epsilon transitions;
- for transducers, it usually means that in addition there can be no two out transitions with the same label on the "input" level, and different labels on the "output" level.
A DFA has no ε transitions
DFA’s transition function is single-valued
For a DFA, the action of automaton on each input symbol is fully determined which provides an obvious table-driven implementation.
The following state diagram represents the regular language grammar ( a | b )* abb as a deterministic finite state automata. DFA.

( a | b )* abb
We can see that this is deterministic because
- All states have unambiguous transitions (from state to state, for a given character you will always end up in the same state).
- No ε (empty string) transitions
Non-deterministic Finite State Automata (NFA)
A NFA has..
-
A set of states a start state and at least one accepting state.
-
Arrows connecting states labeled by input symbols, or ε (which does not consume input)
-
More than one arrow leaving a state may have same label
For an NFA..
- an automaton may have choice on each step
- automaton accepts a string if there is any way to make choices to arrive at accepting state / every path from start state to an accept state is a string accepted by automaton
- not obvious how to implement efficiently!
The following state diagram represents the regular language ( a | b )* abb as a non-deterministic finite state automata (NFA).

( a | b )* abb
We can see that this is non deterministic because of the following.
Alternating Finite Automata (AFA)
In automata theory, an alternating finite automaton (AFA) is a non-deterministic finite automaton whose transitions are divided into existential and universal transitions. Let A be an alternating automaton.
- For a transition afat1.png, A nondeterministically chooses to switch the state to either q1 or q2, reading a.
- For a transition afat2.png, A moves to q1 and q2, reading a.
Note that due to the universal quantification a run is represented by a run tree. A accepts a word w, if there exists a run tree on w such that every path ends in an accepting state.
A basic theorem tells that any AFA is equivalent to an non-deterministic finite automaton (NFA) by performing a similar kind of powerset construction as it is used for the transformation of a NFA to a deterministic finite automaton (DFA). This construction converts an AFA with k states to a NFA with up to 2k states.
Finite State Transducers (FST)
A finite state transducer (FST) is a finite state machine with two tapes.
A finite state transducer is defined in the same way as an FSM except that:
- In addition to the input alphabet, an output alphabet is specified
- Typically there is no notion of acceptance of a string, so no accepting states are specified
- Each transition has an associated output string (i.e., string over the output alphabet)
A finite state transducer is drawn in the same way as an FSM except that:
- Typically there is no notion of acceptance of a string, so no state is designated as accepting
- Each transition is marked with a symbol or set of symbols as in the case of an FSM, followed by a forward slash, followed by an output string
Termination is decided not on a reaching a particular state, but on a particular transition between states.
Contrast this with an ordinary finite state automaton, which has a single tape. An automaton can be said to recognize a string if we view the content of its tape as input. In other words, the automaton computes a function that maps strings into the set {0,1}. Alternatively, we can say that an automaton generates strings, which means viewing its tape as an output tape. On this view, the automaton generates a formal language, which is a set of strings. The two views of automata are equivalent: the function that the automaton computes is precisely the characteristic function of the set of strings it recognized. The class of languages generated by finite automata is known as the class of regular languages.
The two tapes of a transducer are typically viewed as an input tape and an output tape. On this view, a transducer is said to transduce (i.e., translate) the contents of its input tape to its output tape, by accepting a string on its input tape and generating another string on its output tape. It may do so nondeterministically and it may produce more than one output for each input string. A transducer may also produce no output for a given input string, in which case it is said to reject the input. In general, a transducer computes a relation between two formal languages. The class of relations computed by finite state transducers is known as the class of rational relations.
Finite State Transducers are typically useful in NLP (Natural Language Processing) research.

(a | b)* aab
Completes on output of 1
Normal Non-deterministic Finite State Automata (NNFA)
A normal NFA (abbr. NNFA) is an NFA in which all edges leading into the same state have the same label. Thus, it is convenient to label states instead of edges. Visually this has a strong correpondence with the language is recognises.

(a | b) * abb
Compressed Normal Non-deterministic Finite State Automata (CNNFA)
http://www.cs.nyu.edu/web/Research/Theses/chang_chia-hsiang.pdf

(a | b) * abb
Comparison between NFAs and DFAs
NFAs are more compact - they generally require fewer states to recognize a language.
It obvious that a DFA can be simulated with an NFA, because the set of all DFAs is a subset of the set of all NFAs. (Proof)
Its less obvious that every NFA can be simulated with a DFA, and NFA has an equivalent DFA. It is possible to convert an existing NFA into a DFA for the purpose of implementing a simpler machine. This can be performed using the powerset construction which may lead to an exponential raise in the number of necessary states. Such determinization however is necessary to build a complement automaton.
[UPDATE] - corrected an image that I mixed up when resizing them. The First NFA image was a repeat of the DFA.
I'm trawling through conventional theory at the moment to see if my approach shows up on the radar anywhere - if it does then I can probably get some hints and tips on what could make it better or if I am doing something wrong.
This is the second of three posts.
Conventional wisdom is that you perform the following steps.
-
Write down the regular language for the input grammar.
-
-
-
Systematically shrink the DFA.
-
Turn it into code or a lookup table.
Step 1) Write Down The Regular Language
( a | b )* abb
Step 2) Build An NFA (Using Thompson's Construction aka Inductive Construction)
For a Given Regular language, Building the NFA is generally done using Thompson's construction.
First parse a regular language into constituent subexpressions.
eg. for ( a | b )* abb you would produce the following expression tree...

Now traverse the tree depth first and construct the NFA using the following rules.
a) For ε (no input) construct the following pattern

b) For a single character, 'a' construct the following pattern.

Now we get into the rules for compositing TFA's together with the various logical operations.
c) For constructing a composite NFA from two pre-existing NFAs s and t in the form of (s | t) . Let NFA(s) and NFA(t) be the NFAs for regular expressions s and t. Construct the following pattern.

s|t
Newly added nodes are in orange.
d) For constructing a composite NFA from two pre-existing NFAs s and t in the form of st . Let NFA(s) and NFA(t) be the NFAs for regular expressions s and t. Construct the following pattern.

st
Merge final state of NFA(s) and initial state of NFA(t) (in orange)
e) For constructing an NFA in the patten of a Kleene Star (a sequence of zero or more elements) from an existing NFAs s in the form of s* . Let NFA(s) be the NFA for the regular expressions s. Construct the following pattern.

s*
Newly added nodes are in orange.
f) For constructing an NFA in the patten of a Kleene Cross (a sequence of at least one element) from an existing NFAs s in the form of s+ . Let NFA(s) be the NFA for the regular expressions s. Construct the following pattern.

s+
Newly added nodes are in orange.
g) For constructing an NFA in the patten of an option (zero or one element) from an existing NFAs s in the form of s? . Let NFA(s) be the NFA for the regular expressions s. Construct the following pattern.

s?
Newly added nodes are in orange.
Our example language of ( a | b ) * abb would parse as

( a | b ) * abb
However the following is what you get if a human designs it.

A more dramatic example of the inefficiency is the NFA generated for a ( b | c ) * would look like

a ( b | c ) *
Generated by Thompson's Method
However the following is what you get if a human designs it.

a ( b | c ) *
Generated by Common Sense
But its good to know we can automate the generation - even if the result is not optimal.
Step 3) Convert the NFA into a DFA (Determinisation)
A Commonly used algorithm - subset construction.
From the point of view of the input, any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character. Thus states which are connected by an ε-transition will be represented by the same states in the DFA.
If it is possible to have multiple transitions based on the same symbol, then we can regard a transition on a symbol as moving from a state to a set of states (ie. the union of all those states reachable by a transition on the current symbol). Thus these states will be combined into a single DFA state.
Ni is a state from the source NFA
∑ is a finite set of symbols, that we will call the alphabet of the language the automaton accepts
Move( Si , a ) is the set of states reachable from Si by a. This function takes a state and a character, and returns the set of states reachable by one transition on this character.
ε-closure( Si ) is the set of states reachable from Si by ε . This function takes a state and returns the set of states reachable from it based on (one or more) ε-transitions. Note that this will always include the state tself. We should be able to get from a state to any state in its ε-closure without consuming any input.
Start state derived from S0 of the NFA Take ε-closure(S0) Take the image of S0, Move(S0, ά) for each ά ∈ Σ, and take its ε-closure Iterate until no more states are added
S0 := ε-closure({N0}) S := { S0 } W := { S0 } while ( W != Ø ) select and remove s from W for each ά ∈Σ t := ε-closure(Move( s,ά )) T[ s , ά ] := t if ( t not in S ) then add t to S add t to W
Output is a set of states S and a set of transitions T.
Worst case is an exponential increase in the number of states and transitions.
A list of conventional algorithms.
Ted Leslie, Efficient Approaches to Subset Construction, masters thesis, Computer Science, University of Waterloo, 1995.[bibtex]
J. Howard Johnson and Derick Wood, Instruction Computation in Subset Construction, in Automata Implementation, Darrell Raymond and Derick Wood and Sheng Yu (eds.), Lecture Notes in Computer Science 1260, pp. 64-71, Springer Verlag, 1997.[bibtex]
Gertjan van Noord, The Treatment of Epsilon Moves in Subset Construction, Computational Linguistics, 26(1), pp. 61-76, April 2000. Available from www.let.rug.nl/~vannoord/papers/[bibtex]
Mehryar Mohri, Finite-State Transducers in Language and Speech Processing, Computational Linguistics, 23(2), pp. 269-311, June 1997.[bibtex]
Mehryar Mohri, On some applications of finite-state automata theory to natural language processing, Natural Language Engineering, vol. 2, pp. 61-80, 1996. Originally appeared in 1994 as Technical Report, institut Gaspard Monge, Paris.[bibtex]
Mehryar Mohri and Fernando C.N. Pereira and Michael Riley, A Rational Design for a Weighted Finite-State Transducer Library, Automata Implementation. Second International Workshop on Implementing Automata, WIA '97, Lecture Notes in Computer Science 1436, Springer Verlag, 1998.[bibtex]
Emmanuel Roche and Yves Schabes, Deterministic Part-of-Speech Tagging with Finite-State Transducers, Computational Linguistics, 21(2), pp. 227-253, June, 1995.[bibtex]
Step 4) Systematicaly Shrink The DFA (Minimisatiion)
A list of conventional algorithms.
The fastest (O(|Q|log|Q|), where |Q| is the number of states) minimization algorithm (by John E. Hopcroft) is described in:
John E. Hopcroft, An n log n algorithm for minimizing the states in a finite automaton, in The Theory of Machines and Computations, Z. Kohavi (ed.), pp. 189-196, Academic Press, 1971.[bibtex]
David Gries, Describing an Algorithm by Hopcroft, Acta Informatica, vol. 2, pp. 97-109, 1973.[bibtex]
A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley Publishing Company, 1974.[bibtex]
The third item on the list (i.e. the book) is probably the easiest to find, but beware that the description is somewhat hidden. Look for "partitioning". The problem of minimizing states in the set of states of an automaton is equivalent to finding the coarsest partition of the set of states, consistent with the initial partition into final and non-final states, such that if states s and t are in one block of the partition, then states delta(s, a) and delta(t, a) are also in one block of the partition for each input symbol a.
An algorithm that is considered by some people to be equally fast for frequently encountered data is Brzozowski's algorithm. It minimizes an automaton by performing reversal and determinization twice. The subset construction (determinization) is O(2|Q|), where Q is a set of states, but it depends on input (see master thesis by Ted Leslie in the Determinization section). The second paper here describes complexity of operations needed to implement Brzozowski's algorithm.
J. A. Brzozowski, Canonical regular expressions and minimal state graphs for definite events, in Mathematical Theory of Automata, Volume 12 of MRI Symposia Series, pp. 529-561, Polytechnic Press, Polytechnic Institute of Brooklyn, N.Y., 1962.[bibtex]
C. Câmpeanu, K. Culik II, and Sheng Yu, State Complexity of Basic Oprations on Finite Languages, utomata Implementation. Proceedings of 4th International Workshop on Implementing Automata, WIA'99, Oliver Boldt and Helmut Jürgensen (Eds.), Postdam, Germany, LNCS 2214, Springer, 2001.[bibtex]
The following algorithm is different from all others because partial results can be used. It tests pairs of states for equivalence. The first paper describes a version with exponential time complexity. The second paper adds modifications that make it almost O(|Q|2).
Bruce W. Watson, An incremental DFA minimization algorithm, Finite State Methods in Natural Language Processing 2001, ESSLLI Workshop, August 20-24, Helsinki, Finland, 2001.[bibtex]
Bruce B. Watson, Jan Daciuk, An efficient incremental DFA minimization algorithm, Natural Language Engineering, Volume 9, Issue 01, March 2003. [bibtex]
The Hopcroft-Ullman is one of the best known. Its complexity is O(|Q|2), and it also requires O(|Q|2) memory, so it does not seem to be suitable for large automata. Pairs of states are considered, and the automaton has to be complete (i.e. it must have a sink, or dead state). A pair of states is distinguished if one of the states in the pair is final, and the other non-final. A pair of states is also distinguished if transitions labeled with the same symbol lead to another pair of states that was found distinguishable. For each pair of states, a list of pairs is found that are targets of transitions. Initially, pairs containing one final, and one non-final state are marked as distinguishable. Then all pairs that are on lists associated with those pairs are marked, and so on until no new pairs are marked. Pairs that are not marked are equivalent.
John E. Hopcroft and Jefferey D. Ullman, Introduction to Automata Theory, Languages, and Computation, Adison-Wesley Publishing Company, Reading, Massachusets, USA, 1979.[bibtex]
The Aho-Sethi-Ullman algorithm has the same complexity as the Hopcroft-Ullman algorithm, i.e. O(|Q|2), but it needs only O(|Q|) memory. As in the Hopcroft algorithm, the set of states is partitioned, and the blocks are refined in each step, until no more divisions occur. Two states are put into different blocks if they have transitions leading to states in different partitions. In difference to the Hopcroft algorithm, forward transitions are used for that purpose (the Hopcroft algorithm uses back transitions to split blocks with transitions leading to the current block).
Alfred V. Aho and Ravi Sethi and Jeffrey D. Ullman, Compilers. Principles, Techniques and Tools, Addison Wesley, 1986.[bibtex]
For acyclic automata, there are better methods than those given above. The minimization part of incremental and semi-incremental construction algorithms can be used.
Transducers
Mehryar Mohri, Compact Representations by Finite-State Transducers, ACL'94, Morgan Kaufmann, San Francisco, California, USA, 1994.[bibtex]
Step 5) Generate Code
Trivial Exercise
[Rescued from the old blog - Thanks to Damian]
The algorithm and data structure are designed to build a near optimal deterministic finite state automata for a regular language with a single pass, unlike the current three pass 'Thompson Process'.
But even I was shocked when it spat this DFA out for the regular language ('A' | 'A' 'B') 'D'.
DFA encapsulated NFA for ('A' | 'A' 'B') 'D'.
('A' | 'A' 'B') 'D' can also be expressed as 'A' ['B'] 'D' - and it is this reduction that has been built.
If I was to take the constructed DFA and convert it back into a regular language I would in effect get the simplified 'A' ['B'] 'D'
Instead of being slightly more split worldpath wise, the convergence really seems to be very strict.
The cardinal semantics of the state group is slightly different than I expected. The second state clearly contains three states, one of which is mutually exclusive to the other two - it essentially describes the 'states that you may be at this group state for'. I will try a thought experiment and try and work out if I can maintain a strict meaning of
 Monday, January 24, 2005
Speaking of old bands - one of my old bad techno songs has been released on http://www.openpodcast.org as The Rhinoceros Song - OpenPodcast.org #759 aka "Mastitis" - people used to just call it "The Rhinoceros Song" so it just kind of stuck :)
The Vogon stood up. "No, well you're completely wrong," he said, "I just write poetry music to throw my mean callous heartless exterior into sharp relief. ...
Been staring at the parser code out of the corner of eye for the last week. I have a fair idea of what needs to be done. I need a pattern that can keep track of the last state of all MwW3ParserLexerStates as they thread through all the MwW3ParserLexerStateGroups - the trick being that the last State in all State theads within a set of optional lexemes needs to be remembered so that the subsequent lexemes of approprate logical precedence can add their first State as an option into each of them and converge/reduce onto a Group for that lexeme.
Makes perfect sense to me at least .. I'm in the process of checking that each of these steps is feasible.
 Friday, January 14, 2005
 Thursday, January 13, 2005
I am now dual headed. We will see how it goes. I have the G4 and PC second display on the second monitor. The G4 stays up as long as I don't move much. (Long Story)
New test case - fixed a few bugs tonight. Testing all the basic stuff before I go any further.
http://www.metawrap.com/tests/dot/test14.png
 Wednesday, January 12, 2005
The million dollar question is...
Should a construction which consists of a simple option like "ABBB" | "CBBB" have a parse table that compresses and merges the last sequence of states into common groups?
http://www.metawrap.com/tests/dot/test12.png
In the same way that the start sequence is merged in "ABCDEZ" | "ABCDEX" This testcase demonstrates a clear reason for having the terminating state objects.
http://www.metawrap.com/tests/dot/test13.png
How about something like "ABBB" | "CBB"
I must sleep and ponder on the practicality. It would result in a more compressed parse table.. but at what cost?
 Tuesday, November 09, 2004
 Thursday, November 04, 2004
http://www.metawrap.com/tests/dot/test9.png
Minimal testcase for simplest production I could think of - with an at that un unhandled visitor type. (now implemented of course). The wonderful thing about applying such a complicated visitor pattern to this type of problem is that even though the first type is hard and laborious, the rest seem to quickly drop into place.
space => ' '
My misssion tonight - get rid of that superflouous end group before the terminator.
But this
space => '_' ' '
http://www.metawrap.com/tests/dot/test10.png
is of course just plain wrong.
This is better.
http://www.metawrap.com/tests/dot/test11.png
But we are left with the problem, How do we know that we are the last character of the lexeme and the last lexeme of a production with no further substitutions - and thus a potential terminator ε. This is non trivial to compute at every symbol. It may be better to gather all these dangling groups as part of garbage collect and make them all point to a single terminator group. I will investigate both paths and see what works best. But now for some sleep.
 Saturday, October 30, 2004
A formal grammar is an abstract structure that describes a formal language precisely: i.e., a set of rules that mathematically delineates a (usually infinite) set of finite-length strings over a (usually finite) alphabet.
A formal language is a set of finite-length character strings drawn from some finite alphabet.
A context-free grammar (CFG) is a formal grammar in which every production rule is of the form
- V → w
where V is a non-terminal symbol and w is a string consisting of terminals and/or non-terminals. The term "context-free" comes from the fact that the non-terminal V can always be replaced by w, regardless of in what context it occurs.
A formal language is context-free if there is a context-free grammar that generates it.
terminals belong to the set of strings of the alphabet of the language. They do not get replaced during the derivation of a sentence.
non-terminals or variables or syntactic-categories are replaced during the derivation. They represent arbitrary strings.
As the parser accepts a sequence of tokens, it determines, based on this information, when the grammar's respective rules are complete and verifies the syntactic correctness of the token sequence. The end result of the process is a derivation which represents the token sequence organized following the rules of the grammar.
A derivation of a string is a sequence of steps where nonterminals are expanded so that the string results. S ->
"if" E "then" S "else" S ->
"if" "2" "=" "2" "then" S "else" S ->
"if" "2" "=" "2" "then" "print" E "else" S ->
"if" "2" "=" "2" "then" "print" "2" "=" "2" "else" S ->
"if" "2" "=" "2" "then" "print" "2" "=" "2" "else" "print" E ->
"if" "2" "=" "2" "then" "print" "2" "=" "2" "else" "print" "2" "=" "2"
A leftmost derivation is one where every step expands the leftmost nonterminal. A rightmost derivation expands the rightmost nonterminal.
A derivation need not be rightmost or leftmost. Many derivations can result in the same syntax tree.
A derivation starting from the start symbol gives a member of the language (after all variables have been eliminated)
Every context gree language has a start symbol.
The start symbol represents all valid sentences in the language
The head (of a production) a single variable for context free langiuages in a similar notation for non-context-free languages, the head may contain more than one symbol. The body (of the production) a sequence of symbols, variables as well as terminals are allowed.
To infer that a sentence is in the language
Recursive Inference:
From body to head - Start from terminals and work up the tree to the start symbol
Derivation:
From head to body - Start from the start symbol and apply productions until you get the sentence
Regular Languages are a subset of Context Free Languages All regular languages are also context free
Definition of a Context Free Grammar
G = {V,T,P,S}
V = Finite set of variables
nT = Finite set of terminals
nP = Set of productions
S = Start symbol
For language
L = { 0n1n | n >= 1}
S -> 0 X 1
X -> ε
X -> 0 X 1
The grammar is
G = ( {S,X} , {0,1} , {(S,0X1),(X,ε),(X,0X1)},S )
The ancient Indian linguist Panini described Sanskrit using a context-free grammar. Recently, it has been suggested that a class of Tamil poetry called Venpa is governed by a context-free grammar.
Parsing
Its pretty obvious from this failed experiment that my original plan was the correct one.
http://set-top.net/tests/dot/test7.png
By associating an exact description of the potential state, but then using transitions wrt. the group of states, I've created something that makes logical sense given the underlying invisible group context - but provides a visual distortion. Even though the dot notation ."sy*ntax" | "synth" signifes that we have just parsed the first two characters of syntax, a potental transition via 'n' to ."syntax" | "syn*th" is shown which flips the path to another lexeme being the current focus. It should really show something like "sy*ntax" | "sy*nth"
This group based view is much better and shows the above state as "sy*ntax" | "synth" and "syntax" | "sy*nth"
http://set-top.net/tests/dot/test8.png
Now that I have a basic visualiser, its time to start using it as it was intended, to help develop and debug the rest of the parse table generator.
Looks like I'm creating an extra group at the end before the terminator. I really need to have a sentinel which is a terminator symbol (phi?).
Another good website on parsing theory.
http://www.cs.chalmers.se/Cs/Grundutb/Kurser/komp/2004/lectures/ccc05.html
 Monday, October 18, 2004
 Wednesday, September 22, 2004
I'm working on the garbage collector for my w3 spec parse table generator and from the first time run of its mark pass you can see the structure of the parse table for the production rule
word := "syntax" | "synth"
----------- build completed -------------
/==================== MwW3ParserLexerStateGroup::mark ===============================================
| /==================== MwW3ParserParsetable::Dump =================================================
| | 0073 's' 1
| \-------------------- MwW3ParserParsetable::Dump -------------------------------------------------
| /==================== MwW3ParserParsetable::mark =================================================
| | /==================== MwW3ParserLexerStateGroup::mark =========================================
| | | mark 1
| | | /==================== MwW3ParserLexerStateGroup::mark ======================================
| | | | /==================== MwW3ParserParsetable::Dump ========================================
| | | | | 0079 'y' 1
| | | | \-------------------- MwW3ParserParsetable::Dump ----------------------------------------
| | | | /==================== MwW3ParserParsetable::mark ========================================
| | | | | /==================== MwW3ParserLexerStateGroup::mark ================================
| | | | | | mark 1
| | | | | | /==================== MwW3ParserLexerStateGroup::mark =============================
| | | | | | | /==================== MwW3ParserParsetable::Dump ===============================
| | | | | | | | 006E 'n' 1
| | | | | | | \-------------------- MwW3ParserParsetable::Dump -------------------------------
| | | | | | | /==================== MwW3ParserParsetable::mark ===============================
| | | | | | | | /==================== MwW3ParserLexerStateGroup::mark =======================
| | | | | | | | | mark 1
| | | | | | | | | /==================== MwW3ParserLexerStateGroup::mark ====================
| | | | | | | | | | /==================== MwW3ParserParsetable::Dump ======================
| | | | | | | | | | | stack depth: 1688 bytes
| | | | | | | | | | | 0074 't' 1
| | | | | | | | | | \-------------------- MwW3ParserParsetable::Dump ----------------------
| | | | | | | | | | /==================== MwW3ParserParsetable::mark ======================
| | | | | | | | | | | /==================== MwW3ParserLexerStateGroup::mark ==============
| | | | | | | | | | | | mark 1
| | | | | | | | | | | | /==================== MwW3ParserLexerStateGroup::mark ===========
| | | | | | | | | | | | | /==================== MwW3ParserParsetable::Dump =============
| | | | | | | | | | | | | | 0061 'a' 0
| | | | | | | | | | | | | | 0068 'h' 1
| | | | | | | | | | | | | \-------------------- MwW3ParserParsetable::Dump -------------
| | | | | | | | | | | | | /==================== MwW3ParserParsetable::mark =============
| | | | | | | | | | | | | | /==================== MwW3ParserLexerStateGroup::mark =====
| | | | | | | | | | | | | | | mark 0
| | | | | | | | | | | | | | | /==================== MwW3ParserLexerStateGroup::mark ==
| | | | | | | | | | | | | | | | /==================== MwW3ParserParsetable::Dump ====
| | | | | | | | | | | | | | | | | 0078 'x' 0
| | | | | | | | | | | | | | | | \-------------------- MwW3ParserParsetable::Dump ----
| | | | | | | | | | | | | | | \-------------------- MwW3ParserLexerStateGroup::mark --
| | | | | | | | | | | | | | \-------------------- MwW3ParserLexerStateGroup::mark -----
| | | | | | | | | | | | | \-------------------- MwW3ParserParsetable::mark -------------
| | | | | | | | | | | | \-------------------- MwW3ParserLexerStateGroup::mark -----------
| | | | | | | | | | | \-------------------- MwW3ParserLexerStateGroup::mark --------------
| | | | | | | | | | \-------------------- MwW3ParserParsetable::mark ----------------------
| | | | | | | | | \-------------------- MwW3ParserLexerStateGroup::mark --------------------
| | | | | | | | \-------------------- MwW3ParserLexerStateGroup::mark -----------------------
| | | | | | | \-------------------- MwW3ParserParsetable::mark -------------------------------
| | | | | | \-------------------- MwW3ParserLexerStateGroup::mark -----------------------------
| | | | | \-------------------- MwW3ParserLexerStateGroup::mark --------------------------------
| | | | \-------------------- MwW3ParserParsetable::mark ----------------------------------------
| | | \-------------------- MwW3ParserLexerStateGroup::mark --------------------------------------
| | \-------------------- MwW3ParserLexerStateGroup::mark -----------------------------------------
| \-------------------- MwW3ParserParsetable::mark -------------------------------------------------
\-------------------- MwW3ParserLexerStateGroup::mark -----------------------------------------------
----------- garbage collect completed --------------------------------------------------------------- |
* note to self - never open blog editor in two windows - bad things happen.. I just lost a document from a few months ago that I was grabbing some HTML from - luckily it was still in the browser history.
 Thursday, August 26, 2004
 Tuesday, August 24, 2004
The primary goal a parser is to organize a sequence of tokens based on the rules of a formal language. As the parser accepts a sequence of tokens, it determines, based on this information, when the grammar's respective rules are complete and verifies the syntactic correctness of the token sequence. The end result of the process is a "derivation" which represents the token sequence organized following the rules of the grammar.
Typically, Backus-Naur Form is used to define the context free grammar used by the language. The entire language, as a whole, is represented through a single nonterminal called the "start symbol". Often the parse information is stored into a tree, called a derivation tree, where the start symbol is the root node.
There are two distinct approaches currently used to implement parsers. Recursive Descent Parsers and LL parsers are examples of top-down parsers and LR parsers are examples of bottom-up parsers. Most parser generators, such as YACC, use one of the LR algorithm variants.
Grammar Embedded in the Code
The earliest compilers were written with the definition of the langauge buried deeply within the code. With these compilers it was very difficult to verify that the compiler accepted all of the langauge syntax and only the language syntax. This became especially difficult when the definition of the language was changed for later versions. All compilers before the early 1960's were of this type because there wasn't any uniform method of describing the language grammars.
Recursive Descent
With the advent of the BNF notation for describing the languages, compiler writers designed the structure of their subroutines and functions to correspond to the structure of the BNF definition of the language. To use our example grammar above, there would be seperate functions to handle EXP's, TERM's, and FACTOR's. The EXP function would call itself and the TERM function, etc. This way, when it came time to update the language to meet changing standards, it would be easier to find where the changes should be made. It also made it much easier to verify that the language accepted all legal syntax and only the legal syntax.
The Recursive Descent does not guarantee that the program matches the grammar. It only aids in making it easier for the compiler writer to try to verify the accuracy of the parser. The search for better parsing methods continued with some that analyzed the grammars and attempted to automate the parsing methods. The first such method was called Top Down Parsing or LL Parsing.
Top Down Parsing
The top down parsing method, also called a predictive parse or LL parse, requires that we reorganize the grammar so that the first symbol of each rule defining a given Non-Terminal will indicate which rule to choose for the Non-Terminal. This transformation can be done to any grammar, but is sometimes awkward. There are also some cases that cannot be parsed correctly with Top Down Parsing methods.
Bottom UP Parsing
The bottom up parse, also called an LR parse is the most powerful parseing method. It also has the most complicated set of algorithms for building the parse tables. There are a set of algorithms for building LR parse tables. The same algorithm is used for all of the LR parse tables.
LR(0)
The first of the LR parse generation algorithms is called LR(0) and it generates tables that are somewhat large. The LR(0) parse algorithm do not parse grammars with certain types of ambiguities.
LR(1)
The second algorithm, which handles all of the grammars correctly is LR(1). The LR(1) algorithm generates correct parse tables for grammars with all of the ambiguities that are found in most useful langauges. The biggest strike against LR(1) parse tables is the fact that the tables generated are much larger then the LR(0). LR Parsing, or Left-to-right Right-derivation parsing, uses tables to determine when a rule is complete and when additional tokens must be read from the source string. LR parsers identify substrings which can be reduced to nonterminals. Unlike recursive descent parsers, LR parsers do very little "thinking" at runtime. All decisions are based on the content of the parse tables.
LR parser generators construct these tables by analyzing the grammar and determining all the possible "states" the system can have when parsing. Each state represents a point in the parse process where a number of tokens have been read from the source string and rules are in different states of completion. Each production in a state of completion is called a "configuration" and each state corresponds to a configuration set. Each configuration contains a "cursor" which represents the point where the production is complete.
SLR
The third algoritm attempts to handle some of the amiguities that LR(0) fails at and keeps the size of the parse tables the same as those generated by LR(0). It is called Simple LR.
LALR
The last algorithm, Look Ahead LR, generates parse tables that are the same size of LR(0), but handles all of the ambiguities that are handled by LR(1).
"LALR Parsing, or "Lookahead LR parsing", is a variant of LR Parsing which most parser generators, such as YACC, implement. LR Parsing combines related "configuration sets" thereby limiting the size of the parse tables. As a result, the algorithm is slightly less powerful than LR Parsing but much more practical.
Grammars that can be parsed by the LR algorithm, might not be able to be parsed by the LALR algorithm. However, this is very rarely the case and real-world examples are few. The number of states eliminated by choosing LALR over LR is sometimes huge. The C programming language, for instance, has over 10,000 LR states. LALR drops this number to around 350.
Typically, the LR / LALR parsing algorithms, like deterministic finite automata, are commonly represented by using a graph - albeit a more complex variant. For each token received from the scanner, the LR algorithm can take four different actions: Shift, Reduce, Accept and Goto.
For each state, the LR algorithm checks the next token on the input queue against all tokens that expected at that stage of the parse. If the token is expected, it is "shifted". This action represents moving the cursor past the current token. The token is removed form the input queue and pushed onto the parse stack.
A reduce is performed when a rule is complete and ready to be replaced by the single nonterminal it represents. Essentially, the tokens that are part of the rule's handle - the right-hand side of the definition - are popped from the parse stack and replaced by the rule's nonterminal plus additional information including the current state in the LR state machine.
When a rule is reduced, the algorithm jumps to (gotos) the appropriate state representing the reduced nonterminal. This simulates the shifting of a nonterminal in the LR state machine.
Finally, when the start symbol itself is reduced, the input is both complete and correct. At this point, parsing terminates."
http://www.devincook.com/goldparser/concepts/lalr.htm
Good Runthrough of parsing. http://www.cs.wpi.edu/~kal/courses/compilers/module3/mybuparsing.html Parsing
 Thursday, July 01, 2004

Data Flow Diagram For W3 Parsing Infrustructure
Red denotes what I will be building as first stage to get to the functionality I need for XPath
Blask denotes what I want to add later.
I have defined a new module for MetaWrap which consists of 4 Logical groups of classes. MwParser, MwParserBNF* and MwParserCST and MwParserAST.
All the new W3 Specs seem to be heading towards shared definitions using BNF. (eh XQuery, XPath2.0 and XSLT2.0) . These new modules are a way by which code within MetaWrap that implements the parsing and interpretation these specifications can self generate parsers and Abstract Syntax Trees while sharing a majority of the infrustrature.
By following this path it should be possible to shrink the W3 parsers into a single code base so that parsing code is not replicated as it is at the moment.
My initial approach is to just implement stub classes here so that the data flow can be worked out properly and I can gather the rest of the requirements. My main aim is still to develop XPath as quickly as possible.
MwParserEBNF
MwParserEBNF implements a EBNF Grammar. This consists of a group of classes that can be overridden to specialise the parsing and generation of Abstract Syntax Trees. It is the responsibility of the specialised MwParserEBNF* classes to define what is generated in the final AST. The implementations of the MwParserEBNF* classes contain hints (Grammar Notes) to aid in tokenisation.
A.1.1 Grammar Notes
This section contains general notes on the EBNF productions, which may be helpful in understanding how to create a parser based on this EBNF, how to read the EBNF, and generally call out issues with the syntax. The notes below are referenced from the right side of the production, with the notation: /* gn: <id> */.
- grammar-note: parens
-
A look-ahead of one character is required to distinguish function patterns from a QName followed by a comment. For example: address (: this may be empty :) may be mistaken for a call to a function named "address" unless this lookahead is employed.
- grammar-note: lt
-
Token disambiguation of the overloaded "<" pattern is defined in terms of positional lexical states. The "<" comparison operator can not occur in the same places as a "<" tag open pattern. The "<" comparison operator can only occur in the OPERATOR state and the "<" tag open pattern can only occur in the DEFAULT state. (These states are only a specification tool, and do not mandate an implementation strategy for this same effect.)
- grammar-note: leading-lone-slash
-
The "/" presents an issue because it occurs both in a leading position and an operator position in expressions. Thus, expressions such as "/ * 5" can easily be confused with the path expression "/*". Therefore, a stand-alone slash, in a leading position, that is followed by an operator, will need to be parenthesized in order to stand alone, as in "(/) * 5". "5 * /", on the other hand, is fine.
- grammar-note: comments
-
Expression comments are allowed inside expressions everywhere that ignorable white space is allowed.
- grammar-note: xml-version
-
The general rules for [XML 1.1] vs. [XML 1.0], as described in the A.2 Lexical structure section, should be applied to this production. |
Notes from XPath2.0 Specification about grammar hints
Good simple eplanation of LL vs LR of parsing here.
MwParser implements a parser and is initialized with a BNF. A MwParser can then take input and build a MwCST (Concrete Syntax Tree) or a MwAST (Abstract Syntax Tree). If The BNF classes are specialised then they can generate a specialized and or decorated AST/CST nodes. Parsing implemented by the walking of state tables. (fig 1) generated from the target grammar.
MwParserCST imlements an Concrete Syntax Tree which is the output of MwParser and its construction is performed by specialised MwBNF* classes.
|
Definition: Concrete Syntax Tree
A parse tree is a record of the rules (and tokens) used to match some input text.
A parse tree is a labeled tree in which the interior nodes represent nonterminals, leaf nodes represent terminals, and the children of interior nodes represent the replacement of the associated nonterminal
A concrete syntax tree contains all the text that has been parsed, so also all the white space (spaces, tabs, newlines) and code comments. The original text can be completely reproduced from a concrete syntax tree. |
MwParserAbstractor translates the Concrete Syntax Tree to an Abstract Syntax Tree
MwParserAST Implements an Abstract Syntax tree.
ASTs are superior because they record the structure of the input not the artifacts of your grammar; consequently they are insensitive to changes in your input grammar, easier to manipulate during translation, and are much smaller.
There are two kinds of parse trees: (1) Abstract Syntax Tree eliminates non-interesting terminals and non-terminals, and (2) Concrete Syntax Tree preserves all symbols and exactly represents the input source. (fig 2)
|
Definition: Abstract Syntax Tree A abstract syntax tree records the structure of the input and is insensitive to the grammar that produced it. It is the role of an AST to provide a clean interface between parser and later phases.
|
A C++ class based parser that uses a runtime generated BNF tree suffers in speed in that it must search a grammar tree to identify which token is next instead of being able to use a tightly constrained switch operation.
So for the following BNF Grammar
|
Binary ::= Zero | One;
Zero ::= '0';
One ::= '1'; |
Lets say the next character is '1'
In a grammar tree when calling Binary::parse(char p c) we need to ask both Zero and One if '1' matches them. For small numbers of terms it should be reasonably fast. But we always have the option of emitting custom C++ classes and implementations of Binary::parse(char p c).
Initial Requirements include...
A generic MwParser should be able to take in a MwBNFGrammar and build an MwAST.
A generic MwParser should be able to take in a MwBNFGrammar and emit code for a specialised MwParser that can build the MwAST.
MwBNFGrammar must be able to intelligently handle constants such that we don’t build dumb AST. For example
|
number ::= {dgt} ['.' {dgt}] [('e'|'E') ['-'] {dgt}];
dgt ::= '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
|
For input “1.3e-8”
The AST tree for this should be
number(“1.3e-8”)
and not
number(dgt(1),’.’,dgt(3),’e’,’-’, dgt(8))
So we really have two levels of BNF tokenisation. We somehow gave to mark certain assignments as being atomic constants such that if all they consist of is constants, then we are just building a string. The boundary between these levels defines where Lexical processing and parsing begins.
But my initial requirements are simply what can get me parsing simple XPath ASAP using the minimum part of this framework such that it can grow as I require it.
 Tuesday, June 29, 2004
Parsing And Protocols
There are really two types of information packing in protocols and grammars and these boil down to data being length or lexically delimited. A parser implements a certain grammar. The grammar parses a certain pattern of tokens into memory. How do you know when you have completed parsing a token from the grammar? Either the grammar defines an explicit length for the token or it defines a beginning and end signifier (either through a lower order lexical grammar or through a dictionary)
For example
Length Explicit
[5][H][e][l][l][o]
From this we have
Hello
In this grammar we start with the first byte defining the number of bytes to follow. This is my most favorite method of packing because it allows you to allocate the minimum amount of memory on the receiving end.. which in MetaWrap is often a thin client . This runs into problems if you want to pack things in a hierarchy. Eg.
[12][5][H][e][l][l][o][5][W][o][r][l][d]
When preparing this for transmission, you can’t send the first byte until you know the length of every element in the hierarchy.
Lexical
[“][h][e][l][l][o][”]
We have a starting character for a string and a terminating character. We start with the [“]and keep grabing data until we get the [”].We don’t know when we will get the [”].So we can’t prepare the correct buffer length.
This is why I prefer Length Explicit packing and grammars. They scale better. This is why MetaWrap uses this as its primary data packing and transmission format.
One Or Two Passes?
The XPath 1.0 Specification states that
|
Expressions are parsed by first dividing the character string to be parsed into tokens and then parsing the resulting sequence of tokens. Whitespace can be freely used between tokens. The tokenization process is described in [3.7 Lexical Structure]. |
This seems to suggest that it should be performed in two passes, but from reading the rest of the specification, I see no reason why it can't be performed in one pass.
Update:
The XPath 2.0 Specification is a bit more explicit
|
During the static analysis phase, the XPath expression is parsed into an internal representation called the operation tree (step SQ1 in Figure 1). A parse error is raised as a static error.[err:XP0003] The static context is initialized by the implementation (step SQ2). The static context is used to resolve type names, function names, namespace prefixes and variable names. |
© Copyright 2009 James Mc Parlane
Theme design by Bryan Bell
newtelligence dasBlog 1.9.7174.0  | |  | Page rendered at Sunday, July 05, 2009 7:10:52 AM (AUS Eastern Standard Time, UTC+10:00)
|
On this page....
| | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|
| 28 | 29 | 30 | 1 | 2 | 3 | 4 | | 5 | 6 | 7 | 8 | 9 | 10 | 11 | | 12 | 13 | 14 | 15 | 16 | 17 | 18 | | 19 | 20 | 21 | 22 | 23 | 24 | 25 | | 26 | 27 | 28 | 29 | 30 | 31 | 1 | | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Search
Navigation
Categories
Blogroll
Sign In
|