A Brief History Of Parsing

The primary goal of a parser is to organize a sequence of tokens based on the rules of a formal language. As the parser accepts tokens, it determines when each of the grammar’s rules is complete and verifies the syntactic correctness of the token sequence. The end result of the process is a “derivation”, which represents the token sequence organized according to the rules of the grammar.

Typically, Backus-Naur Form is used to define the context-free grammar of the language. The entire language, as a whole, is represented through a single nonterminal called the “start symbol”. Often the parse information is stored in a tree, called a derivation tree, where the start symbol is the root node.

There are two distinct approaches currently used to implement parsers: top-down and bottom-up. Recursive descent parsers and LL parsers are examples of top-down parsers, and LR parsers are examples of bottom-up parsers. Most parser generators, such as YACC, use one of the LR algorithm variants.

Grammar Embedded in the Code

The earliest compilers were written with the definition of the language buried deeply within the code. With these compilers it was very difficult to verify that the compiler accepted all of the language syntax and only the language syntax. This became especially difficult when the definition of the language was changed for later versions. All compilers before the early 1960s were of this type because there wasn’t any uniform method of describing language grammars.

Recursive Descent

With the advent of the BNF notation for describing languages, compiler writers designed the structure of their subroutines and functions to correspond to the structure of the BNF definition of the language. For a typical expression grammar, for example, there would be separate functions to handle EXPs, TERMs, and FACTORs. The EXP function would call itself and the TERM function, etc. This way, when it came time to update the language to meet changing standards, it would be easier to find where the changes should be made. It also made it much easier to verify that the compiler accepted all legal syntax and only the legal syntax.
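As a rough illustration of one function per nonterminal, here is a minimal sketch in Python. The grammar in the comments is an assumption (this text only names EXP, TERM, and FACTOR), and the names `tokenize`, `Parser`, and `parse` are made up for the example. The repetition in EXP and TERM is written as a loop, because a function that called itself before consuming any input would never terminate; EXP still recurses through the parenthesized case in FACTOR.

```python
import re

TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """One method per nonterminal of the assumed grammar."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def exp(self):
        # EXP ::= TERM { ("+" | "-") TERM }
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # TERM ::= FACTOR { ("*" | "/") FACTOR }
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        # FACTOR ::= NUMBER | "(" EXP ")"
        token = self.peek()
        if token == "(":
            self.pos += 1
            value = self.exp()
            self.expect(")")
            return value
        if token is not None and token.isdigit():
            self.pos += 1
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    """Parse and evaluate an arithmetic expression, rejecting trailing input."""
    parser = Parser(tokenize(text))
    value = parser.exp()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

Because each method carries its grammar rule as a comment, a change to the language definition points directly at the method that has to change.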

Recursive descent does not guarantee that the program matches the grammar. It only makes it easier for the compiler writer to verify the accuracy of the parser. The search for better parsing methods continued, with some that analyzed the grammars and attempted to automate the construction of the parser. The first such method was called Top Down Parsing or LL Parsing.

Top Down Parsing

The top down parsing method, also called a predictive parse or LL parse, requires that we reorganize the grammar so that the first symbol of each rule defining a given nonterminal indicates which rule to choose for that nonterminal. This transformation can be applied to many grammars, but is sometimes awkward. There are also some cases that cannot be parsed correctly with Top Down Parsing methods.
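To make the idea of choosing a rule from the next token concrete, here is a small table-driven sketch in Python. The grammar, the table, and the names `TABLE`, `NONTERMINALS`, and `ll1_accepts` are all assumptions written by hand for illustration: the expression grammar has been reorganized with the helper nonterminals EXP_TAIL and TERM_TAIL so that one token of lookahead is enough to pick a rule.

```python
# Hypothetical LL(1) table: (nonterminal, next token) -> right-hand side.
# Reorganized grammar (an assumption for this sketch):
#   EXP       -> TERM EXP_TAIL
#   EXP_TAIL  -> "+" TERM EXP_TAIL | empty
#   TERM      -> FACTOR TERM_TAIL
#   TERM_TAIL -> "*" FACTOR TERM_TAIL | empty
#   FACTOR    -> "num" | "(" EXP ")"
TABLE = {
    ("EXP", "num"): ["TERM", "EXP_TAIL"],
    ("EXP", "("): ["TERM", "EXP_TAIL"],
    ("EXP_TAIL", "+"): ["+", "TERM", "EXP_TAIL"],
    ("EXP_TAIL", ")"): [],
    ("EXP_TAIL", "$"): [],
    ("TERM", "num"): ["FACTOR", "TERM_TAIL"],
    ("TERM", "("): ["FACTOR", "TERM_TAIL"],
    ("TERM_TAIL", "*"): ["*", "FACTOR", "TERM_TAIL"],
    ("TERM_TAIL", "+"): [],
    ("TERM_TAIL", ")"): [],
    ("TERM_TAIL", "$"): [],
    ("FACTOR", "num"): ["num"],
    ("FACTOR", "("): ["(", "EXP", ")"],
}
NONTERMINALS = {"EXP", "EXP_TAIL", "TERM", "TERM_TAIL", "FACTOR"}


def ll1_accepts(tokens):
    """Return True if the token list is a sentence of the grammar above."""
    tokens = list(tokens) + ["$"]   # "$" marks the end of the input
    stack = ["$", "EXP"]            # start symbol on top of the end marker
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            # Predict: the next token alone selects the rule to expand.
            rhs = TABLE.get((top, look))
            if rhs is None:
                return False
            stack.extend(reversed(rhs))
        elif top == look:
            pos += 1                # terminal on the stack matches the input
        else:
            return False
    return pos == len(tokens)
```

For example, `ll1_accepts(["num", "+", "num", "*", "num"])` walks the table without ever backtracking, while `ll1_accepts(["num", "num"])` fails as soon as no table entry exists for the pair (TERM_TAIL, num).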

Bottom Up Parsing

The bottom up parse, also called an LR parse, is the most powerful parsing method. It also has the most complicated set of algorithms for building the parse tables. There are several algorithms for building LR parse tables, but the same parsing algorithm is used to drive all of the resulting tables.

LR(0)

The first of the LR parse generation algorithms is called LR(0) and it generates tables that are somewhat large. The LR(0) parse algorithm does not parse grammars with certain types of ambiguities.

LR(1)

The second algorithm, LR(1), handles the grammars that LR(0) cannot. It generates correct parse tables for grammars with all of the ambiguities that are found in most useful languages. The biggest strike against LR(1) parse tables is that they are much larger than the LR(0) tables.

LR Parsing, or Left-to-right, Rightmost-derivation parsing, uses tables to determine when a rule is complete and when additional tokens must be read from the source string. LR parsers identify substrings which can be reduced to nonterminals. Unlike recursive descent parsers, LR parsers do very little “thinking” at runtime. All decisions are based on the content of the parse tables.

LR parser generators construct these tables by analyzing the grammar and determining all the possible “states” the system can have when parsing. Each state represents a point in the parse process where a number of tokens have been read from the source string and rules are in different states of completion. Each production in a state of completion is called a “configuration” and each state corresponds to a configuration set. Each configuration contains a “cursor” which marks how much of the production has been matched so far; when the cursor reaches the end, the production is complete.
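A configuration and its cursor can be pictured as a production plus an index into its right-hand side. The following Python sketch builds the configuration set for the start state of a tiny grammar; the grammar and the names `GRAMMAR`, `closure`, `advance`, and `show` are assumptions chosen for illustration, not the code of any particular parser generator.

```python
# Tiny assumed grammar: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [("EXP",)],
    "EXP": [("EXP", "+", "TERM"), ("TERM",)],
    "TERM": [("num",)],
}


def closure(configs):
    """Complete a configuration set.

    A configuration is (head, body, cursor). Whenever the cursor sits in
    front of a nonterminal, every production of that nonterminal is added
    with its cursor at position 0.
    """
    result = set(configs)
    work = list(configs)
    while work:
        head, body, cursor = work.pop()
        if cursor < len(body) and body[cursor] in GRAMMAR:
            for production in GRAMMAR[body[cursor]]:
                config = (body[cursor], production, 0)
                if config not in result:
                    result.add(config)
                    work.append(config)
    return result


def advance(configs, symbol):
    """Move the cursor past `symbol` to obtain the next state's configurations."""
    moved = {
        (head, body, cursor + 1)
        for head, body, cursor in configs
        if cursor < len(body) and body[cursor] == symbol
    }
    return closure(moved)


def show(config):
    """Render a configuration with '.' standing for the cursor."""
    head, body, cursor = config
    symbols = list(body[:cursor]) + ["."] + list(body[cursor:])
    return f"{head} -> {' '.join(symbols)}"


start_state = closure({("S", ("EXP",), 0)})
after_exp = advance(start_state, "EXP")
```

Here `start_state` holds four configurations, all with the cursor at the far left, and `after_exp` holds the two configurations in which EXP has already been recognized. A table generator repeats `advance` for every symbol and every new state until no new configuration sets appear.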

SLR

The third algorithm attempts to handle some of the ambiguities that LR(0) fails at while keeping the size of the parse tables the same as those generated by LR(0). It is called Simple LR.

LALR

The last algorithm, Look Ahead LR, generates parse tables that are the same size as those of LR(0), but handles nearly all of the ambiguities that are handled by LR(1).

“LALR Parsing, or “Lookahead LR parsing”, is a variant of LR Parsing which most parser generators, such as YACC, implement. LALR Parsing combines related “configuration sets” thereby limiting the size of the parse tables. As a result, the algorithm is slightly less powerful than LR Parsing but much more practical.

Grammars that can be parsed by the LR algorithm might not be parsable by the LALR algorithm. However, this is very rarely the case and real-world examples are few. The number of states eliminated by choosing LALR over LR is sometimes huge. The C programming language, for instance, has over 10,000 LR states. LALR drops this number to around 350.

The LR / LALR parsing algorithms, like deterministic finite automata, are commonly represented using a graph, albeit a more complex variant. For each token received from the scanner, the LR algorithm can take four different actions: Shift, Reduce, Accept and Goto.

For each state, the LR algorithm checks the next token on the input queue against all tokens that are expected at that stage of the parse. If the token is expected, it is “shifted”. This action represents moving the cursor past the current token. The token is removed from the input queue and pushed onto the parse stack.

A reduce is performed when a rule is complete and ready to be replaced by the single nonterminal it represents. Essentially, the tokens that are part of the rule’s handle – the right-hand side of the definition – are popped from the parse stack and replaced by the rule’s nonterminal plus additional information including the current state in the LR state machine.

When a rule is reduced, the algorithm jumps to (gotos) the appropriate state representing the reduced nonterminal. This simulates the shifting of a nonterminal in the LR state machine.

Finally, when the start symbol itself is reduced, the input is both complete and correct. At this point, parsing terminates.”

http://www.devincook.com/goldparser/concepts/lalr.htm
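The shift, reduce, goto and accept actions described in the quoted passage can be sketched as a short table-driven loop in Python. The grammar (E -> E + n | n), the state numbers, and the hand-built `ACTION` and `GOTO` tables are assumptions made for this example; a real parser generator would compute them from the grammar. For brevity the stack holds only state numbers rather than tokens plus states.

```python
# Assumed grammar:  1: E -> E + n    2: E -> n
# Production number -> (nonterminal, length of the right-hand side).
PRODUCTIONS = {
    1: ("E", 3),
    2: ("E", 1),
}

# Hand-built tables for the five states of this grammar.
ACTION = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1}


def lr_parse(tokens):
    """Run the LR loop and return the list of actions taken."""
    queue = list(tokens) + ["$"]    # "$" marks the end of the input
    stack = [0]                     # parse stack of state numbers
    trace = []
    pos = 0
    while True:
        state, look = stack[-1], queue[pos]
        action = ACTION.get((state, look))
        if action is None:
            raise SyntaxError(f"unexpected {look!r} in state {state}")
        kind, arg = action
        if kind == "shift":
            # Consume the token and enter the state the table names.
            stack.append(arg)
            pos += 1
            trace.append(f"shift {look}")
        elif kind == "reduce":
            # Pop one state per symbol of the handle, then goto the state
            # for the reduced nonterminal.
            head, length = PRODUCTIONS[arg]
            del stack[-length:]
            stack.append(GOTO[(stack[-1], head)])
            trace.append(f"reduce {arg}")
        else:
            trace.append("accept")
            return trace
```

Calling `lr_parse(["n", "+", "n"])` shifts the first `n`, reduces it to E, shifts `+` and the second `n`, reduces the three-symbol handle to E, and then accepts. All of the decisions come from dictionary lookups, which is the sense in which an LR parser does little “thinking” at runtime.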

A good run-through of parsing: http://www.cs.wpi.edu/~kal/courses/compilers/module3/mybuparsing.html
