Monday, December 04, 2006

This is old news, but I'm thinking of getting into the next stage of this project. My "state table only" parser started passing all 70 of its test-cases some time in the middle of this year.

I've just been too busy to blog about it. I'm getting a few things out of the way before some other projects shift into gear.

 

test52.dot.png

Or for something more amusing...

http://blog.metawrap.com/blog/content/binary/MetaWrap/Parser/test66.dot.png

I'll be applying this to real time XML schema parsing and validation, GL shader construction, protocol generators and data-flow based computing.

My algorithm builds a near optimal state transition table in one pass - the next step is adding some of the standard modifiers such as the Kleene Closure and other standard EBNF unary operations. The KC will be the hardest because its at the root of one of the hardest greedy grammar problems.

DFAsmall.png

( a | b )* abb

The others will be fairly trivial. My KC solution will not be perfect; I'm limiting myself on purpose. Its not my aim to end up with yet another general solution. I want the fastest for my problem space. I'm limiting myself to grammars that don't contain the ambiguities that lead to greediness. My theory is that they don't end up in real world situations often, if they do they can be worked around by a simple re-factoring, or are sign that the grammar is messy.

 More about this in previous posts on the topic.

Monday, December 04, 2006 8:58:18 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]
 Friday, June 24, 2005

[Rescued from the old blog - Thanks to Damian]

I'm trawling through conventional theory at the moment to see if my approach shows up on the radar anywhere - if it does then I can probably get some hints and tips on what could make it better or if I am doing something wrong.

My algorithm skips the whole NFA -> DFA transformation stage and completes the build of what should be an optimal DFA in what I hope will be one pass through the grammar. This is only applicable to context free grammars where each lexeme is unambiguous.

The data structure for I use maps transitions into groups of states. This is evident in my description diagrams where there is a more than one dot notation for the applicable lexeme eg. http://set-top.net/tests/dot/test13.png . The aim has always been to preserve the state structure and lexeme context. More importantly it allows transitions to be reduced by effectively allowing the overloading of transitions, for each state group is effectively polymorphic as one or more potential states.

I have reasons for this approach that will become self evident when I explain the endgame of the implementation.

This is the first of three posts....

Finite State Automata In General

Formally, an automaton is represented by 5tupple Q ∑ δ S0 F ,where:

  • Q is a finite set of states.
  • ∑ is a finite set of symbols, that we will call the alphabet of the language the automaton accepts.
  • δ is the transition function, that is
  • S0 is the start state, that is, the state in which the automaton is when no input has been processed yet (Obviously, S0∈ Q).
  • F is a set of states of Q (i.e. F⊂Q), called accept states.

With all this, we can now say that the language L accepted by an DFA automaton M is:

language.png

Deterministic Finite State Automata (DFA)

The term deterministic can mean two things:

  • for recognizers, it means that for each state there cannot be more than one out transitions with the same label, there is one initial state, and there are no epsilon transitions;
  • for transducers, it usually means that in addition there can be no two out transitions with the same label on the "input" level, and different labels on the "output" level.

A DFA has no ε transitions

DFA’s transition function is single-valued

For a DFA, the action of automaton on each input symbol is fully determined which provides an obvious table-driven implementation.

The following state diagram represents the regular language grammar ( a | b )* abb as a deterministic finite state automata. DFA.

DFAsmall.png

( a | b )* abb

We can see that this is deterministic because

  • All states have unambiguous transitions (from state to state, for a given character you will always end up in the same state).
  • No ε (empty string) transitions

Non-deterministic Finite State Automata (NFA)

A NFA has..

  • A set of states a start state and at least one accepting state.
  • Arrows connecting states labeled by input symbols, or ε (which does not consume input)
  • More than one arrow leaving a state may have same label

For an NFA..

  • an automaton may have choice on each step
  • automaton accepts a string if there is any way to make choices to arrive at accepting state / every path from start state to an accept state is a string accepted by automaton
  • not obvious how to implement efficiently!

The following state diagram represents the regular language ( a | b )* abb as a non-deterministic finite state automata (NFA).

 

 

NDFA-small.png

( a | b )* abb

We can see that this is non deterministic because of the following.

  • S0 has a transition on ε
  • S1 has two transitions on a

Alternating Finite Automata (AFA)

In automata theory, an alternating finite automaton (AFA) is a non-deterministic finite automaton whose transitions are divided into existential and universal transitions. Let A be an alternating automaton.

  • For a transition afat1.png, A nondeterministically chooses to switch the state to either q1 or q2, reading a.
  • For a transition afat2.png, A moves to q1 and q2, reading a.

Note that due to the universal quantification a run is represented by a run tree. A accepts a word w, if there exists a run tree on w such that every path ends in an accepting state.

A basic theorem tells that any AFA is equivalent to an non-deterministic finite automaton (NFA) by performing a similar kind of powerset construction as it is used for the transformation of a NFA to a deterministic finite automaton (DFA). This construction converts an AFA with k states to a NFA with up to 2k states.


Finite State Transducers (FST)

A finite state transducer (FST) is a finite state machine with two tapes.

A finite state transducer is defined in the same way as an FSM except that:

  • In addition to the input alphabet, an output alphabet is specified
  • Typically there is no notion of acceptance of a string, so no accepting states are specified
  • Each transition has an associated output string (i.e., string over the output alphabet)

A finite state transducer is drawn in the same way as an FSM except that:

  • Typically there is no notion of acceptance of a string, so no state is designated as accepting
  • Each transition is marked with a symbol or set of symbols as in the case of an FSM, followed by a forward slash, followed by an output string

Termination is decided not on a reaching a particular state, but on a particular transition between states.

Contrast this with an ordinary finite state automaton, which has a single tape. An automaton can be said to recognize a string if we view the content of its tape as input. In other words, the automaton computes a function that maps strings into the set {0,1}. Alternatively, we can say that an automaton generates strings, which means viewing its tape as an output tape. On this view, the automaton generates a formal language, which is a set of strings. The two views of automata are equivalent: the function that the automaton computes is precisely the characteristic function of the set of strings it recognized. The class of languages generated by finite automata is known as the class of regular languages.

The two tapes of a transducer are typically viewed as an input tape and an output tape. On this view, a transducer is said to transduce (i.e., translate) the contents of its input tape to its output tape, by accepting a string on its input tape and generating another string on its output tape. It may do so nondeterministically and it may produce more than one output for each input string. A transducer may also produce no output for a given input string, in which case it is said to reject the input. In general, a transducer computes a relation between two formal languages. The class of relations computed by finite state transducers is known as the class of rational relations.

Finite State Transducers are typically useful in NLP (Natural Language Processing) research.

TransDiagram.gif

(a | b)* aab

Completes on output of 1

Normal Non-deterministic Finite State Automata (NNFA)

A normal NFA (abbr. NNFA) is an NFA in which all edges leading into the same state have the same label. Thus, it is convenient to label states instead of edges. Visually this has a strong correpondence with the language is recognises.

NNFA-small.png

(a | b) * abb

Compressed Normal Non-deterministic Finite State Automata (CNNFA)

http://www.cs.nyu.edu/web/Research/Theses/chang_chia-hsiang.pdf

CNNFA.png

(a | b) * abb

Comparison between NFAs and DFAs

NFAs are more compact - they generally require fewer states to recognize a language.

It obvious that a DFA can be simulated with an NFA, because the set of all DFAs is a subset of the set of all NFAs. (Proof)

Its less obvious that every NFA can be simulated with a DFA, and NFA has an equivalent DFA. It is possible to convert an existing NFA into a DFA for the purpose of implementing a simpler machine. This can be performed using the powerset construction which may lead to an exponential raise in the number of necessary states. Such determinization however is necessary to build a complement automaton.

 

[UPDATE] - corrected an image that I mixed up when resizing them. The First NFA image was a repeat of the DFA.

Friday, June 24, 2005 8:31:33 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]

I'm trawling through conventional theory at the moment to see if my approach shows up on the radar anywhere - if it does then I can probably get some hints and tips on what could make it better or if I am doing something wrong.

This is the second of three posts.

Conventional wisdom is that you perform the following steps.

  1. Write down the regular language for the input grammar.
  2. Convert the NFA into a DFA (deterministic finite state automata).
  3. Systematically shrink the DFA.
  4. Turn it into code or a lookup table.

Step 1) Write Down The Regular Language

( a | b )* abb

Step 2) Build An NFA (Using Thompson's Construction aka Inductive Construction)

For a Given Regular language, Building the NFA is generally done using Thompson's construction.

First parse a regular language into constituent subexpressions.

eg. for ( a | b )* abb you would produce the following expression tree...

 

REtree.png

 

Now traverse the tree depth first and construct the NFA using the following rules.

a) For ε (no input) construct the following pattern

thompson_rule1.png

b) For a single character, 'a' construct the following pattern.

thompson_rule2.png

 

Now we get into the rules for compositing TFA's together with the various logical operations.

c) For constructing a composite NFA from two pre-existing NFAs s and t in the form of (s | t) .  Let NFA(s) and NFA(t) be the NFAs for regular expressions s and t. Construct the following pattern.

thompson_rule3.png

s|t

Newly added nodes are in orange.

 

d) For constructing a composite NFA from two pre-existing NFAs s and t in the form of st .  Let NFA(s) and NFA(t) be the NFAs for regular expressions s and t. Construct the following pattern.

thompson_rule4.png

st

Merge final state of NFA(s) and initial state of NFA(t) (in orange)

 

 

e) For constructing an NFA in the patten of a Kleene Star (a sequence of zero or more elements) from an existing NFAs s in the form of s* .  Let NFA(s) be the NFA for the regular expressions s. Construct the following pattern.

thompson_rule5.png

s*

Newly added nodes are in orange.

 

f) For constructing an NFA in the patten of a Kleene Cross (a sequence of at least one element) from an existing NFAs s in the form of s+ .  Let NFA(s) be the NFA for the regular expressions s. Construct the following pattern.

 

thompson_rule6.png

s+

Newly added nodes are in orange.

 

g) For constructing an NFA in the patten of an option (zero or one element) from an existing NFAs s in the form of s? .  Let NFA(s) be the NFA for the regular expressions s. Construct the following pattern.

thompson_rule7.png

s?

Newly added nodes are in orange.

 

Our example language of ( a | b ) * abb would parse as

thompsons_example2.png

( a | b ) * abb

 

However the following is what you get if a human designs it.

thompsons_example2h.png

A more dramatic example of the inefficiency is the NFA generated for  a ( b | c ) * would look like

thompsons_example1-small.png

a ( b | c ) *

Generated by Thompson's Method

However the following is what you get if a human designs it.

thompsons_example1h.png

a ( b | c ) *

Generated by Common Sense

But its good to know we can automate the generation - even if the result is not optimal.

 

Step 3) Convert the NFA into a DFA (Determinisation)

A Commonly used algorithm - subset construction.

From the point of view of the input, any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character. Thus states which are connected by an ε-transition will be represented by the same states in the DFA.

If it is possible to have multiple transitions based on the same symbol, then we can regard a transition on a symbol as moving from a state to a set of states (ie. the union of all those states reachable by a transition on the current symbol). Thus these states will be combined into a single DFA state.

Ni is a state from the source NFA


∑ is a finite set of symbols, that we will call the alphabet of the language the automaton accepts


Move( Si , a )  is the set of states reachable from Si by a. This function takes a state and a character, and returns the set of states reachable by one transition on this character.


ε-closure( Si ) is the set of states reachable from Si by ε . This function takes a state and returns the set of states reachable from it based on (one or more) ε-transitions. Note that this will always include the state tself. We should be able to get from a state to any state in its ε-closure without consuming any input.

Start state derived from S0 of the NFA
Take ε-closure(S0)
Take the image of S0, Move(S0, ά) for each ά ∈ Σ, and take its ε-closure
Iterate until no more states are added

S0 := ε-closure({N0})
S := { S0 }
W := { S0 }
while ( W != Ø )
   select and remove s from W
   for each ά ∈Σ
      t := ε-closure(Move( s,ά ))
      T[ s , ά ] := t
      if ( t not in S ) then
         add t to S
         add t to W

Output is a set of states S and a set of transitions T.

Worst case is an exponential increase in the number of states and transitions.

A list of conventional algorithms.

Ted Leslie, Efficient Approaches to Subset Construction, masters thesis, Computer Science, University of Waterloo, 1995.[bibtex]

J. Howard Johnson and Derick Wood, Instruction Computation in Subset Construction, in Automata Implementation, Darrell Raymond and Derick Wood and Sheng Yu (eds.), Lecture Notes in Computer Science 1260, pp. 64-71, Springer Verlag, 1997.[bibtex]

Gertjan van Noord, The Treatment of Epsilon Moves in Subset Construction, Computational Linguistics, 26(1), pp. 61-76, April 2000. Available from www.let.rug.nl/~vannoord/papers/[bibtex]

Mehryar Mohri, Finite-State Transducers in Language and Speech Processing, Computational Linguistics, 23(2), pp. 269-311, June 1997.[bibtex]

Mehryar Mohri, On some applications of finite-state automata theory to natural language processing, Natural Language Engineering, vol. 2, pp. 61-80, 1996. Originally appeared in 1994 as Technical Report, institut Gaspard Monge, Paris.[bibtex]

Mehryar Mohri and Fernando C.N. Pereira and Michael Riley, A Rational Design for a Weighted Finite-State Transducer Library, Automata Implementation. Second International Workshop on Implementing Automata, WIA '97, Lecture Notes in Computer Science 1436, Springer Verlag, 1998.[bibtex]

Emmanuel Roche and Yves Schabes, Deterministic Part-of-Speech Tagging with Finite-State Transducers, Computational Linguistics, 21(2), pp. 227-253, June, 1995.[bibtex]

Step 4) Systematicaly Shrink The DFA (Minimisatiion)

A list of conventional algorithms.

The fastest (O(|Q|log|Q|), where |Q| is the number of states) minimization algorithm (by John E. Hopcroft) is described in:

John E. Hopcroft, An n log n algorithm for minimizing the states in a finite automaton, in The Theory of Machines and Computations, Z. Kohavi (ed.), pp. 189-196, Academic Press, 1971.[bibtex]

David Gries, Describing an Algorithm by Hopcroft, Acta Informatica, vol. 2, pp. 97-109, 1973.[bibtex]

A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley Publishing Company, 1974.[bibtex]

The third item on the list (i.e. the book) is probably the easiest to find, but beware that the description is somewhat hidden. Look for "partitioning". The problem of minimizing states in the set of states of an automaton is equivalent to finding the coarsest partition of the set of states, consistent with the initial partition into final and non-final states, such that if states s and t are in one block of the partition, then states delta(s, a) and delta(t, a) are also in one block of the partition for each input symbol a.

An algorithm that is considered by some people to be equally fast for frequently encountered data is Brzozowski's algorithm. It minimizes an automaton by performing reversal and determinization twice. The subset construction (determinization) is O(2|Q|), where Q is a set of states, but it depends on input (see master thesis by Ted Leslie in the Determinization section). The second paper here describes complexity of operations needed to implement Brzozowski's algorithm.

J. A. Brzozowski, Canonical regular expressions and minimal state graphs for definite events, in Mathematical Theory of Automata, Volume 12 of MRI Symposia Series, pp. 529-561, Polytechnic Press, Polytechnic Institute of Brooklyn, N.Y., 1962.[bibtex]

C. Câmpeanu, K. Culik II, and Sheng Yu, State Complexity of Basic Oprations on Finite Languages, utomata Implementation. Proceedings of 4th International Workshop on Implementing Automata, WIA'99, Oliver Boldt and Helmut Jürgensen (Eds.), Postdam, Germany, LNCS 2214, Springer, 2001.[bibtex]

The following algorithm is different from all others because partial results can be used. It tests pairs of states for equivalence. The first paper describes a version with exponential time complexity. The second paper adds modifications that make it almost O(|Q|2).

Bruce W. Watson, An incremental DFA minimization algorithm, Finite State Methods in Natural Language Processing 2001, ESSLLI Workshop, August 20-24, Helsinki, Finland, 2001.[bibtex]

Bruce B. Watson, Jan Daciuk, An efficient incremental DFA minimization algorithm, Natural Language Engineering, Volume 9, Issue 01, March 2003. [bibtex]

The Hopcroft-Ullman is one of the best known. Its complexity is O(|Q|2), and it also requires O(|Q|2) memory, so it does not seem to be suitable for large automata. Pairs of states are considered, and the automaton has to be complete (i.e. it must have a sink, or dead state). A pair of states is distinguished if one of the states in the pair is final, and the other non-final. A pair of states is also distinguished if transitions labeled with the same symbol lead to another pair of states that was found distinguishable. For each pair of states, a list of pairs is found that are targets of transitions. Initially, pairs containing one final, and one non-final state are marked as distinguishable. Then all pairs that are on lists associated with those pairs are marked, and so on until no new pairs are marked. Pairs that are not marked are equivalent.

John E. Hopcroft and Jefferey D. Ullman, Introduction to Automata Theory, Languages, and Computation, Adison-Wesley Publishing Company, Reading, Massachusets, USA, 1979.[bibtex]

The Aho-Sethi-Ullman algorithm has the same complexity as the Hopcroft-Ullman algorithm, i.e. O(|Q|2), but it needs only O(|Q|) memory. As in the Hopcroft algorithm, the set of states is partitioned, and the blocks are refined in each step, until no more divisions occur. Two states are put into different blocks if they have transitions leading to states in different partitions. In difference to the Hopcroft algorithm, forward transitions are used for that purpose (the Hopcroft algorithm uses back transitions to split blocks with transitions leading to the current block).

Alfred V. Aho and Ravi Sethi and Jeffrey D. Ullman, Compilers. Principles, Techniques and Tools, Addison Wesley, 1986.[bibtex]

For acyclic automata, there are better methods than those given above. The minimization part of incremental and semi-incremental construction algorithms can be used.

Transducers

Mehryar Mohri, Compact Representations by Finite-State Transducers, ACL'94, Morgan Kaufmann, San Francisco, California, USA, 1994.[bibtex]

 

Step 5) Generate Code

Trivial Exercise

Friday, June 24, 2005 8:29:57 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]

I am now dual headed. We will see how it goes. I have the G4 and PC second display on the second monitor. The G4 stays up as long as I don't move much. (Long Story)

Testing all the basic functions of the parsing table generation before I go any further.

Simple Sequence

"first" "last"

http://set-top.net/tests/dot/test14.png

Simple Option

( "A" | "B" )

http://set-top.net/tests/dot/test15.png

Dump Option

( "A" | "A" )

http://set-top.net/tests/dot/test16.png

Note that because of the algorithm used, we get this obvious reduction for free.

Sequence With Option

"A"  ( "X" | "Y" )  "B"

http://set-top.net/tests/dot/test17.png

Sadly this one is completely wrong..... needs work.

Friday, June 24, 2005 8:09:20 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]

[Rescued from the old blog - Thanks to Damian]

The algorithm and data structure are designed to build a near optimal deterministic finite state automata for a regular language with a single pass, unlike the current three pass 'Thompson Process'.

But even I was shocked when it spat this DFA out for the regular language ('A' | 'A' 'B') 'D'.

test23.png

DFA encapsulated NFA for ('A' | 'A' 'B') 'D'.

('A' | 'A' 'B') 'D' can also be expressed as 'A' ['B'] 'D' - and it is this reduction that has been built.

If I was to take the constructed DFA and convert it back into a regular language I would in effect get the simplified 'A' ['B'] 'D'  

Instead of being slightly more split worldpath wise, the convergence really seems to be very strict.

The cardinal semantics of the state group is slightly different than I expected. The second state clearly contains three states, one of which is mutually exclusive to the other two - it essentially describes the 'states that you may be at this group state for'. I will try a thought experiment and try and work out if I can maintain a strict meaning of

Friday, June 24, 2005 12:04:10 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]
 Monday, January 24, 2005

Speaking of old bands - one of my old bad techno songs has been released on http://www.openpodcast.org as The Rhinoceros Song - OpenPodcast.org #759 aka "Mastitis" - people used to just call it "The Rhinoceros Song" so it just kind of stuck :)

The Vogon stood up. "No, well you're completely wrong," he said, "I just write poetry music to throw my mean callous heartless exterior into sharp relief. ...

Been staring at the parser code out of the corner of eye for the last week. I have a fair idea of what needs to be done. I need a pattern that can keep track of the last state of all MwW3ParserLexerStates as they thread through all the MwW3ParserLexerStateGroups - the trick being that the last State in all State theads within a set of optional lexemes needs to be remembered so that the subsequent lexemes of approprate logical precedence can add their first State as an option into each of them and converge/reduce onto a Group for that lexeme.

Makes perfect sense to me at least .. I'm in the process of checking that each of these steps is feasible.

Monday, January 24, 2005 10:15:20 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [2]
 Friday, January 14, 2005

I am now dual headed. We will see how it goes. I have the G4 and PC second display on the second monitor. The G4 stays up as long as I don't move much. (Long Story)

Testing all the basic functions of the parsing table generation before I go any further.

Simple Sequence

"first" "last"

http://www.metawrap.com/tests/dot/test14.png

Simple Option

( "A" | "B" )

http://www.metawrap.com/tests/dot/test15.png

Dump Option

( "A" | "A" )

http://www.metawrap.com/tests/dot/test16.png

Note that because of the algorithm used, we get this obvious reduction for free.

Sequence With Option

"A"  ( "X" | "Y" )  "B"

http://www.metawrap.com/tests/dot/test17.png

Sadly this one is completely wrong..... needs work.

 

 

Thursday, January 13, 2005 11:40:01 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]
 Thursday, January 13, 2005

I am now dual headed. We will see how it goes. I have the G4 and PC second display on the second monitor. The G4 stays up as long as I don't move much. (Long Story)

New test case - fixed a few bugs tonight. Testing all the basic stuff before I go any further.

http://www.metawrap.com/tests/dot/test14.png

 

 

 

Thursday, January 13, 2005 1:39:14 AM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]
 Wednesday, January 12, 2005

The million dollar question is...

Should a construction which consists of a simple option like "ABBB" | "CBBB" have a parse table that compresses and merges the last sequence of states into common groups?

http://www.metawrap.com/tests/dot/test12.png

In the same way that the start sequence is merged in "ABCDEZ" | "ABCDEX"  This testcase demonstrates a clear reason for having the terminating state objects.

http://www.metawrap.com/tests/dot/test13.png

How about something like "ABBB" | "CBB"

I must sleep and ponder on the practicality. It would result in a more compressed parse table.. but at what cost?

 

Tuesday, January 11, 2005 11:57:16 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]
 Tuesday, November 09, 2004
Tuesday, November 09, 2004 9:38:49 AM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]
 Thursday, November 04, 2004

http://www.metawrap.com/tests/dot/test9.png

Minimal testcase for simplest production I could think of - with an at that un unhandled visitor type. (now implemented of course). The wonderful thing about applying such a complicated visitor pattern to this type of problem is that even though the first type is hard and laborious, the rest seem to quickly drop into place.

space => ' '

My misssion tonight - get rid of that superflouous end group before the terminator.

But this

space => '_' ' '

http://www.metawrap.com/tests/dot/test10.png

is of course just plain wrong.

This is better.

http://www.metawrap.com/tests/dot/test11.png

But we are left with the problem, How do we know that we are the last character of the lexeme and the last lexeme of a production with no further substitutions - and thus a potential terminator ε. This is non trivial to compute at every symbol. It may be better to gather all these dangling groups as part of garbage collect and make them all point to a single terminator group. I will investigate both paths and see what works best. But now for some sleep.

 

Thursday, November 04, 2004 9:14:29 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]
 Saturday, October 30, 2004

formal grammar is an abstract structure that describes a formal language precisely: i.e., a set of rules that mathematically delineates a (usually infinite) set of finite-length strings over a (usually finite) alphabet.

formal language is a set of finite-length character strings drawn from some finite alphabet.

A context-free grammar (CFG) is a formal grammar in which every production rule is of the form

V → w

where V is a non-terminal symbol and w is a string consisting of terminals and/or non-terminals. The term "context-free" comes from the fact that the non-terminal V can always be replaced by w, regardless of in what context it occurs.

A formal language is context-free if there is a context-free grammar that generates it.

terminals belong to the set of strings of the alphabet of the language. They do not get replaced during the derivation of a sentence.

non-terminals or variables or syntactic-categories are replaced during the derivation. They represent arbitrary strings.

As the parser accepts a sequence of tokens, it determines, based on this information, when the grammar's respective rules are complete and verifies the syntactic correctness of the token sequence. The end result of the process is a derivation which represents the token sequence organized following the rules of the grammar.

A derivation of a string is a sequence of steps where nonterminals are expanded so that the string results.

  S ->
  "if" E "then" S "else" S ->
  "if" "2" "=" "2" "then" S "else" S ->
  "if" "2" "=" "2" "then" "print" E "else" S ->
  "if" "2" "=" "2" "then" "print" "2" "=" "2" "else" S ->
  "if" "2" "=" "2" "then" "print" "2" "=" "2" "else" "print" E ->  
  "if" "2" "=" "2" "then" "print" "2" "=" "2" "else" "print" "2" "=" "2"

A leftmost derivation is one where every step expands the leftmost nonterminal. A rightmost derivation expands the rightmost nonterminal.

A derivation need not be rightmost or leftmost. Many derivations can result in the same syntax tree.

A derivation starting from the start symbol gives a member of the language (after all variables have been eliminated)

Every context gree language has a start symbol.

The start symbol represents all valid sentences in the language

The head (of a production) a single variable for context free langiuages in a similar notation for non-context-free languages, the head may contain more than one symbol. The body (of the production) a sequence of symbols, variables as well as terminals are allowed.

To infer that a sentence is in the language

Recursive Inference:

From body to head - Start from terminals and work up the tree to the start symbol

Derivation:

From head to body - Start from the start symbol and apply productions until you get the sentence

Regular Languages are a subset of Context Free Languages All regular languages are also context free

Definition of  a Context Free Grammar

G = {V,T,P,S}

V = Finite set of variables

nT = Finite set of terminals
 
nP = Set of productions
S = Start symbol

For language

L = { 0n1n | n >= 1}

S -> 0 X 1

X -> ε

X -> 0 X 1

The grammar is

G = ( {S,X} , {0,1} , {(S,0X1),(X,ε),(X,0X1)},S )

The ancient Indian linguist Panini described Sanskrit using a context-free grammar. Recently, it has been suggested that a class of Tamil poetry called Venpa is governed by a context-free grammar.

Saturday, October 30, 2004 11:32:24 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]

Its pretty obvious from this failed experiment that my original plan was the correct one.

http://set-top.net/tests/dot/test7.png

By associating an exact description of the potential state, but then using transitions wrt. the group of states, I've created something that makes logical sense given the underlying invisible group context - but provides a visual distortion. Even though the dot notation ."sy*ntax" | "synth" signifes that we have just parsed the first two characters of syntax, a potental transition via 'n' to ."syntax" | "syn*th" is shown which flips the path to another lexeme being the current focus. It should really show something like "sy*ntax" | "sy*nth"

This group based view is much better and shows the above state as "sy*ntax" | "synth" and "syntax" | "sy*nth"

http://set-top.net/tests/dot/test8.png

Now that I have a basic visualiser, its time to start using it as it was intended, to help develop and debug the rest of the parse table generator.

Looks like I'm creating an extra group at the end before the terminator. I really need to have a sentinel which is a terminator symbol (phi?).


Another good website on parsing theory.

http://www.cs.chalmers.se/Cs/Grundutb/Kurser/komp/2004/lectures/ccc05.html

Saturday, October 30, 2004 7:04:58 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]
 Monday, October 18, 2004

http://documentation.metawrap.com/tests/dot/test3.png (generated automagically)

http://documentation.metawrap.com/tests/dot/test.png (the original target plan)

It lurches into being.

I'm thinking that I may have to write this with MwW3ParserLexerState being the node instead of the current plan for MwW3ParserLexerStateGroup. The issue at hand is that a MwW3ParserLexerStateGroup can consist of multiple MwW3ParserLexerState's, but its each MwW3ParserLexerState that represents a point state within the grammar - and its these that will give me my lovely dot notation description in the digraph nodes.

[UPDATE Monday]

http://documentation.metawrap.com/tests/dot/test4.png  is a tidy version of test3 which is a state group based walk of the parse table.

http://documentation.metawrap.com/tests/dot/test5.png is a strange mutant graph of a state based walk of the parse tables.. it actually represents something very interesting but mostly useless unless you are into quantum computing :)

[UPDATE Tuesday]

http://documentation.metawrap.com/tests/dot/test6.png is state based but with filtering duplicate edges. Compared to test5, the world lines no longer represent the set of all possible paths, but simply the transitions that can be used to build each path.

 

Monday, October 18, 2004 10:07:14 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [1]
 Wednesday, September 22, 2004

I'm working on the garbage collector for my w3 spec parse table generator and from the first time run of its mark pass you can see the structure of the parse table for the production rule

word := "syntax" | "synth"

----------- build completed -------------
/==================== MwW3ParserLexerStateGroup::mark ===============================================
|  /==================== MwW3ParserParsetable::Dump =================================================
|  |  0073 's' 1
|  \-------------------- MwW3ParserParsetable::Dump -------------------------------------------------
|  /==================== MwW3ParserParsetable::mark =================================================
|  |  /==================== MwW3ParserLexerStateGroup::mark =========================================
|  |  |  mark 1
|  |  |  /==================== MwW3ParserLexerStateGroup::mark ======================================
|  |  |  |  /==================== MwW3ParserParsetable::Dump ========================================
|  |  |  |  |  0079 'y' 1
|  |  |  |  \-------------------- MwW3ParserParsetable::Dump ----------------------------------------
|  |  |  |  /==================== MwW3ParserParsetable::mark ========================================
|  |  |  |  |  /==================== MwW3ParserLexerStateGroup::mark ================================
|  |  |  |  |  |  mark 1
|  |  |  |  |  |  /==================== MwW3ParserLexerStateGroup::mark =============================
|  |  |  |  |  |  |  /==================== MwW3ParserParsetable::Dump ===============================
|  |  |  |  |  |  |  |  006E 'n' 1
|  |  |  |  |  |  |  \-------------------- MwW3ParserParsetable::Dump -------------------------------
|  |  |  |  |  |  |  /==================== MwW3ParserParsetable::mark ===============================
|  |  |  |  |  |  |  |  /==================== MwW3ParserLexerStateGroup::mark =======================
|  |  |  |  |  |  |  |  |  mark 1
|  |  |  |  |  |  |  |  |  /==================== MwW3ParserLexerStateGroup::mark ====================
|  |  |  |  |  |  |  |  |  |  /==================== MwW3ParserParsetable::Dump ======================
|  |  |  |  |  |  |  |  |  |  |  stack depth:  1688 bytes
|  |  |  |  |  |  |  |  |  |  |  0074 't' 1
|  |  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserParsetable::Dump ----------------------
|  |  |  |  |  |  |  |  |  |  /==================== MwW3ParserParsetable::mark ======================
|  |  |  |  |  |  |  |  |  |  |  /==================== MwW3ParserLexerStateGroup::mark ==============
|  |  |  |  |  |  |  |  |  |  |  |  mark 1
|  |  |  |  |  |  |  |  |  |  |  |  /==================== MwW3ParserLexerStateGroup::mark ===========
|  |  |  |  |  |  |  |  |  |  |  |  |  /==================== MwW3ParserParsetable::Dump =============
|  |  |  |  |  |  |  |  |  |  |  |  |  |  0061 'a' 0
|  |  |  |  |  |  |  |  |  |  |  |  |  |  0068 'h' 1
|  |  |  |  |  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserParsetable::Dump -------------
|  |  |  |  |  |  |  |  |  |  |  |  |  /==================== MwW3ParserParsetable::mark =============
|  |  |  |  |  |  |  |  |  |  |  |  |  |  /==================== MwW3ParserLexerStateGroup::mark =====
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  mark 0
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  /==================== MwW3ParserLexerStateGroup::mark ==
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  /==================== MwW3ParserParsetable::Dump ====
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  0078 'x' 0
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserParsetable::Dump ----
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserLexerStateGroup::mark --
|  |  |  |  |  |  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserLexerStateGroup::mark -----
|  |  |  |  |  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserParsetable::mark -------------
|  |  |  |  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserLexerStateGroup::mark -----------
|  |  |  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserLexerStateGroup::mark --------------
|  |  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserParsetable::mark ----------------------
|  |  |  |  |  |  |  |  |  \-------------------- MwW3ParserLexerStateGroup::mark --------------------
|  |  |  |  |  |  |  |  \-------------------- MwW3ParserLexerStateGroup::mark -----------------------
|  |  |  |  |  |  |  \-------------------- MwW3ParserParsetable::mark -------------------------------
|  |  |  |  |  |  \-------------------- MwW3ParserLexerStateGroup::mark -----------------------------
|  |  |  |  |  \-------------------- MwW3ParserLexerStateGroup::mark --------------------------------
|  |  |  |  \-------------------- MwW3ParserParsetable::mark ----------------------------------------
|  |  |  \-------------------- MwW3ParserLexerStateGroup::mark --------------------------------------
|  |  \-------------------- MwW3ParserLexerStateGroup::mark -----------------------------------------
|  \-------------------- MwW3ParserParsetable::mark -------------------------------------------------
\-------------------- MwW3ParserLexerStateGroup::mark -----------------------------------------------
----------- garbage collect completed ---------------------------------------------------------------

* note to self - never open blog editor in two windows - bad things happen.. I just lost a document from a few months ago that I was grabbing some HTML from - luckily it was still in the browser history.

Wednesday, September 22, 2004 11:10:46 PM (AUS Eastern Standard Time, UTC+10:00)  #    Comments [0]
 Thursday, Au