US Election 2004

The ultimate results of the US elections have such a large, far reaching and disproportionate effect on the rest of the world. Obviously far far too important to leave the voting up to, Americans.

Posted in Rants | Leave a comment

Chocks Away

Gastro again.

Ralph and Huey get one ceremony closer to obtaining shiny new souls.

Posted in Downtime | Leave a comment

Shared Infrustrature for W3 Specification Parsing – Part I

I have defined a new module for MetaWrap which consists of 4 Logical groups of classes. MwParser, MwParserBNF* and MwParserCST and MwParserAST.

All the new W3 Specs seem to be heading towards shared definitions using BNF. (eh XQuery, XPath2.0 and XSLT2.0) . These new modules are a way by which code within MetaWrap that implements the parsing and interpretation these specifications can self generate parsers and Abstract Syntax Trees while sharing a majority of the infrustrature.

By following this path it should be possible to shrink the W3 parsers into a single code base so that parsing code is not replicated as it is at the moment.

My initial approach is to just implement stub classes here so that the data flow can be worked out properly and I can gather the rest of the requirements. My main aim is still to develop XPath as quickly as possible.

MwParserEBNF

MwParserEBNF implements a EBNF Grammar. This consists of a group of classes that can be overridden to specialise the parsing and generation of Abstract Syntax Trees. It is the responsibility of the specialised MwParserEBNF* classes to define what is generated in the final AST. The implementations of the MwParserEBNF* classes contain hints (Grammar Notes) to aid in tokenisation.

A.1.1 Grammar Notes

This section contains general notes on the EBNF productions, which may be helpful in understanding how to create a parser based on this EBNF, how to read the EBNF, and generally call out issues with the syntax. The notes below are referenced from the right side of the production, with the notation: /* gn: <id> */.

grammar-note: parens

A look-ahead of one character is required to distinguish function patterns from a QName followed by a comment. For example: address (: this may be empty :) may be mistaken for a call to a function named “address” unless this lookahead is employed.

grammar-note: lt

Token disambiguation of the overloaded “<” pattern is defined in terms of positional lexical states. The “<” comparison operator can not occur in the same places as a “<” tag open pattern. The “<” comparison operator can only occur in the OPERATOR state and the “<” tag open pattern can only occur in the DEFAULT state. (These states are only a specification tool, and do not mandate an implementation strategy for this same effect.)

grammar-note: leading-lone-slash

The “/” presents an issue because it occurs both in a leading position and an operator position in expressions. Thus, expressions such as “/ * 5” can easily be confused with the path expression “/*”. Therefore, a stand-alone slash, in a leading position, that is followed by an operator, will need to be parenthesized in order to stand alone, as in “(/) * 5”. “5 * /”, on the other hand, is fine.

grammar-note: comments

Expression comments are allowed inside expressions everywhere that ignorable white space is allowed.

grammar-note: xml-version

The general rules for [XML 1.1] vs. [XML 1.0], as described in the A.2 Lexical structure section, should be applied to this production.

Notes from XPath2.0 Specification about grammar hints

 

Definition : Context Free Grammar

http://www.fact-index.com/c/co/context_free_grammar.html

Good simple eplanation of LL vs LR of parsing here.

MwParser implements a parser and is initialized with a BNF. A MwParser can then take input and build a MwCST (Concrete Syntax Tree) or a MwAST (Abstract Syntax Tree). If The BNF classes are specialised then they can generate a specialized and or decorated AST/CST nodes. Parsing implemented by the walking of state tables. (fig 1) generated from the target grammar.

MwParserCST imlements an Concrete Syntax Tree which is the output of MwParser and its construction is performed by specialised MwBNF* classes.

Definition:  Concrete Syntax Tree

A parse tree is a record of the rules (and tokens) used to match some input text.

A parse tree is a labeled tree in which the interior nodes represent nonterminals, leaf nodes represent terminals, and the children of interior nodes represent the replacement of the associated nonterminal

A concrete syntax tree contains all the text that has been parsed, so also all the white space (spaces, tabs, newlines) and code comments. The original text can be completely reproduced from a concrete syntax tree.

MwParserAbstractor translates the Concrete Syntax Tree to an Abstract Syntax Tree

MwParserAST Implements an Abstract Syntax tree.

ASTs are superior because they record the structure of the input not the artifacts of your grammar; consequently they are insensitive to changes in your input grammar, easier to manipulate during translation, and are much smaller.

There are two kinds of parse trees: (1) Abstract Syntax Tree eliminates non-interesting terminals and non-terminals, and (2) Concrete Syntax Tree preserves all symbols and exactly represents the input source. (fig 2)

Definition:  Abstract Syntax Tree

A abstract syntax tree records the structure of the input and is insensitive to the grammar that produced it.

 

It is the role of an AST to provide a clean interface between parser and later phases.

A C++ class based parser that uses a runtime generated BNF tree suffers in speed in that it must search a grammar tree to identify which token is next instead of being able to use a tightly constrained switch operation.

So for the following BNF Grammar

Binary      ::= Zero | One;

Zero        ::= ‘0’;

One         ::= ‘1’;

Lets say the next character is ‘1’

In a grammar tree when calling Binary::parse(char p c) we need to ask both Zero and One if ‘1’ matches them.  For small numbers of terms it should be reasonably fast. But we always have the option of emitting custom C++ classes and implementations of Binary::parse(char p c).

Initial Requirements include…

A generic MwParser should be able to take in a MwBNFGrammar and build an MwAST.

A generic MwParser should be able to take in a MwBNFGrammar and emit code for a specialised MwParser that can build the MwAST.

MwBNFGrammar must be able to intelligently handle constants such that we don’t build dumb AST. For example

number ::= {dgt} [‘.’ {dgt}] [(‘e’|’E’) [‘-‘] {dgt}];


dgt ::= ‘0’|’1’|’2’|’3’|’4’|’5’|’6’|’7’|’8’|’9′;

For input “1.3e-8”

The AST tree for this should be

number(“1.3e-8”)

and not

number(dgt(1),’.’,dgt(3),’e’,’-’, dgt(8))

So we really have two levels of BNF tokenisation. We somehow gave to mark certain assignments as being atomic constants such that if all they consist of is constants, then we are just building a string. The boundary between these levels defines where Lexical processing and parsing begins.

But my initial requirements are simply what can get me parsing simple XPath ASAP using the minimum part of this framework such that it can grow as I require it.

Posted in Parsing Theory, XPath | Leave a comment

Some thoughts about parsing, protocols and XPath

Parsing And Protocols

There are really two types of information packing in protocols and grammars and these boil down to data being length or lexically delimited. A parser implements a certain grammar. The grammar parses a certain pattern of tokens into memory. How do you know when you have completed parsing a token from the grammar? Either the grammar defines an explicit length for the token or it defines a beginning and end signifier (either through a lower order lexical grammar or through a dictionary)

For example

Length Explicit

[5][H][e][l][l][o]

From this we have

Hello

In this grammar we start with the first byte defining the number of bytes to follow. This is my most favorite method of packing because it allows you to allocate the minimum amount of memory on the receiving end.. which in MetaWrap is often a thin client . This runs into problems if you want to pack things in a hierarchy. Eg.

[12][5][H][e][l][l][o][5][W][o][r][l][d]

When preparing this for transmission, you can’t send the first byte until you know the length of every element in the hierarchy.

Lexical

[“][h][e][l][l][o][”]

We have a starting character for a string and a terminating character. We start with the [“]and keep grabing data until we get the [”].We don’t know when we will get the [”].So we can’t prepare the correct buffer length.

This is why I prefer Length Explicit packing and grammars. They scale better. This is why MetaWrap uses this as its primary data packing and transmission format.

One Or Two Passes?

The XPath 1.0 Specification states that

Expressions are parsed by first dividing the character string to be parsed into tokens and then parsing the resulting sequence of tokens. Whitespace can be freely used between tokens. The tokenization process is described in [3.7 Lexical Structure].

This seems to suggest that it should be performed in two passes, but from reading the rest of the specification, I see no reason why it can’t be performed in one pass.

Update:

The XPath 2.0 Specification is a bit more explicit

During the static analysis phase, the XPath expression is parsed into an internal representation called the operation tree (step SQ1 in Figure 1). A parse error is raised as a static error.[err:XP0003] The static context is initialized by the implementation (step SQ2). The static context is used to resolve type names, function names, namespace prefixes and variable names. 

Posted in Parsing Theory, XPath | 2 Comments

Embedding semantic actions into syntax rules

I’m hunting for terms to use in my own algorithm for a combined lexer and parsing engine. Got my algorithm documented now in my handy dandy notebook that I have scribbling in for the last week on the train and in bed.

I’m now stuck with what to call the various actors and objects in my algorithm – so back to scouring the standards to work out what names I can choose that will be accurate and acceptable.

Grammar Basics

http://www.isical.ac.in/~mandar/compilers/parsing.pdf

http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi’concrete+syntax

http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi’abstract+syntax

Bottom Up Basics

http://pdclab.cs.ucdavis.edu/~pandey/Teaching/ECS142/Lects/bottom.pdf

Dot Notation

http://www.eecis.udel.edu/~elloyd/cis672/parser.html

http://es.udmercy.edu/~daimikj/html/CSC541/CSC541LR1.htm

Conflits

http://lambda.uta.edu/cse5317/notes/node20.html

Embedding semantic actions into syntax rules

http://www.scifac.ru.ac.za/compilers/cha11g.htm

Posted in Parsing Theory, XPath | Leave a comment

The Spectre Of Gastro

Little one was up all-night throwing up.. went through every change of sheets, towel and PJ’s in house.

For the rest of us – the timebomb is ticking 😦

Posted in Meta-Narrative | Leave a comment

Grammar Visualiser

Mastered dot and the fine art of creating pretty digraphs. Developed example script which produces a nice image that will aid in the visualisations of grammar tables.

Now just need to make it build the dot script for a given grammar.

Posted in XPath | Leave a comment

One Line Unit Test

Added a nice macro to the metawrap testing framework that allows a simple unit test to be put in one line in C and  C++ with exception handling and failover reporting.

void Tests_MwNativeString()
{
    #define MW_UNIT MwNativeString 
    MwNativeString l_string;
    MW_UNIT_TEST_ASSERT(Constructor1,l_string.emptystring(),Ensure that constructor starts with an empty string“);
    MwNativeString l_string2(“Hello”);
    MW_UNIT_TEST_ASSERT(Constructor2,l_string2.is(“Hello”),Ensure that constructor populates.“);
    #undef MW_UNIT
}

Nice and simple. The individual unit test is not segregated from others – so it would not be wise to rely on the results of testcases.

Good articles on “Good OO Design” are rare. This is one of them.

Posted in Uncategorized | Leave a comment

Things To Do…..

  1. Continuous Integration *started* June 28th 2004  (Branched into W3 EBNF parser)
  2. Tidy Up Website and add Google Ads  *done*  July 9th 2004
  3. Integrate Client Into Shell
  4. Design Client <-> Server Connection Semantics
  5. Contuinue Design Of Clustering Algorithm
  6. Get shell integrated into Apache mod.
  7. Add more instructions to VM
  8. Add plugin-able opcodes to VM
  9. CNI security domains
  10. Universal W3 Parser Generator [UPDATE]
    1. Viewer for Parse Tree

 

 

Posted in Things To Do | Leave a comment

Unix Signals And C++ Exceptions Living Together

Signals and exceptions don’t mix well, and should be considered seperately. While you can define an exception-based “wrapper” for signals, such code is not portable, because C++ does not guarantee that a signal-handling function is able to interact with any other part of a program. Creating a signal handler that throws an exception, for example, is undefined behavior in Standard C++.”

Thankfully this is not so true anymore. I can attest from experience that Solaris and HPUX have allowed you to do this for a while, but GNU/Linux has always been a problem child. Thankfully a lot of effort has recently been put into GCC, allowing you to trigger an exception from within syncronous signals, thus bridging the gap, sadly however the x86 platform does not seem to be one of the architectures fixed, and the gcc-java compiler only seems to be able to do this by rewriting the stack frame from within the signal.

If you are using GCC, you cannot expect to throw exceptions through arbitrary C library functions like raise(). If you don’t do this and want to handle asynchronous exceptions, you need at least -fnon-call-exceptions in all code that might raise synchronous signals and/or -fasynchronous-unwind-tables in all code that might be interrupted by asynchronous signals you want to throw from (obviously you have to make sure that you don’t throw while the asynchronous signal interrupted some function without exception support).

[UPDATE]

So far using -fnon-call-exceptions with gcc3.4 on x86 I have managed to allow a signal to throw an exception, but then it ends up in the designated terminate and atexit code. Whoopdy do! But catching that exception in the context that generated the signal is still out of my reach. This is obviously because the signal is happening on a different stack. If you read the default behavior for a throw() with no catch, then the issue seems clearer.

http://www.comptechdoc.org/os/linux/programming/linux_pgsignals.html

http://gcc.gnu.org/ml/gcc/2004-06/msg00662.html

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/_pluslang_The_try.2c_.catch.2c_.and_throw_Statements.asp

http://www.google.com.au/search?q=cache:o4T5WyWnt0MJ:www.coyotegulch.com/coyotica/cpp/exceptions.html+C%2B%2B+exceptions+signals+linux&hl=en

http://www.gnu.org/software/gcc/java/port-signals.html

http://www.gnu.org/software/libc/manual/html_node/Signal-Handling.html

http://c2.com/cgi/wiki?UnwindingTheStack

[UPDATE] Time to give up on this – even -fsignaling-nans has no effect. I have fixed the major problem in that some of my test cases were loosing the plot every day and logging an infinite number of execptions to a finite harddrive. So its back to the parser.

Output from test_mw_unix

! MAIN P00000000 M00000000 12:26:11.941     00016384 _ + spawning as daemon disabled by default – use command line argument ‘–enable-daemon’ to spawn as a daemon
! MAIN P00000000 M00000000 12:26:11.941 EST 00016384 _ + using local timezone ‘EST’
! MAIN P00000000 M00000000 12:26:11.942 EST 00016384 _ + initialising metawrap engine.
! BOOT P00000000 M00000000 12:26:11.943 EST 00016384 _ + startup
! INIT P00000000 M00000000 12:26:11.943 EST 00016384 _ + starting up low level services
! BOOT P00000000 M00000000 12:26:11.943 EST 00016384 _ + setting random seed
! INIT P00000000 M00000000 12:26:11.945 EST 00016384 _ + root_cluster_key loaded
! INIT P00000000 M00000000 12:26:11.952 EST 00016384 _ + root_cluster_key validated
! INIT P00000000 M00000000 12:26:11.952 EST 00016384 _ + scanning for plugins
+ running test ‘test_mw_unix’
Using unix timer ‘”CLOCK_THREAD_CPUTIME_ID”‘
test ‘mw_unix compiles’ PASSED
! MAIN P00000000 M00000000 12:26:11.045 EST 00016384 _ + TestClass
! MAIN P00000000 M00000000 12:26:11.045 EST 00016384 _ + throw exception.
test ‘MwUnix::signal::throw::1’ PASSED
! MAIN P00000000 M00000000 12:26:11.046 EST 00016384 _ + throw exception.
test ‘MwUnix::signal::throw::2’ PASSED
! MAIN P00000000 M00000000 12:26:11.046 EST 00016384 _ + cause SIGFPE.
! SIGH P00000000 M00000000 12:26:11.047 EST 00016384 _ SIGNAL [Floating point exception(8) occured [pid 20373 tid 16384], recovering]
! SIGH P00000000 M00000000 12:26:11.047 EST 00016384 _ + throwing exception
! MAIN P00000000 M00000000 12:26:11.048 EST 00016384 _ + terminate

This means that the last test failed and caused a complete abort. 

Things to do – add ability to testcase system to register a testcase and report a failure if we do an ABEND.

 

Posted in C#, Things To Do | Leave a comment