As part of the MetaWrap Continuous Integration Project (mw_monitor), I have decided to add XPath to the MetaWrap XML engine. There are currently three methods I could use to process XML in mw_monitor.
DTD Plug-in – Using this method I would define a DTD and then create handler code for each of the elements. This produces a very fast and tightly bound XML processor that can trigger events during parse, import or as a post load processing run.
XML Visitor – Classes can be created that will handle a particular element or attribute combination. The XML tree is then processed, and those objects that match fire events.
DOM – The MwXML Document class was created before DOM became a standard, so its functionality intersects with DOM but is not 1:1 with it. It's a future project to add full DOM-like behavior and member names.
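To make the visitor idea concrete, here is a minimal sketch in Python using only the standard library. It is not the MetaWrap API: the class name BuildVisitor, the walk function, and the sample build-status document are all made up for illustration. The shape is the same as described above, though: each visitor declares which element and attribute combination it cares about, the tree is walked once, and matching visitors fire.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; not a real mw_monitor format.
VISITOR_SAMPLE = """<monitor>
  <project name="core">
    <build name="nightly" status="pass"/>
    <build name="release" status="fail"/>
  </project>
  <build name="orphan"/>
</monitor>"""


class BuildVisitor:
    """Hypothetical visitor: handles <build> elements that carry a status attribute."""

    def __init__(self):
        self.events = []

    def matches(self, element):
        # The element/attribute combination this visitor is bound to.
        return element.tag == "build" and "status" in element.attrib

    def fire(self, element):
        # The "event": here we just record what was seen.
        self.events.append((element.get("name"), element.get("status")))


def walk(root, visitors):
    """Walk the tree in document order and fire every visitor that matches."""
    for element in root.iter():
        for visitor in visitors:
            if visitor.matches(element):
                visitor.fire(element)


def demo_visitor():
    visitor = BuildVisitor()
    walk(ET.fromstring(VISITOR_SAMPLE), [visitor])
    return visitor.events
```

The build element without a status attribute is skipped, because the visitor only binds to the specific combination it declares.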
I could use one of these, but I want to add one more, which is XPath, mainly because I have gotten myself into a very sexy R&D project that is predicated on the creation of a custom XSLT processor. I also want to branch off into XQuery at some point in the future.
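As a rough illustration of what XPath buys over hand-written handlers, here is a small Python sketch using the limited XPath subset built into the standard library's xml.etree.ElementTree. The sample document and the failed_builds function are hypothetical, and this says nothing about how XPath will actually be wired into the MetaWrap engine; it only shows a selection expressed as a single path expression instead of as code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; not a real mw_monitor format.
XPATH_SAMPLE = """<monitor>
  <project name="core">
    <build name="nightly" status="pass"/>
    <build name="release" status="fail"/>
  </project>
  <project name="tools">
    <build name="nightly" status="pass"/>
  </project>
</monitor>"""


def failed_builds(xml_text):
    """Return the names of all <build> elements whose status is 'fail'."""
    root = ET.fromstring(xml_text)
    # ".//build[@status='fail']" selects every descendant <build>
    # element whose status attribute equals "fail".
    return [b.get("name") for b in root.findall(".//build[@status='fail']")]
```

Calling failed_builds(XPATH_SAMPLE) returns the names of the failing builds; changing what gets selected means changing the expression string, not the code.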
The MetaWrap XML Parser owes its origins to work I did with SGML and electronic publishing in the early 90s. The current version of the MetaWrap parser has come a long way since its primitive beginnings.
Fast forward 8 years and the MetaWrap XML Engine was being used for a specific purpose: the automatic classification and transcoding of HTML sites into XML. Before XSLT there was the MetaWrap Pattern Language. This could define a predicate that could be thrown against a website. It would bind to the website in a similar fashion to how an antigen bonds to a chemical. The theory was that those patterns that bound well would be thrown back into a genetic algorithm and interbred. With a carefully orchestrated breeding program, a population of different species of pattern would emerge that could classify an HTML site and extract content from it.
<F:PATTERN name="pgrid_mon">
  <F:PAT name='day' xml noid>
    <F:PAT tag='TR'>
      <F:PAT tag='TD' flex match="contains('*CMT*PACIFIC*RIM*PROGRAMMING*SCHEDULE*')"></F:PAT>
    </F:PAT>
    <F:PAT tag='TR'>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' name='monthyear' xml noid ></F:PAT>
      </F:PAT>
    </F:PAT>
    <F:PAT tag='TR'>
      <F:PAT tag='TD' count=1></F:PAT>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' name='monthday' onparse="mymonthday = trim(tag_contents(document.this))" xml noid></F:PAT>
      </F:PAT>
      <F:PAT tag='TD' count=N></F:PAT>
    </F:PAT>
    <F:PAT tag='TR'></F:PAT>
    <F:PAT tag='TR' name="timeslot" count=N noid xml quiet>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' match='istime' name='time' onparse ="append_contents(' <%/mymonthday%>')" xml noid></F:PAT>
      </F:PAT>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' name='name' xml noid></F:PAT>
      </F:PAT>
      <F:PAT tag='TD' count=N></F:PAT>
    </F:PAT>
    <F:PAT tag='TR' count=n ></F:PAT>
  </F:PAT>
</F:PATTERN>
Example Of First Generation MetaWrap pattern. This pattern was part of a larger set of patterns that could take a cable TV station playlist Excel file saved as HTML and convert it from a set of columns to a linear XML programming schedule schema, which was then processed and added to a database. This was developed in late 1998 and early 1999.
<pattern name="spracipat">
  <pat name="day">
    <pat tag="dt" lasttag="dd" unary="true" text=true name="dayname" />
    <pat name="event" count="N">
      <pat tag="b" lastflex="true" text="true" name="name" xml="true" />
      <pat tag="br" lasttag="hr" lastflex="true" text=true name="content" unary="true"/>
    </pat>
  </pat>
</pattern>
Example Of Second Generation MetaWrap pattern.
Here is an example of the pattern in use. Looks remarkably like XAML huh? 🙂 Not bad for something built in late 1999.
Here is a screenshot of a tool developed by the MetaWrap project for generating those patterns.