Adding XPath to MetaWrap – Part I

As part of the MetaWrap Continuous Integration Project (mw_monitor), I have decided to add XPath to the MetaWrap XML engine. There are currently three methods I could use to process XML in mw_monitor.

DTD Plug-in – Using this method I would define a DTD and then create handler code for each of the elements. This produces a very fast and tightly bound XML processor that can trigger events during parse, during import, or as a post-load processing run.
XML Visitor – Classes can be created that handle a particular element or attribute combination. The XML tree is then processed, and those objects that match fire events.
DOM – The MwXML Document class was created before DOM became a standard, so its functionality intersects with DOM but is not 1:1 with it. It's a future project to add full DOM-like behavior and member names.
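
To make the XML Visitor idea concrete, here is a minimal sketch in Python using the standard library's ElementTree. It is not the MetaWrap API: the names `ElementVisitor`, `walk` and `collect_build_names`, and the `build`/`name` markup, are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


class ElementVisitor:
    """Hypothetical visitor: matches elements with a given tag that carry a given attribute."""

    def __init__(self, tag, attribute, handler):
        self.tag = tag
        self.attribute = attribute
        self.handler = handler

    def matches(self, element):
        return element.tag == self.tag and self.attribute in element.attrib

    def visit(self, element):
        # "Fire the event" by calling the registered handler.
        self.handler(element)


def walk(root, visitors):
    """Process the whole tree in document order and fire every visitor that matches."""
    for element in root.iter():
        for visitor in visitors:
            if visitor.matches(element):
                visitor.visit(element)


def collect_build_names(xml_text):
    """Example use: gather the name attribute of every <build> element that has one."""
    names = []
    root = ET.fromstring(xml_text)
    visitor = ElementVisitor("build", "name", lambda e: names.append(e.get("name")))
    walk(root, [visitor])
    return names
```

The matching rule lives in the visitor object rather than in the tree walk, which is what lets several independent visitors share one pass over the document.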

I could use one of these, but I want to add one more, which is XPath, mainly because I have gotten myself into a very sexy R&D project that requires the creation of a custom XSLT processor. I also want to branch off into XQuery at some point in the future.
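
For a sense of what XPath buys over hand-written handlers, here is a small sketch using the limited XPath subset built into Python's ElementTree. The report markup and the function name `failed_build_ids` are assumptions made up for this example; they are not part of mw_monitor.

```python
import xml.etree.ElementTree as ET

# Hypothetical build report; element and attribute names are illustrative only.
REPORT = """
<monitor>
  <project name="mw_xml">
    <build id="1" status="ok"/>
    <build id="2" status="failed"/>
  </project>
  <project name="mw_monitor">
    <build id="3" status="failed"/>
  </project>
</monitor>
"""


def failed_build_ids(xml_text):
    """Select every <build> at any depth whose status attribute is 'failed'."""
    root = ET.fromstring(xml_text)
    return [build.get("id") for build in root.findall(".//build[@status='failed']")]
```

The whole selection is a single path expression, with no handler classes and no explicit tree walk.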

The MetaWrap XML Parser owes its origins to work I did with SGML and electronic publishing in the early 90’s. The current version of the MetaWrap parser has come a long way since its primitive beginnings.

Fast forward 8 years and the MetaWrap XML Engine was being used for a specific purpose: the automatic classification and transcoding of HTML sites into XML. Before XSLT there was the ‘MetaWrap Pattern Language’. This could define a predicate that could be thrown against a website. It would bind to the website in a similar fashion to how an antigen bonds to a chemical. The theory was that those patterns that bound well would be thrown back into a genetic algorithm and interbred. With a carefully orchestrated breeding program, a population of different species of pattern would emerge that could classify an HTML site and extract content from it.

<F:PATTERN name="pgrid_mon">
  <F:PAT name='day' xml noid>
    <F:PAT tag='TR'>
      <F:PAT tag='TD' flex match="contains('*CMT*PACIFIC*RIM*PROGRAMMING*SCHEDULE*')"></F:PAT>
    </F:PAT>
    <F:PAT tag='TR'>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' name='monthyear' xml noid></F:PAT>
      </F:PAT>
    </F:PAT>
    <F:PAT tag='TR'>
      <F:PAT tag='TD' count=1></F:PAT>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' name='monthday' onparse="mymonthday = trim(tag_contents(document.this))" xml noid></F:PAT>
      </F:PAT>
      <F:PAT tag='TD' count=N></F:PAT>
    </F:PAT>
    <F:PAT tag='TR'></F:PAT>
    <F:PAT tag='TR' name="timeslot" count=N noid xml quiet>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' match='istime' name='time' onparse="append_contents(' <%/mymonthday%>')" xml noid></F:PAT>
      </F:PAT>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' name='name' xml noid></F:PAT>
      </F:PAT>
      <F:PAT tag='TD' count=N></F:PAT>
    </F:PAT>
    <F:PAT tag='TR' count=n></F:PAT>
  </F:PAT>
</F:PATTERN>

Example of a first-generation MetaWrap pattern. This pattern was part of a larger set of patterns that could take a cable TV station playlist, an Excel file saved as HTML, and convert it from a set of columns into a linear XML programming schedule schema, which was then processed and added to a database. It was developed in late 1998 and early 1999.

<pattern name="spracipat">
  <pat name="day">
    <pat tag="dt" lasttag="dd" unary="true" text="true" name="dayname" />
    <pat name="event" count="N">
      <pat tag="b" lastflex="true" text="true" name="name" xml="true" />
      <pat tag="br" lasttag="hr" lastflex="true" text="true" name="content" unary="true" />
    </pat>
  </pat>
</pattern>

Example of a second-generation MetaWrap pattern.

Here is an example of the pattern in use. Looks remarkably like XAML huh? 🙂 Not bad for something built in late 1999.

screenshot1

Here is a screenshot of a tool developed by the MetaWrap project for generating those patterns.

screenshot2

About James McParlane

CTO Massive Interactive. Ex Computer Whiz Kid - Now Grumpy Old Guru.