Adding XPath to Metawrap – Part I | James McParlane's Blog

As part of the MetaWrap Continuous Integration Project (mw_monitor), I have decided to add XPath to the MetaWrap XML engine. There are currently three methods I could use to process XML in mw_monitor.

DTD Plug-in – Using this method I would define a DTD and then create handler code for each of the elements. This produces a very fast and tightly bound XML processor that can trigger events during parse, during import, or as a post-load processing run.
XML Visitor – Classes can be created that will handle a particular element or attribute combination. The XML tree is then processed, and those objects that match fire events.
DOM – The MwXML Document class was created before DOM became a standard, so its functionality intersects with DOM but is not 1:1 with it. It's a future project to add full DOM-like behavior and member names.
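To make the XML Visitor idea concrete, here is a minimal sketch in Python using the standard library's ElementTree. It is not MetaWrap code: the names ElementVisitor and walk, and the sample build document, are all hypothetical and only illustrate the general shape of "walk the tree, fire the handlers that match".

```python
import xml.etree.ElementTree as ET


class ElementVisitor:
    """Hypothetical visitor: matches one tag, optionally requiring an attribute."""

    def __init__(self, tag, attribute=None):
        self.tag = tag
        self.attribute = attribute
        self.events = []

    def matches(self, element):
        if element.tag != self.tag:
            return False
        return self.attribute is None or self.attribute in element.attrib

    def visit(self, element):
        # A real handler would do work here; this one only records the event.
        self.events.append((element.tag, dict(element.attrib)))


def walk(root, visitors):
    """Walk the tree in document order and fire every visitor that matches."""
    for element in root.iter():
        for visitor in visitors:
            if visitor.matches(element):
                visitor.visit(element)


# Made-up sample document, loosely in the spirit of a build monitor config.
doc = ET.fromstring(
    '<build><step name="compile"/><step/><report name="unit"/></build>'
)
named_steps = ElementVisitor("step", attribute="name")
reports = ElementVisitor("report")
walk(doc, [named_steps, reports])
```

After the walk, named_steps.events holds only the step element that carries a name attribute, and reports.events holds the single report element. The appeal of the approach is that each handler stays small and independent of the tree traversal.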

I could use one of these, but I want to add one more, which is XPath, mainly because I have gotten myself into a very sexy R&D project that is predicated on the creation of a custom XSLT processor. I also want to branch out into XQuery at some point in the future.

The MetaWrap XML Parser owes its origins to work I did with SGML and electronic publishing in the early 90s. The current version of the MetaWrap parser has come a long way since its primitive beginnings.

Fast forward 8 years and the MetaWrap XML Engine was being used for a specific purpose, which was the automatic classification and transcoding of HTML sites into XML. Before XSLT there was the MetaWrap Pattern Language. This could define a predicate that could be thrown against a website. It would bind to the website in a similar fashion to how an antigen bonds to a chemical. The theory was that those patterns that bound well would be thrown back into a genetic algorithm and interbred. With a carefully orchestrated breeding program, a population of different species of pattern would emerge that could classify an HTML site and extract content from it.

<F:PATTERN name="pgrid_mon">
  <F:PAT name='day' xml noid>
    <F:PAT tag='TR'>
      <F:PAT tag='TD' flex match="contains('*CMT*PACIFIC*RIM*PROGRAMMING*SCHEDULE*')"></F:PAT>
    </F:PAT>
    <F:PAT tag='TR'>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' name='monthyear' xml noid ></F:PAT>
      </F:PAT>
    </F:PAT>
    <F:PAT tag='TR'>
      <F:PAT tag='TD' count=1></F:PAT>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' name='monthday' onparse="mymonthday = trim(tag_contents(document.this))" xml noid></F:PAT>
      </F:PAT>
      <F:PAT tag='TD' count=N></F:PAT>
    </F:PAT>
    <F:PAT tag='TR'></F:PAT>
    <F:PAT tag='TR' name="timeslot" count=N noid xml quiet>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' match='istime' name='time' onparse ="append_contents(' <%/mymonthday%>')" xml noid></F:PAT>
      </F:PAT>
      <F:PAT tag='TD'>
        <F:PAT tag='FONT' name='name' xml noid></F:PAT>
      </F:PAT>
      <F:PAT tag='TD' count=N></F:PAT>
    </F:PAT>
    <F:PAT tag='TR' count=n ></F:PAT>
  </F:PAT>
</F:PATTERN>

Example of a first-generation MetaWrap pattern. This pattern was part of a larger set of patterns that could take a cable TV station playlist Excel file saved as HTML and convert it from a set of columns to a linear XML programming schedule schema, which was then processed and added to a database. This was developed in late 1998 and early 1999.

Example Of Second Generation MetaWrap pattern.

Here is an example of the pattern in use. Looks remarkably like XAML huh? 🙂 Not bad for something built in late 1999.

Here is a screenshot of a tool developed by the MetaWrap project for generating those patterns.
