Wednesday, November 28, 2007

Pipelines

Looks like it's going to be the year of 'Pipes' in my teaching. I came across Yahoo Pipes earlier this year and decided to restructure my second-level module, Data Schemas and Applications, around the 'web as database', before diving into local database technologies like Relational and XML databases. Yahoo Pipes provides an approachable starting point and has the added benefit of being a technology which none of my students, from a diverse collection of programmes, have yet encountered. Also, those slinky pipes and typed ports are -so- seductive.

However, the visual editor soon becomes awkward to use and thus leads naturally into using a scripting language instead. I started with XSLT expecting to move quickly into XQuery, but I've been surprised by how much can be done, especially with XSLT 2.0. For a transformation engine, I've set up a service (using Saxon8 via XQuery on eXist-db). This has allowed us to implement most of the Yahoo pipes we'd written, and also to search over an XML file with a single script containing both a form and the search results.
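
To give a flavour of that single-script approach, here is a rough sketch of the kind of XQuery that could run on eXist; the file /db/books.xml and its title elements are placeholders for illustration, not the actual module data:

xquery version "1.0";
(: sketch only: search form and results served from one script;
   /db/books.xml and its title elements are placeholder data :)
declare namespace request = "http://exist-db.org/xquery/request";

let $q := request:get-parameter("q", "")
return
    <html>
        <body>
            <form method="get">
                <input name="q" value="{$q}"/>
                <input type="submit" value="Search"/>
            </form>
            <ul>
            {
                if ($q ne "") then
                    for $title in doc("/db/books.xml")//title[contains(., $q)]
                    return <li>{string($title)}</li>
                else ()
            }
            </ul>
        </body>
    </html>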

Although many of the steps in a Yahoo pipeline can be handled within a single XSLT script, some of the processing I want to demonstrate involves processing HTML pages which are not XHTML, so I needed a tidy service too, and to be able to pipeline them together.
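
As a sketch of how the two stages might be chained, assuming a hypothetical tidy proxy at http://localhost:8080/tidy which returns well-formed XHTML for whatever URL it is given (both that proxy address and the source page below are assumptions, not a real deployment):

xquery version "1.0";
(: sketch only: the tidy proxy URL and the source page are assumptions :)
declare namespace html = "http://www.w3.org/1999/xhtml";

let $page := "http://www.example.com/untidy.html"
let $tidied := doc(concat("http://localhost:8080/tidy?url=", encode-for-uri($page)))
return
    <links>
    {
        for $a in $tidied//html:a[@href]
        return <link href="{$a/@href}">{normalize-space($a)}</link>
    }
    </links>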

So... I need a pipeline language, a way of visualizing the pipeline, and an engine to execute it. Naturally I started to write my own, based mainly on XPL, which Eric Bruchez introduced me to at XML Prague. A tentative first step using an XQuery script is described in an XQuery WikiBook article.
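
The WikiBook article has the details, but the core idea is small enough to sketch here: the pipeline is just an XML document listing steps, and an XQuery function runs them in order, feeding each step's output into the next. The step vocabulary and stylesheet paths below are invented for illustration; transform:transform is eXist's Saxon-backed transform function:

xquery version "1.0";
(: sketch only: step names and stylesheet paths are placeholders :)
declare namespace transform = "http://exist-db.org/xquery/transform";

declare variable $pipeline :=
    <pipeline>
        <fetch href="http://www.example.com/feed.xml"/>
        <xslt href="/db/pipes/filter.xsl"/>
        <xslt href="/db/pipes/render.xsl"/>
    </pipeline>;

(: run each step in turn, passing the output of one as the input of the next :)
declare function local:run($input as node()?, $steps as element()*) as node()? {
    if (empty($steps)) then $input
    else
        let $step := $steps[1]
        let $output :=
            if ($step/self::fetch) then doc($step/@href)
            else transform:transform($input, doc($step/@href), ())
        return local:run($output, subsequence($steps, 2))
};

local:run((), $pipeline/*)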

Of course this is fine as play, but I need to join the real world of pipeline languages. I suppose the main contenders are:
NetKernel looks theoretically and practically very interesting and, what's more, 1060research are a locally-based spin-off from HP Labs next door.

1 comment:

M. David Peterson said...

Hey Chris,

"Although many of the steps in a Yahoo pipeline can be handled within a single XSLT script, some of the processing I want to demonstrate involves processing HTML pages which are not XHTML, so I needed a tidy service too, and to be able to pipeline them together."

So here's a fun one,

http://personplacething.info/service/proxy/return-xml-from-html/?uri=http://www.xml.com//html:html/html:body//html:p[contains(.,'M.%20David%20Peterson')]

Live dynamic searching of the (X)HTML web for pipelining into whatever you might want. This uses an XSLT 2.0 extension function written in C# that accesses an SgmlReader with the URI specified in the URI query string param, and then returns the result of the XPath expression specified at the end of the URI, using // as the delimiter between the URI and the XPath expression (the second / represents the root of the document).

Code is @ http://nuxleus.com/dev/browser/trunk/nuxleus/Web/Development/transform/controller/proxy/base.xslt which is driven by http://nuxleus.com/dev/browser/trunk/nuxleus/Web/Development/service/proxy/return-xml-from-html/service.op

Too bad you're not using .NET! :D ;-) Of course this same thing could be replicated using a servlet and John Cowan's TagSoup HTML > XHTML processor.