Wednesday, November 28, 2007

Pipelines

Looks like its going to be the year of 'Pipes' in my teaching. I came across Yahoo pipes earlier this year and decided to restructure my second-level module, Data Schemas and Applications, around the 'web as database', before diving into local database technologies like Relational and XML databases. Yahoo Pipes provides an approachable starting point and has the added benefit of being a technology which none of my students, from a diverse collections of programmes, have yet encountered. Also, those slinky pipes and typed ports are -so- seductive.

However, the visual editor soon becomes awkward to use and thus leads naturally into using a scripting language instead. I started with XSLT expecting to move quickly into XQuery, but I've been surprised to find how much can be done, especially with XSLT2.0. For a transformation engine, I've set up a service (using Saxon8 via XQuery on eXist-db). This has allowed us to implement most of the yahoo pipes we'd written and also searches over an XML file with a single script containing a form and the search results. More..

Although many of the steps in a Yahoo pipeline can be handled within a single XSLT script, some of the processing I want to demonstrate involves processing HTML pages which are not XHTML, so I needed a tidy service too, and to be able to pipeline them together.

So... I need a pipeline language, a way of visualizing the pipeline and an engine to execute the pipeline. Naturally I started to write my own, based mainly on XPL which Eric Bruchez introduced me to at XML Prague. A tentative first step using an XQuery script is described in an XQuery WikiBook article.

Of course this is fine as play but I need to join the real world of pipeline languages. I suppose the main contenders are :
NetKernel looks theoretically and practically very interesting and what's more, 1060research are a locally-based spin-off from HP labs next door.

Saturday, November 24, 2007

Topological sorting in XQuery

In the course of my attempts to implement an XQuery engine for the XML pipeline language XPL, I realized that I would need to sort the processes so that each process has its inputs available, i.e. a topological sort.

Given a sequence of nodes and references with the Relax-NG schema:


element node {
attribute id { xs:string },
element ref {
attribute id { xs:string }
}*
}+


The post-condition after sorting can be defined as:


declare function local:topological-sorted($nodes) as xs:boolean {
every $n in $nodes satisfies
every $id in $n/ref/@id
satisfies $id = $n/preceding::node/@id
};


and the recursive function to order the nodes is:



declare function local:topological-sort($unordered, $ordered ) {
if (empty($unordered))
then $ordered
else
let $nodes := $unordered [ every $id in ref/@id satisfies $id = $ordered/@id]
return local:topological-sort( $unordered except $nodes, ($ordered, $nodes ))
};




Sweet, eh? Even if this implementation is not the most efficient, it has the advantage of being obviously correct.

See (and improve!) the XQuery WikiBook article.

Sunday, November 18, 2007

the Prime Sieve in XQuery

I guess I've been putting my spare effort over the past few months into the Wikibook on XQuery, so the blog has been very neglected.

I've learnt a lot in writing the example code in the Wikibook. The other day I bumped in the Euler Project and started on the first few the problems. Problem 3 is about primes so I had to write a prime number generator. The Sieve of Eratosthenes in XQuery is so simple and obvious:


declare function local:sieve($primes,$nums) {
if (exists($nums))
then
let $prime := $nums[1]
return local:sieve(($primes,$prime), $nums[. mod $prime != 0])
else $primes
};

local:sieve((),2 to 1000)


The list of primes starts off empty, the list of numbers starts off with the integers. Each recursive call of local:sieve takes the first of the remaining integers as a new prime and reduces the list of integers to those not divisible by the prime. When the list of integers is exhausted, the list of primes is returned.

Lovely but sadly not very practical for the size of numbers in the problem.

Discussion and execution are in the Wikibook.