Sunday, May 24, 2009

Weather Data on the Web

In preparation for our cruise up to Scotland this summer, I'm setting up some SMS services so I can get weather reports on board, provided we're in mobile phone range. This is based on the two-way service rented from Clickatell. I recently rewrote a PHP/MySQL router which routes MO calls to an application based on the first word of the message, and returns the reply, if any, to the originator. Much simpler in XQuery because the routing table is now just a simple XML configuration file and the XQuery code is much cleaner.

So far I've written services to get the UK shipping forecast, the UK inshore waters forecast and the latest weather conditions at weather buoys. Each has presented different challenges to acquire the raw data, both technical and legal. In the domain of weather information at least we seem a very long way from an integrated, easy to use, web of data.

First the inshore waters forecast. The only publicly available format is this web page. The Met Office does provide a few RSS feeds but none for shipping forecasts. This web page looks technically promising for analysis even if I'm unsure of the legal status of this act. I'd like to know how the Met Office is funded currently but failed to discover from a Google quick search. I'd like to know the extent to which this is 'Our Data' and despite the Met Office legal notices and Freedom of Information pages, I'm none the wiser really. I console myself with the fact that I'm only playing with no intention to produce a commercial service in competition with the Met Offices own services.

The Inshore waters page looks promising, with sections for each area split into meaningful headings. However on closer inspection the page suffers from that increasingly common bane of the scrapper, a complex mixture of data and JavaScript. The page appearance is the result of JavaScript processing of bland text. Here is the raw forecast for my bit of the coast:

Lands End to St Davids Head including the Bristol Channel

24 hour forecast:
Variable 3 or 4.
Slight becoming moderate later in southwest.
Moderate or good.

Variable 3 or 4, becoming west or northwest 4 or 5, occasionally 6 later.
Slight or moderate.
Thundery rain or showers.
Moderate or good.

Well now. Firstly, this is not all the data in the displayed section; the time span and the strong winds warning(if any) are elsewhere in the HTML. The nice sections are not there: instead the four parts of the forecast separated by fullstops - so the last sentence 'Moderate or good' is the Visibility. Second, the limits of the areas are identified by place identifiers in the maplet, but these do not appear in the text, and only the full area name can be used to identify. Of course, the ardent scraper can cope with this. I've been forced to add my own area ids however to support the SMS interface:

Lands End to St Davids Head

But it's horrible, unstable and makes me wonder if this design is a form of obfuscation. I suppose if they wanted to, they could switch randomly between different HTML/JavaScript layers generating the same appearance and then scrappers would be really stuffed - thankfully that seems not be be the case.

Next stop, the shipping forecast. In this case the forecast text is not on the page at all but in a generated JavaScript file which defines JavaScript arrays and their values. In an way that's simpler because I just have to fetch the JavaScript source and parse it. This application and its design is described in detail in the XQuery Wikibook.

Over in the States, their freedom of information creates a very different data climate, and NOAA provides a wonderful array of RSS and XML feeds. However, reusing even this data is not without its problems. One set of feeds I want to tap into are the data from weather buoys around the world. Many are operated by NOAA and others by local Met services or commercial operations. The UK coverage shows the locations and identifiers for UK station and there is an RSS feed of the current conditions at a buoy. The nearest up-weather buoy to Bristol is 62303, Pembroke Buoy. Well this is certainly easily accessible and valid RSS - but ... all the useful data is CDATA text in the description element:

May 24, 2009 0700 UTC

Location: 51.603N 5.1W

Wind Direction: SE (140°)

Wind Speed: 5 knots

Significant Wave Height: 3 ft

Atmospheric Pressure: 30.14 in (1020.8 mb)

Pressure Tendency: +0.03 in (+1.0 mb)

Air Temperature: 51°F (10.8°C)

Dew Point: 49°F (9.3°C)

Water Temperature: 52°F (11.1°C)

So to separate this into meaningful data with semantic markup requires string parsing to extract the data, conversion to standard formats (the date for example) and markup in some XML Schema. Again XQuery can do the analysis. Here is the Pembroke Buoy current data. The data is augmented with some additional derived data, the wind strength on the Beaufort scale.

Of course it would be better to use existing XML schemas or RDF vocabularies than invent my own. However there doesn't seem to be anything in use which fits the bill There is some work on XML schemas for research data interchange but nothing for simple observations and forecasts that I could find. Perhaps the most comprehensive set of element names on which to build is to be found in the NOAA observation XML feeds such as this for Central Park, New York. This is a prime example of how data could be provided and its delightfully simple use makes it a good candidate for student exercises. In this format, both formatted strings and atomic values are provided. In contrast to the use of an attribute for unit, the element name is a concatenation of measurement name and unit, which seems somewhat problematic to me. The data has an attached XML schema but curiously the data is not valid according to this schema. Instead of omitting missing or undefined values, as the schema requires, the text NA is used instead. I emailed the office responsible for this data and was informed that they decided to do this because they got too many enquiries about missing data so they added the NA to make it clear it was really missing! There certainly seems to be a genuine problem there for users who don't read the schema, but my follow-up question as to why, in that case, they didn't change the schema went unanswered.

Weather data represents a case where the domain is generally understood and of interest, large quantities of data are being generated and the data is of critical importance to many users, making it an ideal case study in the web of data for my students. Despite widespread discussion of XML and RDF standards, practical data mashups must rely on hand-coded scrapping, home-build vocabularies and data extracted on dodgy legal grounds. Surely we can do better.

Wednesday, May 13, 2009

Twitter Radio

Thought I'd try to get my XQuery Twitter Radio application going to listen to the tweets from the Mark Logic conference. It's only a simple script, requires Opera with Voice enabled and uses http-equiv="refresh" to refresh the page. It only works if the window is active, so it rather limits my use of the computer - just need another to run the radio I guess. If I wasn't marking, I'd write an AJAX-based version. XHTML+Voice is quite tricky to get right however.

Twitter Radio on #mluc09

I rather like the idea of following the Mark Logic conference with an eXist-based mashup - perhaps we should organise an eXist conference in Bristol - with my part-time status next academic year, perhaps I should put some effort into an event in Bristol.

Wednesday, May 06, 2009

Matching sequences in XQuery

Collation is a core algorithm in processing sequences. In XQuery, the straight-forward expression of the algorithm is as a recursive function:

declare function local:merge($a, $b as item()*)
as item()* {
if (empty($a) and empty($b))
then ()
else if (empty ($b) or $a[1] lt $b[1])
then ($a[1], local:merge(subsequence($a, 2), $b))
else if (empty($a) or $a[1] gt $b[1])
then ($b[1], local:merge($a, subsequence($b,2)))
else (: matched :)
($a[1], $b[1],

Coincidently, Dan McCreary was writing an article in the XQuery wikibook on matching sequences using iteration over one sequence and indexing into the second. The task is to locate missing items. Collation is one approach to this task, albeit requiring that the sequences are in order.

Here is a test suite comparing three methods of sequence comparison.

I also did some volume tests with two sequences differing by a single, central value. Here are the tests on a sequence of 500 items. In summary, the timings are :

* Iteration with lookup: 6984 ms - not repeatable - average is 2600
* Iteration with qualified expression: 1399 ms
* Recursive collate: 166 ms

The collate result is surprising and rather impressive. Well done eXist!

Friday, May 01, 2009

More XQuery performance tests

I noticed this morning that Dan had added an alternative implementation to an article in the XQuery Wikibook on matching words against a list. It got me wondering which implementation was preferable. I wrote a few tests and was surprised at the result. My initial implementation based on element comparisons was five times slower than comparing with a sequence of atoms, and Dan's suggestion of using a qualified expression was worse still.

Here is the test run and the Wikibook article.

Monday, April 27, 2009

XQuery Unit Tests

I had a fright last week - Wolfgang asked for a copy of the test harness I'd used to evaluate different implementations of a lookup table. This is code I wrote some time ago, tinkered with, good enough for our internal use but ... well pretty bad code.

I have to confess here that as a lone XQuery programmer, my code doesn't get the level of critique it needs. The Wikibook has been disappointing in that regard: I've published thousands of lines of code there and there has not been a single criticism or improvement posted. Typos in the descriptions are occasionally corrected by helpful souls, graffiti erased by others but as a forum for honing coding skills - forget it. In my task as project leader on our FOLD project (now coming to an end), I see and review lots of my students' code as well as the code Dan McCreary contributes to the WikiBook so I do quite a bit of reviewing. However I am only too conscious of the lacunae in my XQuery knowledge which perhaps through over kindness or because everyone is so busy, remain for too long. I'm envious of my agile friends who have been pair-programming for years. Perhaps there should be a site to match up lonely programmers for occasional pairing.

Anyway the test suite got a bit of work on it one day last week and its looking a bit better.

Here is a sample test script . As a test script to test the test runner it has the unusual property that some failed tests are good since failing is what's being tested. Here is it running.
Here is another, used to test the lookup implementations and one to test the geodesy functions.

Version 1 of the test runner executed tests and generated a report in parallel. A set of tests may have a common set of modules to import, prefix and suffix code. For each test, modules are dynamically loaded, the code concatenated and then evaled inside a catch.

let $extendedCode := concat($test/../prolog,$test/code,$test/../epilog)
let $output := util:catch("*",util:eval($extendedCode),Compile error)

The output is compared with a number of expected values. Comparison may be string-based, element-based, substring present or absent. (I also need to add numerical comparison with defined tolerance.) A test must meet all expectations to pass.

To get a summary of the results requires either running the sequence of tests recursively or constructing the test results as a constructed element and then analysing the results. Recursion would be suitable for a simple sum of passes and fails, but it closely binds the analysis to the testing. An intermediate document decouples testing from reporting, thus providing greater flexibility in the analysis but requiring temporary documents.

So version 2 constructed a sequence of test results, and then merged these results with the original test set to generate the report. Collating two sequences is a common idiom which in functional languages must either recurse over both, or iterate over one sequence whilst indexing into the other, or iterate over a extracted common key and index into both. The reporting is currently done in XQuery but it should be possible to use XSLT. Either the collating would need to be done before the XSLT step or XSLT would have the collating task. Not a happy situation.

So last week in comes version 3. Now the step which executes the tests augments each test with new attributes (pass, timing) and elements (output) and similarly each expectation with the results of its evaluation so that one single, enhanced document is produced, with the same schema as the original [the augmented data has to be optional anyway since some tests may be tagged to be ignored]. Transformation of the complete document to HTML is then straightforward either in line,in a pipeline with XQuery or XSLT. The same transformation can be run on the un-executed test set.

Augmenting the test set is slightly harder in XQuery than it would be in XSLT. For example, after executing each test, the augmented test is recreated with:

element test {
attribute pass {$pass},
attribute timems {$timems},
$test/(* except expected),
element output {$output},

This approach means that, once again, handling the construction of temporary documents is a key requirement for XQUery applications.

But I'm still not quite happy with version 3. As so often I'm struggling with namespaces in the test scripts - now where's my pair?

Sunday, April 19, 2009

Implementing a table look-up in XQuery

Handling temporary XML fragments in the eXist XML db has improved markedly in version 1.3. I have been looking again at an example of processing MusicXML documents which I first wrote up in the XQuery wikibook.

The code requires a translation from the note name (A, B) to the midi note value for each note. The pitch of a note is defined by a structure like:


One approach is to use an if -then -else construct:

declare function local:MidiNote($thispitch as element(pitch) ) as xs:integer
let $step := $thispitch/step
let $alter :=
if (empty($thispitch/alter)) then 0
else xs:integer($thispitch/alter)
let $octave := xs:integer($thispitch/octave)
let $pitchstep :=
if ($step = "C") then 0
else if ($step = "D") then 2
else if ($step = "E") then 4
else if ($step = "F") then 5
else if ($step = "G") then 7
else if ($step = "A") then 9
else if ($step = "B") then 11
else 0
return 12 * ($octave + 1) + $pitchstep + $alter
} ;

but this cries out for a table lookup as a sequence:

declare variable $noteStep :=
<note name="C" step="0"/>,
<note name="D" step="2"/>,
<note name="E" step="4"/>,
<note name="F" step="5"/>,
<note name="G" step="7"/>,
<note name="A" step="9"/>,
<note name="B" step="11"/>

declare function local:MidiNote($thispitch as element(pitch) ) as xs:integer
let $alter := xs:integer(($thispitch/alter,0)[1])
let $octave := xs:integer($thispitch/octave)
let $pitchstep := xs:integer($noteStep[@name = $thispitch/step]/@step)
return 12 * ($octave + 1) + $pitchstep + $alter
} ;

or an XML element:

declare variable $noteStep :=
<note name="C" step="0"/>
<note name="D" step="2"/>
<note name="E" step="4"/>
<note name="F" step="5"/>
<note name="G" step="7"/>
<note name="A" step="9"/>
<note name="B" step="11"/>

declare function local:MidiNote($thispitch as element(pitch) ) as xs:integer
let $alter := xs:integer(($thispitch/alter,0)[1])
let $octave := xs:integer($thispitch/octave)
let $pitchstep := xs:integer($noteStep/note[@name = $thispitch/step]/@step)
return 12 * ($octave + 1) + $pitchstep + $alter
} ;

We could also store the table in the database since it is constant.

eXist does some optimisation of XPath expressions, but it does not factor out the invariant expression $thispitch/step
in the XPath predicate.

I wrote a test suite to time these various implementations. Typically this shows that factoring the sub-expression reduces the execution time by 25%. However, even with this optimisation, the structure lookup is disappointingly slow. It is about 50% slower than the if/then expression when stored on disk, and 100% slower when in memory.

This aspect of XQuery performance is important if XQuery is to be used for general application development since data structures such as indexed and associative arrays have to be represented as sequences of atomic values or elements. This performance is not really surprising and there may be more performance to be gained by indexing the data base element.

Wednesday, April 08, 2009

XQuery module for Geodesy

I wrote my first attempt at Mercator ~ Latitude/Longitude conversion functions about 2 years ago when working in a case study for SPA2007. Part of this was documented in the XQuery Wikibook article on UK bus stops and Ordnance Survey coordinates. At the time I did not appreciate why my coordinates were just a bit off but fudged the difference. Last month I used the same functions to map pedal cycle data but as gardens and roofs appeared to be much more dangerous than roads, I thought I'd better try harder and learnt about Helmert Transformations.

My latest attempt is now available in the XQuery Examples Google Project and the Wikibook article has been revised to use this module. The formulae come mainly from the OS Guide to coordinate systems. PHP Code on Barry Hunter's site was also useful.The test suite for this module is still being added to. I have struggled with rounding, needed to get positions with sensible resolutions and for testing and I'm not yet happy. Some tests need visual inspection and there is a problem with heights.

The module depends on the eXist math module, a prime candidate for cross-implementation standardization by the EXQuery initiative. In the latest version (v1-3) of the module, the math:atan2() function has parameters in the correct order (y,x) but older releases had these parameters reversed,as in v1-2.

The design of the module uses elements with attributes in the same namespace as the module to define compound structures such as LatLongs, Ellipsoids and Projections. These are defined in an associated schema. Compile-time checking in eXist is limited to checking the element name since eXist is not schema-aware although full schema-awareness would be of benefit in this module.

Suggestions for additions to this module are most welcome, as of course is any review of the code.

Sunday, April 05, 2009

Dashboards and Widgets in XQuery

Jim Fuller's recent article on Dashboards in XQuery makes a very good case for using XQuery for mashups generally. Jim's dashboard application reminded me of work I had been doing with my students last term on a configurable web page containing widgets to display NOAA weather data, RSS feeds, Google Maps and their own choice of data source. For this we used PHP with Simple XML, but to demonstrate the power of XQuery, I needed to show them the same application built on XQuery. It also seemed to me that the business dashboard would benefit from a design which split widget code from company-specific data.

The basic architecture of the approach here is to create a set of generalised widgets as XQuery functions and to define the specifics of the widget in a configuration file.

Here are a couple of configurations : jim.xml which is based on Jim's Dashboard example

and dsa.xml which is based on our course weather display.

The second example has a page refresh rate set, but I'm working on making the refresh rate widget- rather than page- specific using AJAX.

In the demo interface code, each widget links to its XQuery source code and hence to the code of called functions. Widget executions are timed and there is a link to the configuration file itself. Here is a basic main script:

import module namespace widgets = "" at "widgets.xqm";

declare option exist:serialize "method=xhtml media-type=text/html omit-xml-declaration=no indent=yes
doctype-public=-//W3C//DTD XHTML 1.0 Transitional//EN

let $configuri := request:get-parameter("config",())
let $config := doc($configuri)/config
<link rel="stylesheet" href="{$config/@css}" type="text/css"/>
<h1>{$config/title/text()} </h1>
for $section in $config/section
let $component := $section/*
<h3>{$section/@title/string()} </h3>

Run it

There are a couple of ideas used in the code.

Each widget is defined by its own section in the configuration file. As a simple example, to configure the widget to get the date and time:

<section column="c2" title="Time in Bristol">
<format>EE dd/MM HH:mm</format>

The main script processes the configuration file, and each widget is rendered by calling a function of the same name. The code for this dispatching currently uses the eXist function util:eval(), within a util:catch call, to implement late binding:

declare function widgets:render($section as element(section)) as element(div) {
(: dispatch section to the matching function :)
let $component := $section/*[1]
let $widget := local-name($component)
let $function := concat("widgets:", $widget, "($component)")
{util:catch( "*", util:eval($function), <span>Missing or bad widget.</span>) }

A safer alternative would be to use typeswitch :

declare function widgets:render2($section as element(section)) as element(div) {
let $component := $section/*[1]
typeswitch ($component)
case element(datatime) return widgets:datetime($component)
case element(SQL) return widgets:SQL($component)
case element(monitor) return widgets:monitor($component)
default return
<div> Missing widget {local-name($component)} </div>

but this needs updating every time a new widget is added to the module.

To help with processing table data from different sources such as SQL, Excel and Google Spreadsheets, support functions transform these to a common XML structure. This standard form is then transformed to its final HTML with the XSLT included in the configuration.

For example here is the configuration for an SQL widget, reading data from a (readonly) MySQL database:

<section column="c2" title="Overpaid executives">
<query>select ename,sal from emp where sal > 2000 order by sal desc</query>
<xsl:stylesheet xmlns:xsl="" version="2.0">
<xsl:template match="table">
<table border="1">
<xsl:apply-templates select="row"/>
<xsl:template match="row">
<xsl:value-of select="ename"/>
<xsl:value-of select="sal"/>

and its widget code:

declare function widgets:SQL($component as element(SQL)) as element (div) {
let $connection := sql:get-connection("com.mysql.jdbc.Driver", $component/database, $component/username, $component/password)
let $result := sql:execute($connection, $component/query, false())
let $table:= widgets:sql-result-to-table($result)
return transform:transform($table,$component/xsl/*,())

widgets:sql-result-to-table converts the sql result to a common internal XML format:

declare function widgets:sql-result-to-table($result as element(sql:result)) as element(table) {
{for $row in $result/sql:row
{for $col in $row/sql:field
element {$col/@name}

This simplifies the XSLT transformation to the final HTML div element.

There's more to do on this framework, including adding support for access to passworded Google SS, Calendar and RSS feeds for which Jim's code will be handy. The framework also needs to support widgets from other modules and I'm not sure about the intermediate XML format and I'd like to use this more widely to layer the weather widgets as well. Getting the right split between widget code and configuration data is always tricky of course. However this framework seems quite useful and I intend to revisit some of the WikiBook examples to restructure as widgets.

I'll post the full code to the XQuery Wikibook shortly.

Wednesday, April 01, 2009

Parameterised MS Word Documents with XQuery

It's coming round to exam time again at UWE, Bristol and as usual I've been struggling to get mine written. The XQuery-based FOLD application (which supports staff and students in our School) generates exam front pages contain exam details such as module code and title, examination date, length and time as HTML which had to be copied (poorly) into MS Word. This wasn't very satisfactory and it would be better to generate a Word document with the completed front page and sample pages with headers and footers. I'd put this off as it seemed too complicated. The Word XML format wordml is one route but it looked daunting to generate for scratch.

However for this application I only need to make some small edits to a base document. The most obvious approach was to 'parameterise' the Word document with place-holders. Unique place-holders can be edited in with Word before the document is saved as XML. Fields which are not editable in MS Word, such as the author and timestamps can be parameterised by editing the wordml directly. To instantiate a new Word document, the place-holders in the wordml are replaced with their values.

Treating this as string replacement is easier than editing the XML directly, even if this was possible in XQuery. The XQuery script reads the wordml document, serializes the XML as a string, replaces the placeholders in the string with their values and then converts back to XML for output.

Although this is not a typical task for XQuery and would be written in a similar way in other scripting languages, it is possible in XQuery with the help of a pair of functions which should be part of a common XQuery function library. In eXist these are util:serialize() to convert from XML to a string and the inverse, util:parse().

The function needs to replace multiple strings so we use a an XML element to define the name/value pairs:

let $moduleCode := request:get-parameter("moduleCode",())
let $replacement :=
<replace string="F_ModuleCode" value="{$moduleCode}"/>
<replace string="F_Title" value="{$title}"/>
<replace string="F_LastAuthor" value="FOLD"/>

and a recursive function to do the replacements:

declare function local:replace($string,$replacements) {
if (empty($replacements))
then $string
let $replace := $replacements[1]
let $rstring := replace($string,string($replace/@string),string($replace/@value))

After gathering the parameter values and formatting a replacement element, the new document is generated by:

let $template := doc("/db/FOLD/doc/examtemplate.xml")
let $stemplate := util:serialize($template,"method=xml")
let $mtemplate := local:replace($stemplate,$replaceStrings/*)

Here the generated wordml is displayed in the browser, from where it can be saved, then loaded into Word. I found out the directive at the front of the wordml:

<?mso-application progid="Word.Document"?>

is used by the Windows OS to associate the file with MS Word so the media type is just the standard text/xml. However it is helpful to define a suitable default file name using a function in eXist's HTTP response module, the pair to the request module used to access URL parameters:

let $dummy := response:set-header('Content-Disposition', concat('attachment;filename=',concat("Exam_",$moduleCode,".xml") ))
let $dummy := response:set-header('Content-Type','application/msword')

The document could also be saved directly to the database, or all documents generated eagerly ready for use.

This approach feels like a bit of a hack, but it took only an hour to develop and is a major improvement on the previous approach. Changes to the base document will need re-parameterisation, but that seems a small overhead for slowly changing standard documents. XQuery forces a recursive approach to the string replacements where an iterative updating approach would avoid copying the (rather large) string, but performance is fast enough for this task, indeed in eXist string handling is very fast. My MS Office users will be happier but I still need to think about the Unix and Mac OS users.

Review of IBM developerWorks article by Brian Carey

I've just come across an article published by IBM's developerWorks "Use XQuery for the presentation layer" by Brian Carey. This illustrates the value of storing complex data in XML form and using XQuery to select and transform to HTML. Whilst the main message is well-presented, the implementation, below layers of Java, is over-complicated in a couple of ways.

Brian makes the curious observation under the heading Using XQuery prevents "cheating" that "You cannot write business logic in XQuery because XQuery focuses exclusively on querying and transformation". This rather ignores the fact that a significant number of applications are built soley on XQuery as the server-side language. The consequence is that a simple Web application retains a complex and unnnecessary Java middle tier acting as Controller in an MVC architecture.

Brian's web application provides a browsing interface to a collection of products using AJAX to update the page and XQuery to select producats and transfrom to an HTML table.

Implemented in XQuery on the open source eXist XML database as an example, we need only use the HTTP interface functions provided to couple the HTTP requests directly to the XQuery script. For eXist the additions would be:

declare variable $docName as xs:string := "lures.xml";
declare variable $configuration as xs:string := request:get-parameter("configuration",());
declare variable $usage as xs:string:= request:get-parameter("usage",());

It might be objected that this script binds the resources and interfaces too closely to the script. Indeed the only benefit of the Java controller layer is this de-coupling. We can achieve the same effect in XQuery with a local configuration file to aid portability and testing:


and adding these lines to the XQuery script:

declare variable $config := /ConfigurationFile;
declare variable $docName := $config/lureFileName;

I've implemented Brian's code with eXist on the XQuery Wikibook server and the relevant scripts are here:

I changing only the script address in the HTML code and correcting an error in the AJAX function where request.readyState was miss-typed. (took me a while to track that down!). Middle layer all gone. Storage of the whole application in eXist would be a normal deployment but was not possible without editing because the HTML page is not valid XHTML .

One impediment to the use of XQuery as an application development language is that functions to extend the XPath function with functionality such as HTTP interfacing are implementation-dependent, limiting portability. A new inititive EXQuery seeks to remedy this problem by developing a cross-platform library.

One other feature of Brian's implementation is the structure of the XML file. The start of the file looks like

<minnows brand="Yohuri" style="deep" size="3">
<minnow color="midnight blue">

But since the data is intended to be searched for usage(e.g. casting) and configuration (e.g. minnow) this leads to XQuery code like

if ($configuration = 'minnow' and $usage = 'casting') then
for $minnows in doc($docName)//casting/minnows

else if ($configuration = 'minnow' and $usage = 'trolling') then


and because the child structures are actually all the same, this leads to unmaintableable and repetative code.

A fix is possible in XQuery using local-name to filter using the node names themselves. A better approach would be to flatten the whole file, changing the representation of the configuration and usage concepts from elements to attributes:

<lure brand="Yohuri" style="deep" size="3" color="midnight blue" usage="casting" configuration="minnow">


and the query becomes

for $lure in doc($docName)//lure[@configuration = $configuration][@usage=$usage]
return ..

In summary, this implementation might be a "cheat" but cheating by reducing complexity and lines of code seems a good thing to do.

Monday, February 23, 2009

Data Normalization

My lastest teaching program is for model inference from un-normalized data. This had its inception in a PHP tool written some years ago. The new version uses my ER XML schema to integrate the output with the other data modelling tools.

Normalisation is usually taught on data base courses via the abstract concepts of first, second third normal and higher normal forms. In my introductory module I just want to get over the basic idea of reduction in data duplication through the actual factorisation of a first normal-form table (atomic values) into a set of related third normal-form tables.

Here is the tool, written of course in XQuery:

I get the students to take a nominated data set, factorise it, generate the SQL table declaration and INSERT statements, load them into a MySQL database and then reconstruct the original table using a select statement joining all the tables. This allows the student to check that the factorisation is loss-less but of course it does not check that it is optimal. At present the tool allows the student to explore different possibilities and create any factorisation they like.

The state of the factorisation is the current dataset URI and the current factorisation, defined by an ER model. Currently these are passed between client and server in the URL. This limits the size of the model and i guess I should change to POST but the interface behavior will not be as clean (the back button works as a simple Undo) and I can't have simple generated links on buttons. I guess I need help in interface programming here.

For the record, the code is 600 lines split into 27 functions and using two functions in the larger er module to transform the XML model to DDL and an ER diagram. Code by request until I find a suitable home.

Friday, February 06, 2009

Data Modelling Tutor

The SQL tutor is now in use and seems to be finding favour with students and other tutors. There is a long list of things to add, like the ability to discuss an exercise but the course moves on and now I want to apply the same ideas to the teaching of data modelling. Students often find this rather difficult.

For the past few years we have used an excellent case tool called QSEE, developed by Mark Dixon at Leeds Metropolitan University. We have mainly used this multi-diagram tool for the ER diagrams. QSEE supports conceptual ER models and handles foreign keys, link tables and week entities when generating the SQL DDL. I have a running battle with some other tutors over the use of conceptual ER diagrams versus Relational ER Diagrams, complete with foreign keys and link tables. In my multi-paradigm teaching, conceptual models which treat the later as artefacts of a relational data model makes more sense. Of course I'd like to see a few improvements but sadly development of this tool seems to have ceased. Pity that it hasn't been open sourced.

My teaching emphases the difference between a model and its various representations as diagrams, as text and as code. Since we have already studied XML, it is natural to think of representing the conceptual data model as an XML document and writing the transformations in XSLT or XQuery. Having used Graphviz for a number of years the XML can be transformed to the dot notation to create diagrams in different diagraming conventions. Moreover the goal of providing an interactive data modeling tutor seems more easily realised by processing textual descriptions.

So this weekend, snowed in from work on the boat, I've been working on this tutor and data modelling tool. The really hard part has been to write the model comparator so that differences between a student model and the 'master' model can be detected and explained. This has to take account of variations in the names the student may use as well as differences in order of definition, so a straight XML diff isn't enough. What I have now is not quite right but it will have to be good enough if I want to get this tutor out to students this week.

So here is the link to the index of worksheets so far written:

The transformations currently supported include Chen diagrams, ER diagrams with or without attributes, SQL DDL, SQL DDL with foreign keys and many-many resolution, and a rather clunky English text representation.

One feature which was unplanned and just emerged as an obvious addition, was the ability to provide a default model template so the student could solve initial problems by filling in the blanks rather than starting with a blank page.

There is still a lot to do, but I'm pleased to have got the prototype of the ground - a long-held idea finally coming to fruition - made possible by the power of XQuery and the eXist XML database, for which I give thanks to Wolfgang and and the guys every day.

Friday, January 23, 2009

Some years ago I wrote an SQL workbook which was used on a couple of courses, but although it had an interactive interface so that that a student could test their SQL statements against the example database, the results were not checked. I planned to create an interactive site which would present tasks to the student, accept the SQL statement input, execute the statement and compare the result with that of the model answer. Since I was teaching PHP/MySQL at the time, in good dog-fooding tradition, I started on an implementation using this platform, but it got sticky and stalled. Then I discovered XQuery and the other day, wrote an XQuery/ eXist implementation.

[I'm really supposed to be marking but I find my most creative streak when marking's about - the root perhaps of the love/hate relationship I have with my job.]

The relative ease with which this version was created well illustrates the power of the XQuery/ Native XML database development approach. This application lies in the sweet spot for this technology and here is why I think that is:

XML handles Composition

Each worksheet is represented by an XML document. The document describes properties of the worksheet - database used, title, tags and each of the steps in the worksheet. It is too weak to say that steps are part of the worksheet, they -are- the worksheet. A normalised relational implementation requires one table for the worksheet properties , another for the steps and a foreign key to link these two types together. This artifical separation into master and child records complicates the model and its processing. A symptom of the need for a composition can be found in the naming problem - what do you call the master table - worksheet perhaps? But that's not right - in domain terms a worksheet is the master record AND its children as a single entity. worksheetBody? - yeech.

XML handles order
The Steps in a worksheet are ordered. To represent this in SQL requires an additional sequence number . Then editing to insert and delete steps requires re-numbering. In XML, order is implicit in the document order of the steps.

XML handles heterogeneity
Steps in a worksheet are different types. Some are simple explanations, some are demonstrations of SQL constructs, many are exercises and others have yet to be designed. A relational approach would either coerce all types into a common structure, with lots of null fields, or use multiple tables, one for each type, and a super-type table. [Object-relational databases support a more transparent implementation but who uses those?]. In XML, different types of Step can be freely interleaved in the step sequence.

XML handles multi-valued attributes

As with most of my applications these days, I use tags to support searching and classifying resources. In a normalised relational database, I should break these out into a separate table with a foreign key, but would probably fudge the problem by putting all tags into an arbitrarily concatenated field. In XML tags are simply repeated elements with no temptation to step outside the data model.

XML supports documents

Worksheets are standalone entities, which are authored, edited, deployed, removed as units. In a relational approach, all worksheets would be stored in common database tables, and the domain concept of worksheet as a unit is lost. This is particularly a problem in development: in the SQL tutor, worksheets may either be located in the XML database but may also be accessed from anywhere on the web - it makes little difference to the script providing the interactive interface. So new worksheets can be tested before installation with ease. They can also be exchanged, encouraging reuse and standardisation.

XML Documents are human-editable
My PHP/MySQL prototype stalled on the need to develop an interactive editor for worksheets in additon to the interactive tutor. In XML this component is not required, provided of course that the worksheet author is happy to author XML. If not, there are generic editors available to assist.

XQuery handles XML natively.
Inputs - the script, the SQL table returned by the SQL interface and the XHTML conent of some step properties - and the output - XHTML pages - are all XML structures. A language which handles XML natively does not have perform the re-structuring required in , say, PHP before it can be processed [PHP- XML interfaces have made this somewhat less difficult].

XQuery handles comples processing.
XQuery is much more than the XML equivalent of SQL .The SQL tutor has the task of testing whether two results are functionally the same. Two tables are the same if the column headings are different, but the data the same, or if columns are in different orders, or if rows in an SELECT statement with no ORDER BY clause are in a different order. The sequence and element structures of the XML data model combined with the expressiveness of XQuery make this computation straightforward to code.

A Native XML database supports development
I struggle with the design of the Worksheet schema. Minor additions occur all the time as my ideas develop, but since eXist is schema-independent, I can make these changes bottom-up before I need to formalise them as a schema. In a relational database I'm forced to work top-down, schema first. Moreover I feel safe that if I have a major redesign, an XSLT transformation will do the necessary schema evolution of existing Worksheets.

This case study demonstrates that a relational database is a poor technology for implementing a tutorial about SQL. This asymetry is not the case for XQuery, and the next project is to modify the tutor to do the same for XQuery expressions.

The tutor goes into use with 200 students next week although the worksheets themselves could do with more revision. Since the tutor logs all input, it will be interesting to be able to see what kinds of error are made, and then be able to adjust the hinting accordingly. Wish me luck!