Thursday, January 10, 2013

Gone to Posterous

I should have written this a long time ago, but I had so much trouble putting code examples on this Blogger blog that I switched over to Posterous.

Later - well, now Posterous has gone too, so I've set up my own blog using XQuery and eXist-db.

Even later - I've moved to using Tumblr.

Sunday, May 24, 2009

Weather Data on the Web

In preparation for our cruise up to Scotland this summer, I'm setting up some SMS services so I can get weather reports on board, provided we're in mobile phone range. This is based on the two-way service rented from Clickatell. I recently rewrote a PHP/MySQL router which routes MO (mobile-originated) messages to an application based on the first word of the message and returns the reply, if any, to the originator. It's much simpler in XQuery, because the routing table is now just a simple XML configuration file and the XQuery code is much cleaner.
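The router boils down to very little. Here's a sketch of the idea - the keywords and service names are illustrative, not the actual configuration:

xquery version "1.0";

(: illustrative routing table - the keywords and service names are made up :)
declare variable $routes :=
  <routes>
    <route keyword="ship"    service="shipping-forecast"/>
    <route keyword="inshore" service="inshore-forecast"/>
    <route keyword="buoy"    service="buoy-report"/>
  </routes>;

(: pick the service for an incoming message, based on its first word :)
declare function local:route($message as xs:string) as xs:string? {
  let $keyword := lower-case(tokenize(normalize-space($message), "\s+")[1])
  return $routes/route[@keyword = $keyword]/@service/string()
};

local:route("BUOY 62303")    (: returns "buoy-report" :)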

So far I've written services to get the UK shipping forecast, the UK inshore waters forecast and the latest weather conditions at weather buoys. Each has presented different challenges to acquire the raw data, both technical and legal. In the domain of weather information at least we seem a very long way from an integrated, easy to use, web of data.

First, the inshore waters forecast. The only publicly available format is this web page. The Met Office does provide a few RSS feeds, but none for shipping forecasts. The page looks technically promising for analysis, even if I'm unsure of the legal status of scraping it. I'd like to know how the Met Office is currently funded, but I failed to discover that from a quick Google search. I'd like to know the extent to which this is 'Our Data', but despite the Met Office legal notices and Freedom of Information pages, I'm none the wiser really. I console myself with the fact that I'm only playing, with no intention of producing a commercial service in competition with the Met Office's own services.

The inshore waters page looks promising, with sections for each area split into meaningful headings. On closer inspection, however, the page suffers from that increasingly common bane of the scraper: a complex mixture of data and JavaScript. The page's appearance is the result of JavaScript processing of bland text. Here is the raw forecast for my bit of the coast:

Lands End to St Davids Head including the Bristol Channel

24 hour forecast:
Variable 3 or 4.
Slight becoming moderate later in southwest.
Fair.
Moderate or good.



Outlook: 
Variable 3 or 4, becoming west or northwest 4 or 5, occasionally 6 later.
Slight or moderate.
Thundery rain or showers.
Moderate or good.


Well now. Firstly, this is not all the data in the displayed section; the time span and the strong winds warning (if any) are elsewhere in the HTML. The nice sections are not there either: instead, the four parts of the forecast are separated by full stops, so the last sentence, 'Moderate or good', is the visibility. Secondly, the limits of the areas are identified by place identifiers in the maplet, but these do not appear in the text, so only the full area name can be used to identify an area. Of course, the ardent scraper can cope with this. I've been forced to add my own area ids, however, to support the SMS interface:

Lands End to St Davids Head
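Something along these lines, with an id of my own invention; the splitting of the forecast text into its four labelled parts is likewise just a sketch of the approach:

(: an illustrative area entry - the id is mine, not a Met Office identifier :)
declare variable $area :=
  <area id="8" name="Lands End to St Davids Head including the Bristol Channel"/>;

(: split the raw forecast text into its four parts, which arrive in a fixed
   order (wind, sea state, weather, visibility) separated by full stops :)
declare function local:parse-forecast($text as xs:string) as element(forecast) {
  let $parts := tokenize($text, "\.\s*")[. ne ""]
  return
    <forecast>
      <wind>{$parts[1]}</wind>
      <sea-state>{$parts[2]}</sea-state>
      <weather>{$parts[3]}</weather>
      <visibility>{$parts[4]}</visibility>
    </forecast>
};

local:parse-forecast("Variable 3 or 4. Slight becoming moderate later in southwest. Fair. Moderate or good.")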

But it's horrible and unstable, and it makes me wonder if this design is a form of obfuscation. I suppose if they wanted to, they could switch randomly between different HTML/JavaScript layers generating the same appearance, and then scrapers would be really stuffed - thankfully that seems not to be the case.

Next stop, the shipping forecast. In this case the forecast text is not on the page at all but in a generated JavaScript file which defines JavaScript arrays and their values. In a way that's simpler, because I just have to fetch the JavaScript source and parse it. This application and its design are described in detail in the XQuery Wikibook.
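Once the JavaScript source has been fetched (eXist's httpclient module handles that part), extracting the values is plain string-chopping. A rough sketch, with the shape of the assignments only approximated:

(: pull the quoted strings out of JavaScript assignments such as
      forecast[3] = "Viking North Utsire ...";
   the variable names and layout of the real script are only approximated here :)
declare function local:js-strings($js as xs:string) as xs:string* {
  for $line in tokenize($js, "\r?\n")
  where contains($line, "=") and contains($line, '"')
  return substring-before(substring-after($line, '"'), '"')
};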

Over in the States, their freedom of information creates a very different data climate, and NOAA provides a wonderful array of RSS and XML feeds. However, reusing even this data is not without its problems. One set of feeds I want to tap into is the data from weather buoys around the world. Many are operated by NOAA and others by local met services or commercial operations. The UK coverage shows the locations and identifiers for UK stations, and there is an RSS feed of the current conditions at a buoy. The nearest up-weather buoy to Bristol is 62303, the Pembroke Buoy. Well, this is certainly easily accessible and valid RSS - but ... all the useful data is CDATA text in the description element:

May 24, 2009 0700 UTC

Location: 51.603N 5.1W

Wind Direction: SE (140°)

Wind Speed: 5 knots

Significant Wave Height: 3 ft

Atmospheric Pressure: 30.14 in (1020.8 mb)

Pressure Tendency: +0.03 in (+1.0 mb)

Air Temperature: 51°F (10.8°C)

Dew Point: 49°F (9.3°C)

Water Temperature: 52°F (11.1°C)


So separating this into meaningful data with semantic markup requires string parsing to extract the values, conversion to standard formats (the date, for example) and markup in some XML schema. Again, XQuery can do the analysis. Here is the Pembroke Buoy current data. The data is augmented with some additional derived data: the wind strength on the Beaufort scale.
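The extraction itself is just line-by-line splitting on the colon. A minimal sketch, with element names of my own devising:

(: turn lines such as "Wind Speed: 5 knots" into simple elements;
   the element names derived from the labels are my own invention :)
declare function local:parse-description($desc as xs:string) as element()* {
  for $line in tokenize($desc, "\r?\n")[contains(., ":")]
  let $label := normalize-space(substring-before($line, ":"))
  let $value := normalize-space(substring-after($line, ":"))
  return element {lower-case(replace($label, "\s+", "-"))} {$value}
};

local:parse-description("Wind Speed: 5 knots")   (: <wind-speed>5 knots</wind-speed> :)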

Of course it would be better to use existing XML schemas or RDF vocabularies than to invent my own. However, there doesn't seem to be anything in use which fits the bill. There is some work on XML schemas for research data interchange, but nothing for simple observations and forecasts that I could find. Perhaps the most comprehensive set of element names on which to build is to be found in the NOAA observation XML feeds, such as this one for Central Park, New York. This is a prime example of how data could be provided, and its delightfully simple use makes it a good candidate for student exercises. In this format, both formatted strings and atomic values are provided. In contrast to the use of an attribute for the unit, the element name is a concatenation of measurement name and unit, which seems somewhat problematic to me. The data has an attached XML schema, but curiously the data is not valid according to this schema. Instead of omitting missing or undefined values, as the schema requires, the text NA is used. I emailed the office responsible for this data and was informed that they decided to do this because they got too many enquiries about missing data, so they added the NA to make it clear it was really missing! There certainly seems to be a genuine problem there for users who don't read the schema, but my follow-up question as to why, in that case, they didn't change the schema went unanswered.
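Back to the unit-in-the-name point, to illustrate what I mean (element names approximated, not copied from the feed):

<!-- NOAA style: the unit is baked into the element name (names approximated) -->
<temp_f>51.0</temp_f>
<temp_c>10.8</temp_c>

<!-- the alternative: one element per measurement, with the unit as an attribute -->
<temperature unit="F">51.0</temperature>
<temperature unit="C">10.8</temperature>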

Weather data represents a case where the domain is generally understood and of interest, large quantities of data are being generated, and the data is of critical importance to many users, making it an ideal case study in the web of data for my students. Despite widespread discussion of XML and RDF standards, practical data mashups must rely on hand-coded scraping, home-built vocabularies and data extracted on dodgy legal grounds. Surely we can do better.

Wednesday, May 13, 2009

Twitter Radio

Thought I'd try to get my XQuery Twitter Radio application going to listen to the tweets from the Mark Logic conference. It's only a simple script: it requires Opera with Voice enabled and uses http-equiv="refresh" to refresh the page. It only works if the window is active, which rather limits my use of the computer - I just need another one to run the radio, I guess. If I weren't marking, I'd write an AJAX-based version. XHTML+Voice is quite tricky to get right, however.

Twitter Radio on #mluc09

I rather like the idea of following the Mark Logic conference with an eXist-based mashup. Perhaps we should organise an eXist conference in Bristol - with my part-time status next academic year, perhaps I should put some effort into an event here.

Wednesday, May 06, 2009

Matching sequences in XQuery

Collation is a core algorithm in processing sequences. In XQuery, the straightforward expression of the algorithm is as a recursive function:



declare function local:merge($a as item()*, $b as item()*)
as item()* {
  if (empty($a) and empty($b))
  then ()
  else if (empty($b) or $a[1] lt $b[1])
  then ($a[1], local:merge(subsequence($a, 2), $b))     (: take the next item from $a :)
  else if (empty($a) or $a[1] gt $b[1])
  then ($b[1], local:merge($a, subsequence($b, 2)))     (: take the next item from $b :)
  else (: matched - take both and advance both sequences :)
    ($a[1], $b[1],
     local:merge(subsequence($a, 2), subsequence($b, 2)))
};



Coincidentally, Dan McCreary was writing an article in the XQuery Wikibook on matching sequences using iteration over one sequence and indexing into the second. The task is to locate missing items. Collation is one approach to this task, albeit one requiring that the sequences are in order.
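For comparison, the non-recursive approaches look roughly like this (a sketch, not Dan's actual code):

(: items of $a missing from $b: iteration over $a with a lookup into $b :)
declare function local:missing($a as item()*, $b as item()*) as item()* {
  for $item in $a
  where not($item = $b)
  return $item
};

(: the same test written with a quantified expression :)
declare function local:missing2($a as item()*, $b as item()*) as item()* {
  for $item in $a
  where not(some $other in $b satisfies $other eq $item)
  return $item
};

local:missing((1, 2, 3, 4), (1, 2, 4))   (: returns 3 :)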

Here is a test suite comparing three methods of sequence comparison.

I also did some volume tests with two sequences differing by a single, central value. Here are the tests on a sequence of 500 items. In summary, the timings are:

* Iteration with lookup: 6984 ms (not repeatable - the average is around 2600 ms)
* Iteration with quantified expression: 1399 ms
* Recursive collate: 166 ms

The collate result is surprising and rather impressive. Well done eXist!

Friday, May 01, 2009

More XQuery performance tests

I noticed this morning that Dan had added an alternative implementation to an article in the XQuery Wikibook on matching words against a list. It got me wondering which implementation was preferable, so I wrote a few tests and was surprised at the result. My initial implementation, based on element comparisons, was five times slower than comparing with a sequence of atoms, and Dan's suggestion of using a quantified expression was worse still.
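Roughly, the three variants look like this (illustrative data - the real code is in the Wikibook article):

(: an illustrative stop-word list held two ways :)
declare variable $stop-elements :=
  (<word>the</word>, <word>and</word>, <word>of</word>);
declare variable $stop-atoms := ("the", "and", "of");

(: 1. compare against a sequence of elements :)
declare function local:match-elements($w as xs:string*) as xs:string* {
  $w[. = $stop-elements]
};

(: 2. compare against a sequence of atoms :)
declare function local:match-atoms($w as xs:string*) as xs:string* {
  $w[. = $stop-atoms]
};

(: 3. quantified expression :)
declare function local:match-quantified($w as xs:string*) as xs:string* {
  for $x in $w
  where some $s in $stop-atoms satisfies $s eq $x
  return $x
};

local:match-atoms(("the", "quick", "brown", "fox"))   (: returns "the" :)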

Here is the test run and the Wikibook article.

Monday, April 27, 2009

XQuery Unit Tests

I had a fright last week - Wolfgang asked for a copy of the test harness I'd used to evaluate different implementations of a lookup table. This is code I wrote some time ago and have tinkered with since; it's good enough for our internal use but ... well, pretty bad code.

I have to confess here that, as a lone XQuery programmer, my code doesn't get the level of critique it needs. The Wikibook has been disappointing in that regard: I've published thousands of lines of code there and not a single criticism or improvement has been posted. Typos in the descriptions are occasionally corrected by helpful souls, and graffiti is erased by others, but as a forum for honing coding skills - forget it. In my role as project leader on our FOLD project (now coming to an end), I see and review a lot of my students' code, as well as the code Dan McCreary contributes to the Wikibook, so I do quite a bit of reviewing. However, I am only too conscious of the lacunae in my own XQuery knowledge which, perhaps through over-kindness or because everyone is so busy, remain for too long. I'm envious of my agile friends who have been pair-programming for years. Perhaps there should be a site to match up lonely programmers for occasional pairing.

Anyway, the test suite got a bit of work one day last week and it's looking a bit better.

Here is a sample test script. As a test script to test the test runner, it has the unusual property that some failed tests are good, since failing is what's being tested. Here it is running.
Here is another, used to test the lookup implementations, and one to test the geodesy functions.

Version 1 of the test runner executed tests and generated a report in parallel. A set of tests may have a common set of modules to import, plus prefix and suffix code. For each test, the modules are dynamically loaded, the code is concatenated and then evaled inside a catch:


let $extendedCode := concat($test/../prolog, $test/code, $test/../epilog)
let $output := util:catch("*", util:eval($extendedCode), "Compile error")


The output is compared with a number of expected values. Comparison may be string-based, element-based, or based on a substring being present or absent. (I also need to add numerical comparison with a defined tolerance.) A test must meet all its expectations to pass.

To get a summary of the results requires either running the sequence of tests recursively or constructing the test results as a constructed element and then analysing those results. Recursion would be suitable for a simple sum of passes and fails, but it closely binds the analysis to the testing. An intermediate document decouples testing from reporting, thus providing greater flexibility in the analysis, but it requires temporary documents.

So version 2 constructed a sequence of test results and then merged these results with the original test set to generate the report. Collating two sequences is a common idiom which, in functional languages, must either recurse over both, iterate over one sequence whilst indexing into the other, or iterate over an extracted common key and index into both. The reporting is currently done in XQuery, but it should be possible to use XSLT. Either the collating would need to be done before the XSLT step or XSLT would have to take on the collating task. Not a happy situation.
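The version 2 merge is the 'iterate and index' form - roughly this, with illustrative names and data rather than my real schema:

(: iterate over the tests and index into the results by a shared id :)
declare variable $testSet :=
  <tests>
    <test id="t1"/>
    <test id="t2"/>
  </tests>;

declare variable $results :=
  <results>
    <result id="t1" pass="true"  timems="12"/>
    <result id="t2" pass="false" timems="7"/>
  </results>;

for $test in $testSet/test
let $result := $results/result[@id = $test/@id]
return
  <report id="{$test/@id}" pass="{$result/@pass}" timems="{$result/@timems}"/>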

So last week in came version 3. Now the step which executes the tests augments each test with new attributes (pass, timing) and elements (output), and similarly augments each expectation with the results of its evaluation, so that one single, enhanced document is produced with the same schema as the original (the augmented data has to be optional anyway, since some tests may be tagged to be ignored). Transformation of the complete document to HTML is then straightforward, either inline, in a pipeline with XQuery, or with XSLT. The same transformation can be run on the un-executed test set.

Augmenting the test set is slightly harder in XQuery than it would be in XSLT. For example, after executing each test, the augmented test is recreated with:


element test {
  $test/@*,
  attribute pass {$pass},
  attribute timems {$timems},
  $test/(* except expected),
  element output {$output},
  $expectedResults
}


This approach means that, once again, handling the construction of temporary documents is a key requirement for XQuery applications.

But I'm still not quite happy with version 3. As so often, I'm struggling with namespaces in the test scripts - now where's my pair?

Sunday, April 19, 2009

Implementing a table look-up in XQuery

Handling temporary XML fragments in the eXist XML db has improved markedly in version 1.3. I have been looking again at an example of processing MusicXML documents which I first wrote up in the XQuery Wikibook.

The code requires a translation from the note name (A, B, C ...) to the MIDI note value for each note. The pitch of a note is defined by a structure like:

<pitch>
<step>C</step>
<alter>1</alter>
<octave>3</octave>
</pitch>



One approach is to use an if-then-else construct:

declare function local:MidiNote($thispitch as element(pitch)) as xs:integer
{
  let $step := $thispitch/step
  let $alter :=
    if (empty($thispitch/alter)) then 0
    else xs:integer($thispitch/alter)
  let $octave := xs:integer($thispitch/octave)
  let $pitchstep :=
    if ($step = "C") then 0
    else if ($step = "D") then 2
    else if ($step = "E") then 4
    else if ($step = "F") then 5
    else if ($step = "G") then 7
    else if ($step = "A") then 9
    else if ($step = "B") then 11
    else 0
  return 12 * ($octave + 1) + $pitchstep + $alter
};

but this cries out for a table lookup as a sequence:

declare variable $noteStep :=
(
<note name="C" step="0"/>,
<note name="D" step="2"/>,
<note name="E" step="4"/>,
<note name="F" step="5"/>,
<note name="G" step="7"/>,
<note name="A" step="9"/>,
<note name="B" step="11"/>
);

declare function local:MidiNote($thispitch as element(pitch)) as xs:integer
{
  let $alter := xs:integer(($thispitch/alter, 0)[1])
  let $octave := xs:integer($thispitch/octave)
  let $pitchstep := xs:integer($noteStep[@name = $thispitch/step]/@step)
  return 12 * ($octave + 1) + $pitchstep + $alter
};


or an XML element:

declare variable $noteStep :=
<steps>
<note name="C" step="0"/>
<note name="D" step="2"/>
<note name="E" step="4"/>
<note name="F" step="5"/>
<note name="G" step="7"/>
<note name="A" step="9"/>
<note name="B" step="11"/>
</steps>;

declare function local:MidiNote($thispitch as element(pitch)) as xs:integer
{
  let $alter := xs:integer(($thispitch/alter, 0)[1])
  let $octave := xs:integer($thispitch/octave)
  let $pitchstep := xs:integer($noteStep/note[@name = $thispitch/step]/@step)
  return 12 * ($octave + 1) + $pitchstep + $alter
};


We could also store the table in the database since it is constant.
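That would look something like this, with $noteStep declared as above (the collection, which must already exist, and the resource name are my own choices):

(: store the table once so later queries can read it from the database :)
let $path := xmldb:store("/db/music", "notesteps.xml", $noteStep)
return doc($path)/steps/note[@name = "A"]/@step/string()   (: "9" :)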

eXist does some optimisation of XPath expressions, but it does not factor the invariant sub-expression $thispitch/step out of the XPath predicate.
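Factoring it out by hand looks something like this - the same function, with the step value bound to a variable before the predicate:

declare function local:MidiNote($thispitch as element(pitch)) as xs:integer
{
  let $alter := xs:integer(($thispitch/alter, 0)[1])
  let $octave := xs:integer($thispitch/octave)
  (: bind the invariant value once rather than re-evaluating it in the predicate :)
  let $step := $thispitch/step/string()
  let $pitchstep := xs:integer($noteStep/note[@name = $step]/@step)
  return 12 * ($octave + 1) + $pitchstep + $alter
};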

I wrote a test suite to time these various implementations. Typically this shows that factoring out the sub-expression reduces the execution time by about 25%. However, even with this optimisation, the structure lookup is disappointingly slow: it is about 50% slower than the if/then expression when the table is stored on disk, and 100% slower when it is in memory.

This aspect of XQuery performance is important if XQuery is to be used for general application development, since data structures such as indexed and associative arrays have to be represented as sequences of atomic values or elements. This performance is not really surprising, and there may be more performance to be gained by indexing the database element.