The Wallace Line: May 2009

In preparation for our cruise up to Scotland this summer, I'm setting up some SMS services so I can get weather reports on board, provided we're in mobile phone range. This is based on the two-way service rented from Clickatell. I recently rewrote a PHP/MySQL router which routes MO calls to an application based on the first word of the message, and returns the reply, if any, to the originator. Much simpler in XQuery because the routing table is now just a simple XML configuration file and the XQuery code is much cleaner.

So far I've written services to get the UK shipping forecast, the UK inshore waters forecast and the latest weather conditions at weather buoys. Each has presented different challenges to acquire the raw data, both technical and legal. In the domain of weather information at least we seem a very long way from an integrated, easy to use, web of data.

First the inshore waters forecast. The only publicly available format is this web page. The Met Office does provide a few RSS feeds but none for shipping forecasts. This web page looks technically promising for analysis even if I'm unsure of the legal status of this act. I'd like to know how the Met Office is funded currently but failed to discover from a Google quick search. I'd like to know the extent to which this is 'Our Data' and despite the Met Office legal notices and Freedom of Information pages, I'm none the wiser really. I console myself with the fact that I'm only playing with no intention to produce a commercial service in competition with the Met Offices own services.

The Inshore waters page looks promising, with sections for each area split into meaningful headings. However on closer inspection the page suffers from that increasingly common bane of the scrapper, a complex mixture of data and JavaScript. The page appearance is the result of JavaScript processing of bland text. Here is the raw forecast for my bit of the coast:

Lands End to St Davids Head including the Bristol Channel

24 hour forecast:
Variable 3 or 4.
Slight becoming moderate later in southwest.
Fair.
Moderate or good.

Outlook:
Variable 3 or 4, becoming west or northwest 4 or 5, occasionally 6 later.
Slight or moderate.
Thundery rain or showers.
Moderate or good.

Well now. Firstly, this is not all the data in the displayed section; the time span and the strong winds warning(if any) are elsewhere in the HTML. The nice sections are not there: instead the four parts of the forecast separated by fullstops - so the last sentence 'Moderate or good' is the Visibility. Second, the limits of the areas are identified by place identifiers in the maplet, but these do not appear in the text, and only the full area name can be used to identify. Of course, the ardent scraper can cope with this. I've been forced to add my own area ids however to support the SMS interface:

Lands End to St Davids Head

But it's horrible, unstable and makes me wonder if this design is a form of obfuscation. I suppose if they wanted to, they could switch randomly between different HTML/JavaScript layers generating the same appearance and then scrappers would be really stuffed - thankfully that seems not be be the case.

Next stop, the shipping forecast. In this case the forecast text is not on the page at all but in a generated JavaScript file which defines JavaScript arrays and their values. In an way that's simpler because I just have to fetch the JavaScript source and parse it. This application and its design is described in detail in the XQuery Wikibook.

Over in the States, their freedom of information creates a very different data climate, and NOAA provides a wonderful array of RSS and XML feeds. However, reusing even this data is not without its problems. One set of feeds I want to tap into are the data from weather buoys around the world. Many are operated by NOAA and others by local Met services or commercial operations. The UK coverage shows the locations and identifiers for UK station and there is an RSS feed of the current conditions at a buoy. The nearest up-weather buoy to Bristol is 62303, Pembroke Buoy. Well this is certainly easily accessible and valid RSS - but ... all the useful data is CDATA text in the description element:


May 24, 2009 0700 UTC

        Location: 51.603N 5.1W

        Wind Direction: SE (140°)

        Wind Speed: 5 knots

        Significant Wave Height: 3 ft

        Atmospheric Pressure: 30.14 in (1020.8 mb)

        Pressure Tendency: +0.03 in (+1.0 mb)

        Air Temperature: 51°F (10.8°C)

        Dew Point: 49°F (9.3°C)

        Water Temperature: 52°F (11.1°C)

So to separate this into meaningful data with semantic markup requires string parsing to extract the data, conversion to standard formats (the date for example) and markup in some XML Schema. Again XQuery can do the analysis. Here is the Pembroke Buoy current data. The data is augmented with some additional derived data, the wind strength on the Beaufort scale.

Of course it would be better to use existing XML schemas or RDF vocabularies than invent my own. However there doesn't seem to be anything in use which fits the bill There is some work on XML schemas for research data interchange but nothing for simple observations and forecasts that I could find. Perhaps the most comprehensive set of element names on which to build is to be found in the NOAA observation XML feeds such as this for Central Park, New York. This is a prime example of how data could be provided and its delightfully simple use makes it a good candidate for student exercises. In this format, both formatted strings and atomic values are provided. In contrast to the use of an attribute for unit, the element name is a concatenation of measurement name and unit, which seems somewhat problematic to me. The data has an attached XML schema but curiously the data is not valid according to this schema. Instead of omitting missing or undefined values, as the schema requires, the text NA is used instead. I emailed the office responsible for this data and was informed that they decided to do this because they got too many enquiries about missing data so they added the NA to make it clear it was really missing! There certainly seems to be a genuine problem there for users who don't read the schema, but my follow-up question as to why, in that case, they didn't change the schema went unanswered.

Weather data represents a case where the domain is generally understood and of interest, large quantities of data are being generated and the data is of critical importance to many users, making it an ideal case study in the web of data for my students. Despite widespread discussion of XML and RDF standards, practical data mashups must rely on hand-coded scrapping, home-build vocabularies and data extracted on dodgy legal grounds. Surely we can do better.

Collation is a core algorithm in processing sequences. In XQuery, the straight-forward expression of the algorithm is as a recursive function:



declare function local:merge($a, $b  as item()*) 
        as item()* {
    if (empty($a) and empty($b))
    then ()
    else if (empty ($b) or $a[1] lt $b[1])
    then  ($a[1], local:merge(subsequence($a, 2), $b))
    else  if (empty($a) or $a[1] gt $b[1])
    then  ($b[1], local:merge($a, subsequence($b,2)))           
    else (: matched :)
             ($a[1], $b[1],  
              local:merge(subsequence($a,2),
              subsequence($b,2)))
   };

Coincidently, Dan McCreary was writing an article in the XQuery wikibook on matching sequences using iteration over one sequence and indexing into the second. The task is to locate missing items. Collation is one approach to this task, albeit requiring that the sequences are in order.

Here is a test suite comparing three methods of sequence comparison.

I also did some volume tests with two sequences differing by a single, central value. Here are the tests on a sequence of 500 items. In summary, the timings are :

* Iteration with lookup: ~~6984 ms~~ - not repeatable - average is 2600
* Iteration with qualified expression: 1399 ms
* Recursive collate: 166 ms

The collate result is surprising and rather impressive. Well done eXist!

The Wallace Line

Sunday, May 24, 2009

Weather Data on the Web

Wednesday, May 13, 2009

Twitter Radio

Wednesday, May 06, 2009

Matching sequences in XQuery

Friday, May 01, 2009

More XQuery performance tests

Blog Archive

Links

About Me