Monday, December 31, 2007

Servers down

Wouldn't you just know it? I no sooner blog about XQuery for Semantic web mashups than the servers in my department at the university go off-line and I guess they might not be back up now till the 2nd. About time we had eXist-db server space in the Cloud I say. I wonder if Amazon would be interested?

Sunday, December 30, 2007

DBpedia and Simile Timeline

Simile Timeline provides a neat way to display events, and we have been using it on the FOLD project and on the DSA module to represent events in the lives of music artists and groups, with data extracted by hand from a copy of the Rolling Stones Review. Having discovered DBpedia, the obvious next step is to use that data source instead. The XQuery code is presented in an article in the Wikibook.

The endpoint is

http://www.cems.uwe.ac.uk/xmlwiki/RDF/groupTimeline.xq?group=

and the parameter is the Wikipedia page name (with underscores).

Some examples:
Only the album cover is displayed in the pop-up, with links to Wikipedia and DBpedia. Coverage of the minimal data required is quite good, but there are gaps, and the format of the release date varies. This is partly due to the need to encode not only the date but also the accuracy with which it is known. Some Wikipedians have used xs:gYear and xs:gYearMonth. The xs:date format, being big-endian, seems to naturally support progressive accuracy, but of course partial values like 2007-12 are not valid xs:date values. In this example, I've merely hacked a year out, but this is not satisfactory.
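
A less hacky approach would be to normalise whatever partial value is present to a full xs:date for the Timeline, accepting that the accuracy information is still lost. The function below is my own illustrative sketch, not the Wikibook code, and it assumes the release dates arrive as strings:

(: a minimal sketch, assuming release dates arrive as strings in xs:date,
   xs:gYearMonth or xs:gYear form; accuracy information is discarded :)
declare function local:normalise-date($d as xs:string) as xs:date {
   if ($d castable as xs:date)
   then xs:date($d)
   else if (matches($d, '^\d{4}-\d{2}$'))
   then xs:date(concat($d, '-01'))
   else if (matches($d, '^\d{4}$'))
   then xs:date(concat($d, '-01-01'))
   else error((), concat('unrecognised date: ', $d))
};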

There is also an HTML page view of the same data, with added comment text: e.g.

The Allman Brothers Band

A separate query creates an index, with links to both views, derived from a Category:


The next step is to derive a set of life events for the group and group members - births, marriages and deaths - to place on a parallel timeline.

Saturday, December 29, 2007

Football Teams, DBPedia and SPARQL

Having written a number of applications which scrape information directly from Wikipedia pages, I was delighted to discover the DBpedia project. Chris Bizer and his colleagues have brought the rather nebulous ideal of the Semantic Web to life with this RDF database of data extracted from Wikipedia, browsable as linked data or queryable with SPARQL. With this resource, I can see how to contrast triples and SPARQL with SQL and XML in the DSA course.

My first experiment has been to try to answer a question prompted, in part, by the recent appointment of Fabio Capello as manager of the England football team. My godson Oliver and I were wondering just how international our club sides are, and what better way to find out than to use the DBpedia data to create a map of the birthplaces of the players in a team?

The result is described in some detail in an XQuery Wikibook article.
Here, for example, are the players in the Bolton Wanderers team shown on a Google Map (you may have to refresh - there's often an initial server error).

Being based on an extract taken from Wikipedia some weeks ago, this data is not quite up to date, and there is missing data and there are inconsistencies in property tagging, but I couldn't have done this without DBpedia - thank you, guys.
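
For the record, the shape of the query is roughly as sketched below. This is not the exact query from the Wikibook article; the property names (p:currentclub, p:birthplace) and the format parameter are assumptions and may not match the current DBpedia extraction:

(: a rough sketch: ask the public DBpedia SPARQL endpoint for the birthplaces
   of a club's players; property names and the format parameter are assumptions :)
let $sparql :=
   'PREFIX p: <http://dbpedia.org/property/>
    SELECT ?player ?birthplace WHERE {
       ?player p:currentclub <http://dbpedia.org/resource/Bolton_Wanderers_F.C.> .
       ?player p:birthplace ?birthplace .
    }'
let $uri := concat('http://dbpedia.org/sparql?format=xml&amp;query=',
                   encode-for-uri($sparql))
return doc($uri)//*:result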

Wednesday, November 28, 2007

Pipelines

Looks like it's going to be the year of 'Pipes' in my teaching. I came across Yahoo Pipes earlier this year and decided to restructure my second-level module, Data Schemas and Applications, around the 'web as database', before diving into local database technologies like relational and XML databases. Yahoo Pipes provides an approachable starting point and has the added benefit of being a technology which none of my students, from a diverse collection of programmes, have yet encountered. Also, those slinky pipes and typed ports are so seductive.

However, the visual editor soon becomes awkward to use and thus leads naturally into using a scripting language instead. I started with XSLT, expecting to move quickly on to XQuery, but I've been surprised to find how much can be done, especially with XSLT 2.0. For a transformation engine, I've set up a service (using Saxon 8 via XQuery on eXist-db). This has allowed us to implement most of the Yahoo pipes we'd written, and also to search over an XML file with a single script containing both a form and the search results. More...
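
The heart of that service is little more than a call to eXist's transform module (which wraps Saxon). The sketch below uses placeholder collection paths rather than the real ones on our server:

(: a minimal sketch of the transformation step; the /db paths are placeholders :)
declare namespace transform = "http://exist-db.org/xquery/transform";

let $source := doc('/db/pipes/sample-feed.xml')
let $stylesheet := doc('/db/pipes/feed-to-html.xsl')
return transform:transform($source, $stylesheet, ())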

Although many of the steps in a Yahoo pipeline can be handled within a single XSLT script, some of the processing I want to demonstrate involves HTML pages which are not XHTML, so I needed a tidy service too, and a way to pipeline the two together.

So... I need a pipeline language, a way of visualizing the pipeline and an engine to execute the pipeline. Naturally I started to write my own, based mainly on XPL which Eric Bruchez introduced me to at XML Prague. A tentative first step using an XQuery script is described in an XQuery WikiBook article.

Of course this is fine to play with, but I need to join the real world of pipeline languages. I suppose the main contenders are:
NetKernel looks theoretically and practically very interesting and, what's more, 1060 Research are a locally based spin-off from HP Labs next door.

Saturday, November 24, 2007

Topological sorting in XQuery

In the course of my attempts to implement an XQuery engine for the XML pipeline language XPL, I realized that I would need to sort the processes so that each process has its inputs available, i.e. a topological sort.

Given a sequence of nodes and references with the Relax-NG schema:


element node {
   attribute id { xs:string },
   element ref {
      attribute id { xs:string }
   }*
}+


The post-condition after sorting can be defined as:


declare function local:topological-sorted($nodes) as xs:boolean {
   every $n in $nodes satisfies
      every $id in $n/ref/@id
      satisfies $id = $n/preceding::node/@id
};


and the recursive function to order the nodes is:



declare function local:topological-sort($unordered, $ordered) {
   if (empty($unordered))
   then $ordered
   else
      (: select the nodes all of whose references are already in the ordered list :)
      let $nodes := $unordered[every $id in ref/@id satisfies $id = $ordered/@id]
      return local:topological-sort($unordered except $nodes, ($ordered, $nodes))
};




Sweet, eh? Even if this implementation is not the most efficient, it has the advantage of being obviously correct.
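
A small usage example, with made-up node ids, shows the behaviour:

(: node c depends on a and b, and b depends on a :)
let $nodes :=
   (<node id="c"><ref id="a"/><ref id="b"/></node>,
    <node id="b"><ref id="a"/></node>,
    <node id="a"/>)
return local:topological-sort($nodes, ())
(: returns the nodes in the order a, b, c :)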

See (and improve!) the XQuery WikiBook article.

Sunday, November 18, 2007

the Prime Sieve in XQuery

I guess I've been putting my spare effort over the past few months into the Wikibook on XQuery, so the blog has been very neglected.

I've learnt a lot in writing the example code in the Wikibook. The other day I bumped into Project Euler and started on the first few problems. Problem 3 is about primes, so I had to write a prime number generator. The Sieve of Eratosthenes in XQuery is so simple and obvious:


declare function local:sieve($primes, $nums) {
   if (exists($nums))
   then
      (: the first remaining number is prime; drop its multiples and recurse :)
      let $prime := $nums[1]
      return local:sieve(($primes, $prime), $nums[. mod $prime != 0])
   else $primes
};

local:sieve((), 2 to 1000)


The list of primes starts off empty, the list of numbers starts off with the integers. Each recursive call of local:sieve takes the first of the remaining integers as a new prime and reduces the list of integers to those not divisible by the prime. When the list of integers is exhausted, the list of primes is returned.

Lovely but sadly not very practical for the size of numbers in the problem.
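
For the problem itself, trial division is more practical. The function below is my own illustrative sketch rather than the Wikibook solution, and it assumes the engine optimises tail calls (as Saxon and eXist do):

(: a minimal sketch: divide out each factor found, so only the largest remains :)
declare function local:largest-factor($n as xs:integer, $d as xs:integer) as xs:integer {
   if ($d * $d > $n)
   then $n
   else if ($n mod $d = 0)
   then local:largest-factor($n idiv $d, $d)
   else local:largest-factor($n, $d + 1)
};

local:largest-factor(600851475143, 2)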

Discussion and execution are in the Wikibook.

Thursday, June 14, 2007

Web 2.0 in teaching

This year, I've used a few Web 2.0 tools to enliven my courses, on two modules: DSA and IAD. Since I also teach web technologies, it seems appropriate that students gain some experience of these tools as part of the courses.

This is a summary of the value my students and I have found in some of them.

Background

The standard VLE at UWE is BlackBoard (UWEOnline). There are a few difficulties with the way it is administered here, particularly around the restricted access to material. Another problem is the inability to link to modules or documents within UWEOnline, which limits the possibilities for integration. Nonetheless, this is the vehicle students expect to be used. Of course, in this subject area much of the material is based on server languages like PHP, MySQL and XQuery and thus resides on our faculty server.

Blogs

Blogs were used in two ways: as a means of communicating from teachers to students, with student commenting (DSA and IAD), and as a component of each student's work.

Tutor blogs

The blogs are all hosted on Google's Blogger. During the year this improved in ease of use and in the ability to tag any item, so that all items on lectures, coursework 1 or PHP can be tagged and then browsed as a group.

Over the year there were over 50 posts to each blog. The tutors on the modules all had write access. Activity on the blogs was monitored with Google Analytics and showed the expected increase in traffic around lecture times and towards coursework hand-ins. Since the blogs were public, there was a background of hits from all over the world, and some items, such as one on the Periodic Table of Visualisation Methods, were linked to and gained considerable traffic. Comments on both blogs were few, from students and the public alike, but there were some.

I also occasionally post to my personal blog (here), and it was sometimes unclear where an item should be posted. I also tried to run one for the new 1-year programme in Internet Application Development, but this did not take off.

Lessons:

Students prefer a consistent interface to learning materials, although they also valued the narrative structure of the blog. However, there are difficulties in using a blog to organise information which must be easily findable, even with the tagging facility. Thus, with some reluctance, I have decided to revert to putting lecture notes on BlackBoard rather than my own web site. BlackBoard will also contain the workplan. I plan to use my own blog to add my own commentary and additional material, with items tagged for consumption by one or other module, even though the blog will only be able to refer to items in BlackBoard by name.

Student blogs

Students on IAD were required to keep individual blogs as a record of their reading over the year. This was not successful, partly as a result of concern about the damage that poor-quality web appearances could do to reputations and future prospects. In this regard, it may be preferable to use the private blogs in BlackBoard for this purpose.

Wikis

I have used wikis for some time, and one was used for Web 2.0 technologies. It is doubtful that this was beneficial, with so much excellent material on Wikipedia. However, for details of implementations and tools which are specific to CEMS, there is a need for a local wiki. This was established as the CEMS wiki and is being populated, though much more could be done before it becomes the first place of enquiry about the use of a language, tool or technique in CEMS.


I also set up a student wiki in CEMS for use by students in organising and presenting their research into individual topics. This was quite successful, both as an opportunity to gain experience with wikis and as a means of organising their material. However, the student entries were more in the nature of online articles, with a low level of linkage. Students were required to hand in a copy of their wiki material (since this is needed for external examiners) and this perhaps led to the lack of a linked structure. A more collaborative, group-based approach might be tried to get a larger network, but there are obvious problems with attributing authorship.

Google docs


On one module, the tutors kept a register of attendance. We opted to use the recently available Google Spreadsheets for this, partly to explore the value of collaborative documents. Technically, this was very successful, and a big improvement on passing Excel spreadsheets around. However (there seem to be rather a lot of 'however's in this account :-( ), the spreadsheet feels rather cramped and lacks the ability to hold row and column headings in place when scrolling (freeze panes).

RSS

An advantage of blogs is that they provide an RSS feed which can then be aggregated with other feeds (unlike BlackBoard, which provides none). An attempt was made to use this feed to populate the announcements in BlackBoard, but it was only partly successful and created an additional point of failure. Take-up of RSS is still slow, but the new myUWE portal supports RSS feeds, so it will be suggested that the blog feed be added by individual students.

Podcasts

A couple of lectures were recorded on audio and posted to the blog. I also used Evoca to record short messages.
Whether these initiatives were valued is difficult to say. I think audio alone is of limited value, and an integrated slideshow with video and audio is the ideal. Creating such resources for small to medium-sized classes, in such a rapidly changing field as the web, may not be economically viable.

Conclusion

These new technologies have great promise, but the fit with a corporate culture and an institutional e-learning policy creates tensions. This year's experiences are certainly food for thought.





Saturday, June 09, 2007

XQuery functions

Over the last few weeks, I've been writing some general XQuery functions for our own system and in collaboration with Dan McCreary. There are XQuery collections to which these could be contributed, notably Priscilla Walmsley's http://www.xqueryfunctions.com/xq/
but some of my functions depend on the eXist function libraries, and I would also like to prove the functions work and show examples of use in unit tests.

I'm currently working on some system tools to browse the code, develop full coverage tests and help refactor the code for the FOLD application. Part of this suite is an XQuery unit test tool. As an experiment, I've combined the two and put up a simple demo on our public eXist demo server.

The function is defined in the prolog which is prepended to each test before execution. The result can be either text or a node which is compared with the expected result as strings or with deep-equal respectively.
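
In outline the comparison step looks something like the sketch below; the test document structure (prolog, code and expected elements) is hypothetical here, and util:eval is eXist-specific:

(: a rough sketch, assuming a hypothetical test document structure :)
declare namespace util = "http://exist-db.org/xquery/util";

declare function local:run-test($test as element(test)) as xs:boolean {
   let $query := concat(string($test/../prolog), ' ', string($test/code))
   let $result := util:eval($query)
   return
      if ($test/expected/*)
      (: a node result is compared with deep-equal :)
      then deep-equal($result, $test/expected/*)
      (: otherwise compare as strings :)
      else string-join(for $r in $result return string($r), '')
           = string($test/expected)
};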

The test scripts themselves, viewed as XML, look a bit of a mess, but the source shows the formatting. CDATA sections are now preserved in stored files in eXist, which is convenient when round-tripping XQuery code containing XML, but is controversial since CDATA is only intended to escape XML characters on input and is not part of the information model.

Links
Next step is to document the test framework itself and provide the means for others to add test scripts.

Sunday, March 25, 2007

SPA2007 - day 0

I tinker endlessly with the eXist workshop material; there are still some solutions to do, and the new server isn't having anything to do with the Java Web Start client. Hopefully it will get fixed before Wednesday afternoon. I'll also have to figure out how to create multiple folders with the same initial contents - I should be able to use a backup, but I'm not sure how to restore it to a different place.

Monday, March 19, 2007

FizzBuzz

Here's a quick XQuery solution to the FizzBuzz problem posed in David Patterson's blog.

I've taken the liberty of splitting the hyphenated string into two attributes.

let $config :=
   <fizzbuzz>
      <range min="1" max="100"/>
      <test>
         <mod value="3" test="0">Fizz</mod>
         <mod value="5" test="0">Buzz</mod>
      </test>
   </fizzbuzz>

for $i in ($config/range/@min to $config/range/@max)
let $s :=
   for $mod in $config/test/mod
   return
      if ($i mod $mod/@value = $mod/@test)
      then string($mod)
      else ()
return
   if (exists($s))
   then string-join($s, ' ')
   else $i


Thursday, March 15, 2007

Timelines

Here are some ways in which timelines are being used in the web:

  • TimeSearch
  • GoogleEarth timeline
  • HyperHistory
  • SIMILE timeline from MIT
    • Provides a Javascript API to create a complex panning timeline.
    • Events are uploaded in XML format which defines for each event
      • start title end? isDuration? latestStart? earliestEnd? image? link? and body
    • Can't locate a schema for the event stream
    • Event date format is not xs:date
    • body must be a string, i.e. with &lt; rather than a literal <
    • Set the mime type to application/xml
    • example XML timeline
I've implemented the SIMILE timeline in my Family History. Here is an example.
Links to other people load their timelines (although some are incomplete and not working).

Multiple event streams can be displayed in multiple bands - e.g. here are the lives of two family members

It would be great to get a subset of world history events from TimeSearch to include as a separate band.

To do:
  • Convert xs:date to Timeline format (see the sketch below)
  • Handle photos with missing dates
  • Focus timeline when no date of birth
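
For the first of these to-dos, a minimal sketch of the conversion, assuming Timeline is happy with Gregorian-style strings such as "Mar 15 2007", might be:

declare function local:timeline-date($d as xs:date) as xs:string {
   let $months := ('Jan','Feb','Mar','Apr','May','Jun',
                   'Jul','Aug','Sep','Oct','Nov','Dec')
   return string-join((
      $months[month-from-date($d)],
      string(day-from-date($d)),
      string(year-from-date($d))
   ), ' ')
};

local:timeline-date(xs:date('2007-03-15'))   (: "Mar 15 2007" :)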

Wednesday, March 14, 2007

My Family History

Bamber Gascoigne's TimeSearch got me thinking about my family history project again.

The Family History project started with the idea of putting some family photos on the web, together with some meta data about each photo. Most had a list of subjects but only very occasionally a date. I had the idea that if the birth dates of the subjects were added, and the age of just one subject could be guessed, the system could infer the date of the photograph, and hence the ages of all the subjects. With birth and death data included, a timeline could be extracted for a person, including births, deaths and marriages, photographs and even world events.

The resultant eXist/XQuery/XSLT prototype is here.

What would be nice would be to be able to combine such personal histories with events in world history, or to compare the timelines of famous people with that of a family member. A simplistic approach is to deep-link into history sites, and this is what the prototype has done, creating links like
These generated links are not as clever as those in TimeSearch, which I would guess are hand edited.

This is simple to accomplish, but it's rather one-sided - my site can link to another, but it can't mash up that site's data with my own. A mashup requires an API and a published format for exported content. I suppose we might look at hCalendar as one possible format, but it would need extending to support tags. RSS is another possibility.
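
For a feel of what an hCalendar-style export might look like, here is a speculative sketch; the person element and its fields are hypothetical, and the tagging extension is exactly the part hCalendar does not yet cover:

(: a speculative sketch of a birth event marked up with hCalendar class names :)
declare function local:birth-event($person as element(person)) as element(div) {
   <div class="vevent">
      <span class="summary">Birth of {string($person/name)}</span>
      <abbr class="dtstart" title="{string($person/born)}">{string($person/born)}</abbr>
   </div>
};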

TimeSearch by Bamber Gascoigne

The item on the BBC's 'Start the Week' by Bamber Gascoigne about the history site he has created, TimeSearch, sparked my interest. TimeSearch is a history search portal, using event stubs to provide links onwards to a wide range of sites - from general news sites, Wikipedia and Google Images to specialist on-line resources. Events can be selected using two hierarchical category systems, location and theme, as well as year. I confess I had some trouble deselecting categories once selected.

This is the latest form of delivery of the extensive material Bamber has created and put on line in:
One wonders how this authored material compares with the collaborative Wikipedia. It is certainly an impressive and rich resource.

Tuesday, February 13, 2007

SPA 2007

I have been working on my session for this year's SPA for a while now. Seems a timely session to run according to Elliotte Rusty Harold's predictions for XML in 2007 (which mentions eXist).

Mostly I have been vacillating about which case study to use - I have so many part-finished projects. At first I revived the Whisky project, but after a lot of work, this felt like a bit of a slog - it was good as a modeling exercise but didn't reveal the power of XML and XQuery to my satisfaction. I then switched to the Bus Timetable project, which is much more realistic and one which I want to solve to replace the existing PHP/MySQL application with its hand-crafted timetables. However the data here is very complex, and the data files I have so far are incomplete. Nevertheless, this is the case study I settled on, although I did a short detour into the DVD hire example and of course there is always StudentsOnline, the main site we are developing with this technology.

Timetables come from TravelLine South-West in the form of TransXChange files. These first need processing into a more amenable structure. The actual data model is highly interlinked, which makes for some XPath fun. Conversion is required to calculate absolute times (TXC records time differences between stops) and to convert from OS Easting and Northing to Lat/Long. With the simplified structures, I can present the data in various ways. HTML pages show the stops and the departure times, but the neat idea is to generate KML files to overlay on Google Earth. Using a NetworkLink, I can display changing times on the map. The XPath date and time functions are a great help.
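
The absolute-time calculation itself is pleasantly simple once the run times are available as xs:dayTimeDuration values; the sketch below uses made-up figures rather than real TransXChange data:

(: a minimal sketch: accumulate run times between stops to get departure times :)
let $start := xs:time('09:30:00')
let $runtimes := (xs:dayTimeDuration('PT2M'),
                  xs:dayTimeDuration('PT3M'),
                  xs:dayTimeDuration('PT1M'))
for $i in 1 to count($runtimes)
return $start + sum(subsequence($runtimes, 1, $i))
(: 09:32:00, 09:35:00, 09:36:00 :)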

Other bits of functionality include an Ajax-style incremental bus stop locator and membership registration retaining favourite stops and services.

Here are some useful links for this project:

Monday, January 29, 2007

Computed synaesthesia

The Cloud Appreciation Society led me to the Cloud Harp project. I'm listening to a recording made in 2004 - very ethereal.

It has revived my interest in artificial synaesthesia generally, and perhaps a return to my Bristol Harbour project, using an image stream and image processing to pick out the frequency and direction of wave patterns to modulate music, with self-contained stations set up around the harbour.

Information arts often has artificial synaesthesia at its heart, converting information in one medium to information in another, with the aim of thus enhancing our appreciation and understanding of the source domain and creating beauty and interest in the target domain.

Sunday, January 28, 2007

Cloud Appreciation Society

One of my Christmas presents was 'The Cloudspotter's Guide' by Gavin Pretor-Pinney, published in 2006 by Hodder & Stoughton on behalf of the Cloud Appreciation Society.

This site includes an excellent example of a subject-specific photo site, with many excellent photographs of cloud formations, optical effects and associated phenomena. The site is interesting in the context of the DSA module because it combines the problem of creating a photo site with the problems of tagging and categorising information, so I thought I'd write a review of it here, perhaps as an example of what might have been included in the Flickr review. Next year I think I will base the coursework on a selection of sites like this rather than a monster site like Flickr. I welcome suggestions.

Navigation

The home page features the most recently uploaded photo.

Search by category, and by partial string match within the title, is provided on this page. These lead to two different album pages: a paged layout in the case of the category search, and a list with thumbnails and titles in front of the page from which the search was launched in the case of the title search. It's not clear why there should be two different approaches. Usefully, the category selector shows the number of photographs in each category.

Direct links to the categories to which the current image has been assigned are provided and a featured category (Cloud Lookalikes) is at the top.

Each page includes a selection of other photographs. The selection appears to be based simply on adjacent photo accession numbers rather than on any measure of similarity, but it nevertheless encourages serendipitous browsing. Curiously, the aspect ratio of these photos is determined by that of the main photo, leading to misshapen images.

Clicking on the photo rather surprisingly links to the next photo in sequence. There is also a 'Previous Page' link which is functionally a back button - it is not quite clear why this is included since this is a standard browser button. Perhaps Next and Previous links (in the accession sequence) would provide a better navigational mechanism.

A search for (or link to) a member's photos would be useful.

Photo Data

Images are jpegs at medium and thumbnail resolution.

Meta data include the name of the copyright owner (or is it just the member?), a descriptive title, and date and time (but of upload, I would guess, not of the photograph itself). The title can contain links, e.g. to related sites.

Each photograph is related to several categories.

It is a pity that the photographs do not appear to be geo-coded, because it would be nice to mash these up with Google Earth. There is often a description of the place which could be translated through a geo-coding service. Since some photos are taken from planes, altitude data is needed too.

Tagging and classification
Each photograph is classified into one or several of around 40 categories and hence linked, through each category, to other photos in the same category. The category system itself is interesting as an information construct. On the surface, it would seem that it would benefit from some hierarchy, e.g.

  • All photos
    • Cloud Type
      • Cumulus
        • Kelvin-Helmholtz wave cloud
      • ...
    • Cloud features
      • Contrails
      • ...
    • Time-of-Day
      • Sunrise
      • Sunset
    • Optical Effects
      • Rainbow
      • Halo
      • ...
and probably others.

It also seems that some category meta-data is embedded in the name - the numbers on cloud types are references to the chapters in the book. Clearly a general description of the category itself could also be part of the overall data structure. The book provides a hierarchical classification of cloud types which would be useful to include. There are also occasional synonyms embedded in the name - e.g. Mamma (also known as mammatus) - which could be treated uniformly as category meta-data.

The task of classifying each photo into these categories is not open to the public or members but is seen as an expert task. However, there are places, such as in the category of 'Clouds that look like things', where folksonomy would seem to be appropriate. Photographs also sometimes contain objects, e.g. balloons, planes or boats, for which arbitrary tags seem suitable.


Upload
Rather surprisingly, photo upload, even by members, is not supported, and photos have to be emailed to the webmaster with accompanying meta data for upload and categorisation.

Rating and comments
The public can easily rate photos and add comments. The comments are generally expressions of appreciation of no interest to anyone but the photographer. There is no information on the basis of the rating - indeed, all the photos I've seen are rated 4. It is not clear what value this adds, or indeed what is being rated - the photograph or the subject.

Compatibility and Accessibility
Images have alt tags, but these contain the copyright names rather than a description of the image, which would better aid the reader.

There are keywords in the page meta data but these are not photo specific.

XHTML compliance: the main photo page shows 1 error and 17 warnings. Typically these are problems with table and div tag nesting and with unescaped & characters. The error is a non-HTML tag.

The display of some characters in the title is broken (on Firefox and IE), e.g. this

API and Feeds
The site provides RSS2.0 and Atom feeds of the latest photo additions and a Feedburner link.

Technology
The scripting language is PHP; images are held as files, not in the database. I'm not sure what database is used (probably MySQL?). CSS is used for styling. There is a small amount of Javascript, e.g. for GIF rollover on the rating buttons and the call to update the rating. One function is defined (flip) which is not used.

The program architecture uses a single index.php script with a parameter to determine what page type to return, rather than multiple scripts. Interesting design issue here.

Data Model

Saturday, January 20, 2007

Where I've been

Just discovered this little application which generates a GIF map of the world showing the countries you have visited.



So many places yet to see, and so much unseen even in the places I have visited, like most of the States:



From Douwe Osinga, a Google engineer

Thursday, January 18, 2007

Voice snippets

I've been experimenting today with http://www.evoca.com/

A free account allows you to record and store up to 60 minutes of sound. This interface is simple to use. Sound can be captured from a number of sources but the easiest is directly from a microphone on the computer. Everything is done through the web interface. When the recording has been made, an HTML snippet is provided to paste where you need it, and this invokes a flash player.

I'm trying these as an addition to my blog for the DSA module.

I suppose it will just pollute the airwaves, but maybe it offers another mode of information dissemination which may just help some students. For two lectures last year I recorded the whole lecture and mounted it as an MP3 file, but the recordings were very long and unsynchronised with the PowerPoint slides. I tried calling out slide numbers, but it was reminiscent of the early soccer commentaries which used numbered squares (the origin, I learnt today, of the phrase 'back to square one').

Monday, January 15, 2007

Whisky - Laphroaig

This entry is a sample of data sources on a specific whisky - assembled here as a notepad for the workshop at SPA2007 and for the Data, Schemas and Applications module.


Laphroaig
Laphroaig Distillery,
Port Ellen,
Isle of Islay,
Argyll,
PA42 7DU Tel. 01496 302418

Home site

General

Types of information


Printed
  • Malt Whisky Yearbook 2006 p170
  • Collins Gem Whisky p 171

Beyond Belief

I spent part of this weekend listening to the presentations at BeyondBelief2006 for which all sessions over the two and a half days are available as Flash video. A great resource to have provided and very well done. Ongoing discussion on the Edge.

Amongst the speakers, I was particularly taken with
Amongst other thoughts provoked by these sessions, I was struck by the observation that it is scientists of all breeds who are most directly trying to understand and pay attention to 'God', whilst the religious and the theologians pay attention to the works of man - the books, the practices, the historical development of religious thought by man.