Wednesday, April 01, 2009

Parameterised MS Word Documents with XQuery

It's coming round to exam time again at UWE, Bristol and as usual I've been struggling to get mine written. The XQuery-based FOLD application (which supports staff and students in our School) generates exam front pages containing exam details such as module code and title, examination date, length and time. Until now these were generated as HTML, which had to be copied (poorly) into MS Word. This wasn't very satisfactory, and it would be better to generate a Word document with the completed front page and sample pages with headers and footers. I'd put this off as it seemed too complicated. The Word XML format, wordml, is one route but it looked daunting to generate from scratch.

However, for this application I only need to make some small edits to a base document. The most obvious approach was to 'parameterise' the Word document with place-holders. Unique place-holders can be edited in with Word before the document is saved as XML. Fields which are not editable in MS Word, such as the author and timestamps, can be parameterised by editing the wordml directly. To instantiate a new Word document, the place-holders in the wordml are replaced with their values.

Treating this as string replacement is easier than editing the XML directly, even if that were possible in XQuery. The XQuery script reads the wordml document, serializes the XML as a string, replaces the place-holders in the string with their values and then converts the result back to XML for output.

Although this is not a typical task for XQuery and would be written in a similar way in other scripting languages, it is possible in XQuery with the help of a pair of functions which should be part of a common XQuery function library. In eXist these are util:serialize() to convert from XML to a string and the inverse, util:parse().

The function needs to replace multiple strings, so we use an XML element to define the name/value pairs:

let $moduleCode := request:get-parameter("moduleCode",())
..
let $replacement :=
<replacement>
<replace string="F_ModuleCode" value="{$moduleCode}"/>
<replace string="F_Title" value="{$title}"/>
<replace string="F_LastAuthor" value="FOLD"/>
..
</replacement>

and a recursive function to do the replacements:

declare function local:replace($string,$replacements) {
  (: no replacements left: return the string unchanged :)
  if (empty($replacements))
  then $string
  else
    (: apply the first replacement, then recurse on the remainder :)
    let $replace := $replacements[1]
    let $rstring := replace($string,string($replace/@string),string($replace/@value))
    return
      local:replace($rstring,subsequence($replacements,2))
};
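
One caveat with the function above: replace() treats its second argument as a regular expression, and "$" and "\" have special meaning in its third argument, so a value such as a title containing "$" would raise an error rather than be inserted literally. The following is a minimal sketch of one way to guard against that, not part of the original script; the helper names local:escape-pattern, local:escape-replacement and local:replace-literal and the sample values are hypothetical.

```xquery
xquery version "1.0";

(: Hypothetical helper: escape regex metacharacters so that a
   place-holder is matched as a literal string. :)
declare function local:escape-pattern($s as xs:string) as xs:string {
  replace($s, '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1')
};

(: Hypothetical helper: escape "\" and "$", which are special in the
   replacement argument of replace(). :)
declare function local:escape-replacement($s as xs:string) as xs:string {
  replace($s, '(\\|\$)', '\\$1')
};

(: Same recursion as local:replace, but with both arguments escaped. :)
declare function local:replace-literal($string as xs:string,
                                       $replacements as element(replace)*) as xs:string {
  if (empty($replacements))
  then $string
  else
    let $r := $replacements[1]
    let $rstring := replace($string,
                            local:escape-pattern(string($r/@string)),
                            local:escape-replacement(string($r/@value)))
    return local:replace-literal($rstring, subsequence($replacements, 2))
};

(: Sample data for illustration only. :)
let $replacement :=
  <replacement>
    <replace string="F_ModuleCode" value="MOD101"/>
    <replace string="F_Title" value="Costs in US$ and paths like C:\temp"/>
  </replacement>
return local:replace-literal("F_ModuleCode: F_Title", $replacement/*)
(: returns "MOD101: Costs in US$ and paths like C:\temp" :)
```

Because the result is parsed back into XML, values containing "&" or "<" would presumably also need XML-escaping before substitution; that is not handled in this sketch.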

After gathering the parameter values and formatting a replacement element, the new document is generated by:

let $template := doc("/db/FOLD/doc/examtemplate.xml")
let $stemplate := util:serialize($template,"method=xml")
let $mtemplate := local:replace($stemplate,$replacement/*)
return
util:parse($mtemplate)

Here the generated wordml is displayed in the browser, from where it can be saved, then loaded into Word. I found out that the processing instruction at the front of the wordml:

<?mso-application progid="Word.Document"?>

is used by the Windows OS to associate the file with MS Word, so the media type can be just the standard text/xml. However, it is helpful to define a suitable default file name (the script also sets an explicit Content-Type header) using functions in eXist's HTTP response module, the counterpart of the request module used to access URL parameters:

let $dummy := response:set-header('Content-Disposition', concat('attachment;filename=',concat("Exam_",$moduleCode,".xml") ))
let $dummy := response:set-header('Content-Type','application/msword')

The document could also be saved directly to the database, or all documents generated eagerly ready for use.
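
Saving to the database might look something like the sketch below, which uses eXist's xmldb:store() function. The function name local:store-exam, the target collection /db/FOLD/doc/exams and the stand-in document are hypothetical, and the sketch assumes the collection already exists, the user has write permission, and the xmldb prefix is bound by default, as util:, request: and response: are in the snippets above.

```xquery
xquery version "1.0";

(: Sketch only: store a generated document under a name derived from
   the module code. The collection path is a hypothetical example. :)
declare function local:store-exam($moduleCode as xs:string, $doc as node()) {
  xmldb:store("/db/FOLD/doc/exams",
              concat("Exam_", $moduleCode, ".xml"),
              $doc)
};

(: Stand-in for the document returned by util:parse($mtemplate). :)
let $generated := <exam>generated wordml would go here</exam>
return local:store-exam("MOD101", $generated)
```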

This approach feels like a bit of a hack, but it took only an hour to develop and is a major improvement on the previous approach. Changes to the base document will need re-parameterisation, but that seems a small overhead for slowly changing standard documents. XQuery forces a recursive approach to the string replacements where an iterative updating approach would avoid copying the (rather large) string, but performance is fast enough for this task; indeed, string handling in eXist is very fast. My MS Office users will be happier, but I still need to think about the Unix and Mac OS users.
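
For readers on a processor that supports XQuery 3.0 higher-order functions (which postdate this post), the explicit recursion could be expressed as a fold instead. This is only a sketch under that assumption; the name local:replace-all and the sample values are hypothetical, and each step still produces a new string, so it changes the expression rather than the copying behaviour.

```xquery
xquery version "3.0";

(: Sketch: apply each name/value pair in turn, threading the string
   through as the accumulator of fold-left. :)
declare function local:replace-all($string as xs:string,
                                   $replacements as element(replace)*) {
  fold-left($replacements, $string,
    function($acc, $r) {
      replace($acc, string($r/@string), string($r/@value))
    })
};

(: Sample data for illustration only. :)
let $replacement :=
  <replacement>
    <replace string="F_ModuleCode" value="MOD101"/>
    <replace string="F_Title" value="Sample Title"/>
  </replacement>
return local:replace-all("F_ModuleCode F_Title", $replacement/*)
(: returns "MOD101 Sample Title" :)
```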
