OpenDocument started out as the file format for OpenOffice.org, an open source office suite descended from Sun’s StarOffice. This XML-based format was deliberately designed to not be an XML representation of the internal structure of an OpenOffice.org document. Rather, it is a representation of an idealized document’s structure.
Bad News: OpenOffice.org (and all applications using that XML format) have to do some extra work to convert this idealized format into their particular data structures.
Good News: Applications which work with the XML format don’t have to force their data structures into the mold of a particular application.
The formation of a technical committee for the OASIS Open Office XML Format was announced in November of 2002. With participation from industry experts, the format was approved as the Open Document Format for Office Applications (OpenDocument) v1.0 in May of 2005, and the standard was submitted to ISO in September 2005.
XML is human-readable, but is meant to be machine-readable. As a result, XML can be fairly verbose. This can become an issue in terms of storage space and bandwidth. In order to save space, OpenDocument’s designers decided to save the information in a zip file format. The zip file format is well documented, programs to read and write zip files exist cross-platform, and the format provides an acceptable balance between amount of compression and algorithm speed. You can see the complete rationale for this decision here.
That having been said, let’s write a small word processing file in an application that supports OpenDocument and unzip it to see what awaits within. We won’t go into great detail here–we’ll just look deeply enough so that you can get the idea of how the document you see in the application is related to the XML.
Inside the META-INF directory is the manifest.xml file, which lists all the other files that are in the zip file. For those of you familiar with Java’s JAR file format, be warned–the format of this manifest.xml file is not the same as the manifest you would find in the JAR file for a Java program.
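To make the manifest a little more concrete, here is a minimal Python sketch that lists the entries in META-INF/manifest.xml. It builds a tiny stand-in package in memory so that it runs on its own. The sample manifest text and the helper name list_manifest are invented for illustration; the element and attribute names are the ones in the OpenDocument manifest namespace.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"

# A hand-written stand-in for a real manifest (illustrative only).
SAMPLE_MANIFEST = f"""<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="{MANIFEST_NS}">
  <manifest:file-entry manifest:media-type="application/vnd.oasis.opendocument.text" manifest:full-path="/"/>
  <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
  <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="styles.xml"/>
  <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="meta.xml"/>
</manifest:manifest>
"""


def list_manifest(package):
    """Return (full-path, media-type) pairs from META-INF/manifest.xml.

    `package` is a file name or a file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))
    entries = []
    for entry in root.findall(f"{{{MANIFEST_NS}}}file-entry"):
        entries.append((entry.get(f"{{{MANIFEST_NS}}}full-path"),
                        entry.get(f"{{{MANIFEST_NS}}}media-type")))
    return entries


# Build the stand-in package in memory so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("META-INF/manifest.xml", SAMPLE_MANIFEST)

for path, media_type in list_manifest(buffer):
    print(path, media_type)
```

With a real document, you would pass the file name of the saved document to list_manifest instead of the in-memory buffer.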
The settings file stores application-specific data, such as window size. Your mileage may vary, so let’s move on.
The meta.xml file stores information about your document, including such things as the author’s name, the time the document was created, the number of times it’s been edited, and any other user-defined information that an application might wish to store, as shown in Figure 2.
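As a rough sketch of how a program might read that metadata, the following Python code parses a hand-written meta.xml fragment. The sample values, the dictionary keys, and the helper name read_meta are all invented for the example; only the namespaces and element names come from the format itself.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample; the names and values are made up.
SAMPLE_META = """<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <office:meta>
    <meta:initial-creator>A. Writer</meta:initial-creator>
    <meta:creation-date>2005-10-01T09:30:00</meta:creation-date>
    <dc:creator>B. Editor</dc:creator>
    <meta:editing-cycles>3</meta:editing-cycles>
    <meta:user-defined meta:name="Reviewed by">C. Reader</meta:user-defined>
  </office:meta>
</office:document-meta>
"""


def read_meta(xml_text):
    """Collect a few metadata fields into a plain dictionary."""
    root = ET.fromstring(xml_text)
    meta = root.find("office:meta", NS)
    info = {
        "initial_creator": meta.findtext("meta:initial-creator", namespaces=NS),
        "last_editor": meta.findtext("dc:creator", namespaces=NS),
        "creation_date": meta.findtext("meta:creation-date", namespaces=NS),
        "editing_cycles": int(
            meta.findtext("meta:editing-cycles", default="0", namespaces=NS)
        ),
    }
    # User-defined entries carry their label in the meta:name attribute.
    name_attr = f"{{{NS['meta']}}}name"
    info["user_defined"] = {
        item.get(name_attr): item.text
        for item in meta.findall("meta:user-defined", NS)
    }
    return info


print(read_meta(SAMPLE_META))
```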
An OpenDocument file can have two types of styles: ones that are globally defined for use throughout the document, and “automatic styles” that are defined on the fly (for example, switching to italics for a word or phrase). Figure 3 shows a global style built into the application, and Figure 4 shows a user-defined global style. The styles.xml file contains the global styles. The corresponding code is shown in Listing 1 and Listing 2.
Finally, we come to the content.xml file, which contains the main content of the document. It also contains automatic styles, such as the italic shown in Figure 5. The relevant style is shown in Listing 3. The content that refers to the built-in style, user-defined style, and automatic style is excerpted into Listing 4.
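To see how the content points at its styles, here is a small Python sketch that walks a content.xml fragment and reports, for each element that names a style, whether the style is an automatic one defined in the same file or a global one that would have to come from styles.xml. The fragment is hand-written and is not the listing from this article; the style names (Heading_20_1, Custom_20_Para, T1) and the helper name styles_in_use are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hand-written fragment; the style names are invented for illustration.
SAMPLE_CONTENT = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties fo:font-style="italic"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <office:text>
      <text:h text:style-name="Heading_20_1" text:outline-level="1">A Heading</text:h>
      <text:p text:style-name="Custom_20_Para">Some <text:span text:style-name="T1">italic</text:span> words.</text:p>
    </office:text>
  </office:body>
</office:document-content>
"""


def styles_in_use(xml_text):
    """Return (element, style name, kind) for each styled element in the body."""
    root = ET.fromstring(xml_text)
    auto_section = root.find(f"{{{OFFICE_NS}}}automatic-styles")
    automatic = {
        style.get(f"{{{STYLE_NS}}}name")
        for style in auto_section.findall(f"{{{STYLE_NS}}}style")
    }
    result = []
    body = root.find(f"{{{OFFICE_NS}}}body")
    for elem in body.iter():
        name = elem.get(f"{{{TEXT_NS}}}style-name")
        if name is not None:
            # Anything not defined in this file is assumed to live in styles.xml.
            kind = "automatic" if name in automatic else "global"
            local_name = elem.tag.split("}")[1]
            result.append((local_name, name, kind))
    return result


for element, style_name, kind in styles_in_use(SAMPLE_CONTENT):
    print(element, style_name, kind)
```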
In order to extract information from an OpenDocument file, you need the following tools:
In order to create an OpenDocument file, all you really need is the ability to create a ZIP or JAR format file. An XML library that lets you serialize a data structure to a text file is extremely helpful.
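A minimal sketch of the creation side, in Python, might look like the following. It writes a bare-bones text document package into memory. The OpenDocument specification recommends a file named mimetype as the first member, stored without compression, so the sketch writes one. The helper name build_package and the sample text are invented, and I make no promise that every application will accept a package this sparse.

```python
import io
import zipfile

MIMETYPE = "application/vnd.oasis.opendocument.text"

CONTENT = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    office:version="1.0">
  <office:body>
    <office:text>
      <text:p>Hello, OpenDocument.</text:p>
    </office:text>
  </office:body>
</office:document-content>
"""

MANIFEST = f"""<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest
    xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0">
  <manifest:file-entry manifest:media-type="{MIMETYPE}" manifest:full-path="/"/>
  <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
</manifest:manifest>
"""


def build_package(target):
    """Write a minimal package to `target` (a file name or file-like object)."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype member goes first and is stored uncompressed.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.xml", CONTENT)
        zf.writestr("META-INF/manifest.xml", MANIFEST)


# Build in memory here; pass a file name such as "hello.odt" to write to disk.
package = io.BytesIO()
build_package(package)

with zipfile.ZipFile(package) as zf:
    print(zf.namelist())
```

In a real program, the content would come from an XML library that serializes your data structure rather than from a string constant.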
Does anyone actually use the ability to extract OpenDocument information in the real world? I can’t speak for the real world, but I can speak for myself and tell you that I have. The first two applications are ones that I’ve actually used.
When I’m not teaching, I am a volunteer for a local amateur wrestling association, and part of my web update duties is to keep track of tournament attendance. Up until a month or so ago, I made charts of the attendance from a spreadsheet, as shown in Figure 6.
Then I read Edward Tufte’s book, The Visual Display of Quantitative Information, and realized that the chart served only to obscure the relationships that are obvious from the table. The task, then, was to extract the information from the tables in the spreadsheet and put them into an HTML file.
Here is an XSLT transformation that will extract the information. The only really interesting part of the transformation is the part that handles repeated cells. In order to save space, OpenDocument lets you specify a table:number-columns-repeated attribute on a cell for a series of cells with identical content. In any ordinary programming language, we’d just write a loop to handle the repetition, but XSLT doesn’t have such an iterative construct. Instead, the process-cell template uses recursion to get its results.
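The recursive idea can be sketched in Python as well. This is only an analogy to the process-cell template, not a translation of it; the sample row and the function names repeat_cell and expand_row are invented for the example.

```python
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hand-written sample row: the middle cell stands for three identical cells.
SAMPLE_ROW = f"""
<table:table-row xmlns:table="{TABLE_NS}" xmlns:text="{TEXT_NS}">
  <table:table-cell><text:p>Tournament</text:p></table:table-cell>
  <table:table-cell table:number-columns-repeated="3"><text:p>0</text:p></table:table-cell>
  <table:table-cell><text:p>42</text:p></table:table-cell>
</table:table-row>
"""


def repeat_cell(text, remaining):
    """Emit `text` once, then recurse for the remaining repetitions."""
    if remaining <= 0:
        return []
    return [text] + repeat_cell(text, remaining - 1)


def expand_row(row):
    """Return the row's cell values with repeated cells written out in full."""
    values = []
    for cell in row.findall(f"{{{TABLE_NS}}}table-cell"):
        paragraphs = cell.findall(f"{{{TEXT_NS}}}p")
        text = "\n".join("".join(p.itertext()) for p in paragraphs)
        count = int(cell.get(f"{{{TABLE_NS}}}number-columns-repeated", "1"))
        values.extend(repeat_cell(text, count))
    return values


row = ET.fromstring(SAMPLE_ROW)
print(expand_row(row))  # ['Tournament', '0', '0', '0', '42']
```

In Python itself, a plain loop (or [text] * count) would be the usual choice, since a very large repeat count could run into the interpreter's recursion limit.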
To do the transformation, we used a Java program that accesses the content.xml file and hands it to the Xalan XSLT engine. As a double check, we also unzipped the content.xml file and piped it to the xsltproc program with this command:
Speaking of ordinary programming languages, we wrote a Perl program to extract the data. Here the challenge is slightly different. In Java, when you open a member of a zip file, it’s indistinguishable from any other input stream. On the other hand, the Archive::Zip module doesn’t treat member files as streams. This means we need another Perl program which uses Archive::Zip to open the file and then print it to standard output. The transformation program can then open a pipe to the reader program and take its output as a true stream.
In Python, the zipfile library doesn’t treat members of a .zip archive as normal streams. Instead, you simply read the entire file into a string. This Python program that extracts the table doesn’t use XPath at all; it just uses the plain Document Object Model calls.
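The read-it-all-then-parse approach can be sketched as follows. This is not the program referred to above; it is a small stand-alone illustration that uses xml.dom.minidom, with an invented sample spreadsheet (the table name and numbers are made up) and hypothetical helper names node_text and read_tables. Newer versions of the zipfile module also offer ZipFile.open, which returns a file-like object, but reading the whole member is still the simplest route for a file of this size.

```python
import io
import zipfile
from xml.dom import minidom

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Hand-written sample content.xml; the data values are invented.
SAMPLE_CONTENT = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body><office:spreadsheet>
    <table:table table:name="Attendance">
      <table:table-row>
        <table:table-cell><text:p>Year</text:p></table:table-cell>
        <table:table-cell><text:p>Wrestlers</text:p></table:table-cell>
      </table:table-row>
      <table:table-row>
        <table:table-cell><text:p>2004</text:p></table:table-cell>
        <table:table-cell><text:p>120</text:p></table:table-cell>
      </table:table-row>
    </table:table>
  </office:spreadsheet></office:body>
</office:document-content>
"""


def node_text(node):
    """Concatenate all text nodes beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


def read_tables(package):
    """Map each table name to its rows, using only DOM calls (no XPath)."""
    with zipfile.ZipFile(package) as zf:
        xml_bytes = zf.read("content.xml")  # the whole member, in memory
    doc = minidom.parseString(xml_bytes)
    tables = {}
    for table in doc.getElementsByTagNameNS(TABLE_NS, "table"):
        rows = []
        for row in table.getElementsByTagNameNS(TABLE_NS, "table-row"):
            cells = row.getElementsByTagNameNS(TABLE_NS, "table-cell")
            rows.append([node_text(cell) for cell in cells])
        tables[table.getAttributeNS(TABLE_NS, "name")] = rows
    return tables


# Build a stand-in spreadsheet package in memory so the sketch runs on its own.
package = io.BytesIO()
with zipfile.ZipFile(package, "w") as zf:
    zf.writestr("content.xml", SAMPLE_CONTENT)

print(read_tables(package))
```

For brevity, this sketch ignores the table:number-columns-repeated attribute; a real extractor would have to expand repeated cells.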
Ruby has a zip file module, but its ZipInputStream class does not supply all the calls necessary to make it indistinguishable from a true input stream. Thus, our Ruby program for extracting the tables must also read the file into a string and parse that string.
Another one of my duties is to keep track of the “grand champion” award. Competitors who place first at a local tournament get three points towards the award, second place gets two points, and third place gets one point. For state and regional tournaments, first place gets five points, second place gets four points, and so on. The person with the most points at the end of the season in each age group wins the award.
I enter the data from the result sheets, and rather than do the subtraction in my head, I just enter the placing into the spreadsheet for local tournaments. For state tournaments, which have fewer results to enter, I figure out the points and enter it as a negative number. Thus, in this spreadsheet, you can see that Ernie Varela got first place in the San Benito tournament and two points in the state tournament at Lemoore.
I've written a Perl program to convert the spreadsheet for each age group into the standings that are posted on the web site. In this Perl program, acquiring the contents of a table row constitutes only about 25 lines of code; the remainder is calculation and output.
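The scoring convention itself is easy to sketch. The following Python fragment is not the Perl program; it only illustrates the rule described above, where a positive entry is a placing at a local tournament and a negative entry is a pre-computed point value. The function names, and the choice to treat zero and placings below third as worth no points, are assumptions made for the example.

```python
def points_for(entry):
    """Convert one spreadsheet entry into award points."""
    if entry < 0:
        # State or regional result: points were worked out by hand
        # and entered as a negative number.
        return -entry
    if entry == 0:
        # Assumption: zero (or blank) means the competitor did not place.
        return 0
    # Local tournament placing: first = 3 points, second = 2, third = 1.
    # Assumption: placings below third earn nothing.
    return max(0, 4 - entry)


def season_total(entries):
    """Add up the points for all of a competitor's entries."""
    return sum(points_for(entry) for entry in entries)


# First place at a local tournament plus two points at a state tournament.
print(season_total([1, -2]))  # prints 5
```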
Of course, you want to be able to take data in other forms and convert it to OpenDocument. Here is a gradebook implemented in XML (all the names and emails and student ID numbers have been changed). I’ve written Java code to let me edit the gradebook. At the end of the semester, I have to calculate everyone’s grade and turn in a spreadsheet. This XSLT stylesheet takes the XML file and creates an OpenDocument spreadsheet content file as its output, as shown in Figure 8.
The following example is just a proof of concept: a Perl program that converts a spreadsheet containing a survey to a word processing document that shows the results with a pie chart for each question. (Yes, I know that Edward Tufte says never to use pie charts. This is just an example.)
It’s important to know that OpenDocument was not invented in a vacuum. Wherever possible, OpenDocument re-uses existing standards. For example, take this formula for standard deviation. If you look at its content.xml file, you will see that it is standard MathML.
OpenDocument also re-uses elements from the Dublin Core in describing a document’s metadata, XSL Formatting Objects in text documents, and Scalable Vector Graphics for drawings.
More importantly, OpenDocument re-uses the XForms standard. The XForms standard is built around four concepts:
Figure 14 shows part of a form created in OpenOffice.org and its instance data.
Binding a control to part of the instance data is done with an XPath expression. In Figure 15, we see the binding that associates the form for entering the seller’s name with the corresponding node in the instance data.
One of the advantages of an XForm over a normal HTML form is that you can submit the data in two different ways. Figure 16 shows two submit buttons and the corresponding bindings. The savefile submission will put a file on disk, leaving the instance data untouched. The getData submission will call a program on a server to fill in the product description.
You’ve seen that a simple set of tools will let you extract information from and create OpenDocument files. You’ve seen a few simple applications, which, I hope, have sparked your imagination and interest. Now it’s your turn to create something great!