Quoderat

Archive for January, 2005

SAX: biggest regrets

Monday, January 31st, 2005

It’s seven years ago this January that I put out the first prerelease of SAX for consideration by the xml-dev mailing list. The final SAX releases contain the wisdom of a lot of people, but in the end, I had to make the final decisions about how it would work, and my record was mixed. Now that SAX is a standard (if unremarkable) part of the XML infrastructure, I thought it would be worth making two or three posts about what went wrong and what went right. In this post, I’ll start with my three biggest regrets about SAX/Java:

SAXException does not extend IOException

XML parsing is a kind of I/O, and the exception should have reflected that. If we had done things that way, any library that does XML parsing could simply have thrown IOException, without having to expose any XML stuff at all or to force tunnelling of exceptions inside other exceptions, etc. This one bugs me every time I code with SAX.
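To make the tunnelling concrete, here is a minimal sketch of what a library has to do today if it wants to expose only IOException. The class name ElementCounter and its method are invented for illustration; the parser calls are the standard JAXP/SAX ones.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical library class: callers should not need to know it uses XML.
class ElementCounter {

    // Counts the elements in a document; the signature exposes only IOException.
    static int countElements(String xml) throws IOException {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | ParserConfigurationException e) {
            // The tunnelling step: SAXException is not an IOException,
            // so it has to be wrapped inside one.
            throw new IOException("could not read document", e);
        }
        return count[0];
    }
}
```

A caller who wants the underlying parse error has to dig it out with getCause(), which is exactly the awkwardness described above.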

SAX uses callbacks instead of a pull interface

In this case, though, I probably wouldn’t do things differently if I could go back in time. To get acceptance, SAX had to work with all existing Java/XML parsers. They used callbacks, and the only way to get a pull interface would have been to run the parser in a separate thread, an approach that wasn’t all that stable back in early 1998 (especially not on Windows). Callbacks are not a serious problem for most applications, but they do make event dispatching much more difficult and sometimes they make for messy, hard-to-maintain code. Now that Java thread support is rock-solid on all platforms, it’s easy enough to write a good pull-parsing adapter for SAX (I have one that I can release, if anyone cares). I’ve played around with StAX a bit, but none of the StAX drivers seems as stable as the SAX ones.
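A thread-based adapter of the kind described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the adapter mentioned in this post: the class name SaxPullAdapter, the string-encoded events, and the END marker are all invented for illustration, and a real adapter would carry attributes, text, and proper error objects.

```java
import java.io.StringReader;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical adapter: the SAX parser pushes events on a background thread,
// and the application pulls them from a queue whenever it is ready.
class SaxPullAdapter {
    static final String END = "END_DOCUMENT";

    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();

    SaxPullAdapter(final String xml) {
        Thread producer = new Thread(() -> {
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    events.add("START " + qName);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    events.add("END " + qName);
                }
            };
            try {
                SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            } catch (Exception e) {
                // Report the failure as an event instead of losing it on this thread.
                events.add("ERROR " + e.getMessage());
            } finally {
                // Always signal the end so the consumer never blocks forever.
                events.add(END);
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    // Blocks until the parser thread has produced the next event.
    String next() throws InterruptedException {
        return events.take();
    }
}
```

The queue here is unbounded, so the parser can run arbitrarily far ahead of the consumer; a bounded queue would hold the parser back until the application asks for more, at the cost of handling interruption inside the callbacks.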

SAX2 isn’t really simple

The original vision for SAX was to keep it dead simple. The XML 1.0 REC required that we report certain information, like processing instructions, but otherwise, I wanted to keep it as close to elements-attributes-content as humanly possible. SAX1 didn’t do too bad a job of that. SAX2 had to add support for namespaces, which messed up all the interfaces; at that point, people were screaming for all kinds of esoteric stuff that about 12 people in the world care about (e.g. entity boundaries). Instead of making SAX even more complicated, I invented the property and extension interfaces so that people could invent new things without cluttering the core. Then SAX ended up with all kinds of new, optional interfaces in the distribution anyway, so it’s quite nightmarish for a new user trying to figure out what matters and what doesn’t. If I ever put out a SAX3, I’ll do most of the work using the delete key, but that’s probably not possible when things like JAXP depend so heavily on SAX.
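For readers who have not met the property mechanism, here is a small sketch of how an optional extension gets attached without touching the core interfaces. The lexical-handler property URI and DefaultHandler2 are part of the standard SAX2 distribution; the class name CommentCollector is made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: collects XML comments, which the core ContentHandler
// never reports, by registering an optional LexicalHandler as a property.
class CommentCollector {
    static List<String> comments(String xml) throws Exception {
        final List<String> found = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                found.add(new String(ch, start, length));
            }
        };
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // The extension hook: a property identified by a URI, not a new core method.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return found;
    }
}
```

The same LexicalHandler also carries the startEntity and endEntity callbacks, which is where the entity-boundary reporting ended up.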

The weblog stack

Monday, January 31st, 2005

Networking people love to talk about the network stack, like the 4-layer DoD model or the 7-layer OSI model, and web services boosters have picked up on that with their talk about the web services stack (an example from Judith Myerson at IBM, an example from David Orchard at BEA, and a bit of skepticism from Kendall Grant Clark).

Should we be talking about a weblog stack? The web services stack almost always starts with HTTP rather than going all the way down into the lower-level networking protocols, so a similar weblog stack using RSS 2.0 would look something like this:

HTTP
XML
Namespaces
RSS 2.0
RSS 2.0 extensions (like the well-formed web extensions)

A diagram like this helps me to write an RSS library or aggregator, but does it leave me any more aware of how the blogosphere ticks? Not really, because not everything passes through this stack. For a non-full-text feed, for example, the headline and description show up this way, but then the main posting gets to me through a normal web HTTP+HTML route, totally independent of XML or RSS. Other kinds of communication bypass my proposed stack completely, like trackback and pingback, or even Technorati rankings for that matter.

Building a stack provides a cute technical model of one step in the weblog process, but it doesn’t explain how the whole thing works, much less why it works. In fact, human social products are almost always too messy to capture in simple trees or stacks. I faced exactly the same issue when I used to teach the history of the English language at university. Technically, English is descended in a straight line from Old English, which is descended from proto-Germanic, which is descended from Indo-European. In reality, though, English borrowed an enormous amount of vocabulary and even syntax from languages like Latin, Greek, and French, which are not direct ancestors: imagine that you had your grandmother’s ears, but the nose of someone your mother happened to pass by on the sidewalk one day and a heart condition inherited from your father’s favourite 17th-century Dutch painter, and you’ll see the problem.

Maybe the fact that weblog activities do not fit into a simple stack is not an unfortunate sign of a lack of intellectual rigour but the very reason for their success. Web services people, take note: you might want to try thinking less about new specifications and more about human behaviour.

Linking XML documents

Saturday, January 29th, 2005

[Update: help is on the way.] If you start with an XML document online (and granted, there are precious few of them), how do you use it to find other XML documents? If they’re XML+XHTML documents, you can follow the URLs in any xhtml:a/@href attributes you find in the document; if they’re XML+RDF documents, you can follow the @rdf:about and @rdf:resource attributes; if they’re XML+Docbook documents, you can follow the ulink/@url attributes; and so on.

But what about plain old XML? The best candidate seems to be XLink. While the specification is excessively complicated, it does offer the global xlink:href attribute as a simple linking attribute that any type of XML document can use: some document types, like XML Topic Maps, have taken full advantage of it.

Unfortunately, there is no conformant way to use just xlink:href in an XML document; every time it appears, you also need to have the xlink:type attribute set to the value “simple”. Oops! XTM gets around that by declaring the attribute with a #FIXED value in its DTD, so that it does not have to be repeated in the document itself, but we can hardly require every XML document online to use a DTD or schema, and if they don’t include xlink:type, they’re not conformant. So we cannot simply have

<musician xlink:href="http://www.example.org/bach/"/>
<musician xlink:href="http://www.example.org/beethoven/"/>
<musician xlink:href="http://www.example.org/vivaldi/"/>

but rather, we are forced to use

<musician xlink:type="simple" xlink:href="http://www.example.org/bach/"/>
<musician xlink:type="simple" xlink:href="http://www.example.org/beethoven/"/>
<musician xlink:type="simple" xlink:href="http://www.example.org/vivaldi/"/>

That gets extremely annoying after a few hundred times, probably enough to prevent it from getting universal acceptance. So what do we do? Is there any way to cheat and say something like all XML documents that do not have a DTD are assumed to have an implied DTD with a fixed declaration of xlink:type for every element? I don’t think so. The XLink recommendation was written by some of the brightest people in XML, and I’m sure that they didn’t intend for it to be so awkward for the simplest (and most common) case. It would be wonderful if the W3C could put out some kind of corrigendum stating that when xlink:type is missing, it defaults to “simple”. That’s all we need. Really.
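On the consuming side, the proposed default is a one-line check. The sketch below is only an illustration of that wished-for behaviour, not something the XLink recommendation as described here permits: the class name LenientXLinkCollector is invented, and it assumes the document declares the standard XLink namespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical link collector that treats a missing xlink:type as "simple".
class LenientXLinkCollector {
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    static List<String> simpleLinks(String xml) throws Exception {
        final List<String> links = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                String href = atts.getValue(XLINK_NS, "href");
                String type = atts.getValue(XLINK_NS, "type");
                // Assumption for this sketch: no xlink:type means "simple".
                if (href != null && (type == null || type.equals("simple"))) {
                    links.add(href);
                }
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Namespace awareness is needed to look attributes up by namespace URI.
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return links;
    }
}
```

With this check, the short form and the verbose form of the musician elements above yield the same list of links, while elements carrying some other xlink:type are left alone.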

[Update: I forgot to mention that the W3C's XML Linking working group no longer exists to make any changes to the spec.]

[Update: it turns out that the XML Core WG is working on this very issue: two days after my original posting, entirely by coincidence, Norm Walsh posted that xlink:type will likely become optional.]

Welcome to Quoderat

Saturday, January 29th, 2005

Welcome to Quoderat, David Megginson’s middle-aged grumblings about technology. My background is largely in XML and the web (I led the development of SAX and spent some time on W3C working groups), so markup, networking, and software development will be the primary topics, though the weblog will likely strike off in other directions. If you get bored of all this, feel free to wander off and read about airplanes in my other weblog, Land and Hold Short.