SAX: biggest regrets
January 31st, 2005
It’s seven years ago this January that I put out the first prerelease of SAX for consideration by the xml-dev mailing list. The final SAX releases contain the wisdom of a lot of people, but in the end, I had to make the final decisions about how it would work, and my record was mixed. Now that SAX is a standard (if unremarkable) part of the XML infrastructure, I thought it would be worth making two or three posts about what went wrong and what went right. In this post, I’ll start with my three biggest regrets about SAX/Java:
- SAXException does not extend IOException
XML parsing is a kind of I/O, and the exception should have reflected that. If we had done things that way, any library that does XML parsing could simply have thrown IOException, without having to expose any XML stuff at all or to force tunnelling of exceptions inside other exceptions, etc. This one bugs me every time I code with SAX.
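To make the tunnelling concrete, here is a minimal sketch of what a library has to do today if it wants callers to see nothing but IOException. The class name ConfigLoader and both of its methods are hypothetical and exist only for illustration; the JAXP and SAX calls are the standard ones.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical library class: callers should only ever have to deal with IOException.
final class ConfigLoader {

    // Parses the given XML text and hides every XML-specific exception type.
    static void load(String xml) throws IOException {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
        } catch (SAXException | ParserConfigurationException e) {
            // SAXException is not an IOException, so it has to be tunnelled
            // inside one, with the original kept as the cause.
            throw new IOException("could not read configuration", e);
        }
    }

    // Demonstration helper: true if load() failed with an IOException
    // whose cause is a SAXException.
    static boolean failsWithTunnelledSaxException(String xml) {
        try {
            load(xml);
            return false;
        } catch (IOException e) {
            return e.getCause() instanceof SAXException;
        }
    }
}
```

Had SAXException extended IOException, the try/catch in load() would be unnecessary and the method could simply declare throws IOException.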
- SAX uses callbacks instead of a pull interface
In this case, though, I probably wouldn’t do things differently if I could go back in time. To get acceptance, SAX had to work with all existing Java/XML parsers. They used callbacks, and the only way to get a pull interface would have been to run the parser in a separate thread, an approach that wasn’t all that stable back in early 1998 (especially not on Windows). Callbacks are not a serious problem for most applications, but they do make event dispatching much more difficult and sometimes they make for messy, hard-to-maintain code. Now that Java thread support is rock-solid on all platforms, it’s easy enough to write a good pull-parsing adapter for SAX (I have one that I can release, if anyone cares). I’ve played around with StAX a bit, but none of the StAX drivers seems as stable as the SAX ones.
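The separate-thread approach described above can be sketched roughly as follows. This is not the adapter mentioned in the post, only an illustration of the general idea: the class name PullAdapter, the string-based event format, and the END marker are all assumptions made for this example.

```java
import java.io.StringReader;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical pull-style wrapper: the SAX parser runs in its own thread and
// pushes events onto a queue; the application pulls them off with next().
final class PullAdapter {
    static final String END = "END_DOCUMENT";

    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();

    PullAdapter(final String xml) {
        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                        new InputSource(new StringReader(xml)),
                        new DefaultHandler() {
                            @Override
                            public void startElement(String uri, String localName,
                                                     String qName, Attributes atts) {
                                events.add("START " + qName);
                            }

                            @Override
                            public void endElement(String uri, String localName,
                                                   String qName) {
                                events.add("END " + qName);
                            }
                        });
            } catch (Exception e) {
                // Report the failure as an ordinary event so the consumer sees it.
                events.add("ERROR " + e.getMessage());
            } finally {
                // Always signal the end so the consumer never blocks forever.
                events.add(END);
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    // Blocks until the parser thread has produced the next event.
    String next() {
        try {
            return events.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return END;
        }
    }
}
```

The queue here is unbounded, so the parser can run arbitrarily far ahead of the consumer; a real adapter would more likely use a bounded queue so that a slow consumer holds the parser back.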
- SAX2 isn’t really simple
The original vision for SAX was to keep it dead simple. The XML 1.0 REC required that we report certain information, like processing instructions, but otherwise, I wanted to keep it as close to elements-attributes-content as humanly possible. SAX1 didn’t do too bad a job of that. SAX2 had to add support for namespaces, which messed up all the interfaces; at that point, people were screaming for all kinds of esoteric stuff that about 12 people in the world care about (e.g. entity boundaries). Instead of making SAX even more complicated, I invented the property and extension interfaces so that people could invent new things without cluttering the core. Then SAX ended up with all kinds of new, optional interfaces in the distribution anyway, so it’s quite nightmarish for a new user trying to figure out what matters and what doesn’t. If I ever put out a SAX3, I’ll do most of the work using the delete key, but that’s probably not possible when things like JAXP depend so heavily on SAX.
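As an illustration of the property mechanism, the sketch below asks a parser to report entity boundaries through the optional LexicalHandler extension, registered under the standard lexical-handler property URI. The class name EntityBoundaries and its trace() method are hypothetical; support for the property is optional in SAX2, so this assumes a parser that recognizes it (the one bundled with the JDK is expected to).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper that records where entities start and end.
final class EntityBoundaries {

    static List<String> trace(String xml) {
        final List<String> seen = new ArrayList<>();
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();

            // Entity boundaries are not part of the core ContentHandler; they are
            // delivered to an extension handler that is attached as a property.
            reader.setProperty(
                    "http://xml.org/sax/properties/lexical-handler",
                    new DefaultHandler2() {
                        @Override
                        public void startEntity(String name) {
                            seen.add("start " + name);
                        }

                        @Override
                        public void endEntity(String name) {
                            seen.add("end " + name);
                        }
                    });

            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Assumption for this sketch: any failure, including an unsupported
            // property, is rethrown unchecked.
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

An application that does not care about entity boundaries never has to see any of this, which is the point of keeping such features out of the core interfaces.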
February 2nd, 2005 at 11:07:17
[...] ginson who led that effort seemingly effortlessly although not without some noteworthy regrets which I agree with. Wh [...]
February 3rd, 2005 at 03:05:23
[...] ions Filed under: programming — david @ 10:01 pm Recently, I mentioned my biggest regrets about SAX. When we were b [...]
February 3rd, 2005 at 02:59:34
I wonder if I count as one of the twelve?
SAX is a fabulous piece of work, but I so desperately wish that it passed the base URI and original, possibly relative, system identifier to entityResolver() that it almost makes me want to cry when I think about it.
February 4th, 2005 at 02:13:43
[...] t into SAX, and one that I sometimes wish we had taken up (though it’s not one of my biggest regrets): represent any XML name [...]
February 6th, 2005 at 06:06:40
XML parsing is not I/O, and a SAXException is not a kind of IOException. This one you got right. Consider the case of an XML document stored in a String literal. The XOM unit test suite is loaded with these things. No I/O needs to be performed to parse them. Everything already exists within memory. Parsing is a completely separate operation from input and output, and it should be logically separate. Parsing is defined to operate on a sequence of bytes or characters. We often choose to represent that sequence as a stream for the sake of convenience and efficiency. However, we don’t need to. Parsing would work equally well if the data were represented as a byte array, char array, or something else.
There’s also the issue that even parsing a string may cause I/O to be done if an external DTD subset needs to be loaded. However, again the parsing of the XML and the input and output of that XML are two different operations. Problems in one are not problems in the other. They are conceptually distinct.
February 7th, 2005 at 09:58:08
Elliotte Rusty Harold wrote:
February 8th, 2005 at 01:51:31
[...] ry format then decide to cruft it up by layering a crufty XML API on it is beyond me. [0] http://www.megginson.com/blogs/quoderat/archi [...]
February 9th, 2005 at 01:26:55
[...] ng for you and give you a predigested view of the information. Nearby, Megginson’s biggest regrets and biggest satisfactions [...]