Quoderat

Archive for February, 2005

Business requirements: the weakest link?

Monday, February 28th, 2005

Nobody, or at least almost nobody, in the software engineering world believes in the waterfall design model any more. Like Santa Claus, waterfall sounded like a great idea (old guy comes down chimney and leaves free stuff in living room; planners work out all issues during design phase so that implementation is easy and predictable), but a little maturity and experience force us to leave both fantasies behind.

So now we mostly agree that software and information system design has to be iterative and agile, and after years … well, decades … of heartbreaking failure, we’re actually starting to be able to deliver projects that work on a reasonable schedule and for reasonable cost.

But how do customers know what to order in the first place? Out of curiosity, I spent some time reading through the UN/CEFACT Modeling Methodology, or UMM, the business-side counterpart to the technological ebXML specifications from OASIS (the two can be used independently). If you don’t feel like reading hundreds of pages of documentation, here’s a quick summary: the UMM is a top-down business-modelling approach using forms and UML models to move from an abstract, executive-level view of a business to a more concrete, information-management view. Once the final, low-level UML model is ready, it can be handed off to technologists for implementation using ebXML, Web Services, or anything else convenient (it tries hard to be technology-agnostic). The decomposition of business modeling goes through four stages, called views:

  1. the business domain view separates the business into areas and processes, from the perspective of senior management
  2. the business requirements view deals with scenarios, inputs, outputs, and so on, from the perspective of an expert in the business domain
  3. the business transaction view is pretty much the same thing, but more concrete and from a more technological perspective (i.e. less what? and more how?)
  4. the business service view deals with services, agents, and other stuff, and gets passed off to the software developer for implementation

In other words, it’s a waterfall. If that approach doesn’t make sense for technology, can it make sense for business management? As far as I can tell, business management and technology both deal with building robust, predictable systems against constantly-changing requirements and unpredictable inputs, so my own hunch is that a top-down approach like the UMM will not serve business itself well, and will make life hard for the technologists feeding out of the bottom of it.

In fact, I think that the problem may go much deeper, because no matter how business requirements themselves are developed, top-down or bottom-up, there is usually a waterfall-style leap between business requirements and technology requirements: businesses are supposed to figure out what they want, and technologists are supposed to figure out how to give it to them. In reality, however, business requirements are often driven by the technology available: few businesses were interested in setting up an online presence, for example, until the web made it cheap and easy to reach customers that way; outsourcing technical support to offshore call centres is economical only because of certain types of technology; opening up an organization’s systems makes sense only if enough potential partners are using something like ebXML or Web Services; and so on.

We’ve gotten a lot better at building systems to spec, so if we’re going to see similar improvements in the future, we’ll have to start looking at the specs themselves, learning to iterate all the way back up to the top, instead of just inside our little technology sandbox — in other words, it’s not just the technical requirements but the business requirements that have to be agile. If that means that the CEO occasionally has to be troubled with nuts-and-bolts details like web protocols or database scalability, so be it — it beats losing the whole company to a bad technology decision.

REST design question #5: the “C” word (content)

Wednesday, February 23rd, 2005

The other posts in this series of REST design questions have danced around the edge of the content problem, dipping in their toes with issues like identification and linking, but now that the design questions are coming to a close, it’s time to dive right into REST’s biggest problem: content.

The principles of REST tell you how to manage resources in a CRUDdy way, but not what you can actually do with those resources. This is not a problem shared by other XML networking approaches: XML-RPC defines precisely what its XML content means, to the point that it can be serialized and deserialized invisibly and automatically; SOAP allows any kind of XML payload in principle (assuming it’s wrapped in a SOAP envelope), but most people use the default SOAP encoding which, again, can be serialized and deserialized somewhat automatically. REST, on the other hand, is pure architecture without any direct mention of content. RESTafarians boast that there are RESTful web applications already online for Amazon, , eBay, Flickr, and many others, but developers quickly figure out that they don’t get any benefit: each REST application requires its own separate stovepipe of code support right from the ground up, because they all use different content formats. If these all used XML-RPC or SOAP, there would be many standard libraries to simplify the developers’ work, and a lot of shared code that could work with all these sites.

Is REST, in practical terms, nothing more than a marketing word?

RESTafarians can argue that the lack of content standardization is a good thing, because it leaves the architecture flexible enough to deal with any kind of resource, from an XML file to an image to a video to an HTML page (moving the last two using XML-RPC or SOAP can be less than pleasant). On the other hand, the lack of any kind of standard content format makes it hard actually to do anything useful with RESTful resources once you’ve retrieved them. People have put forward candidates for standard XML-encoded REST content, including RDF and XTM, but it’s unlikely that either of these will take off, especially since RDF (the leader) does not even work nicely with most other XML-based specifications like XQuery or XSLT.

Standardizing XML REST content in bits and pieces

The alternative is to standardize content in bits and pieces — instead of trying to come up with a comprehensive data-encoding format, we can try to come up with a profile of standard markup bits that people can use in any kind of XML data document. Here are some of the possibilities:

xlink:href and xml:id for linking

I’ve already mentioned how the use of the xlink:href attribute will make it possible to design XML data crawlers similar to HTML crawlers, along with search engines and all the other good things that follow: no matter what the document type, the engine will be able to find the links.

Together with xlink:href, xml:id can allow links to point to fragments of XML documents easily, making it possible to refer to embedded resources.

<data xmlns:xlink="http://www.w3.org/1999/xlink">
  <person xml:id="dpm">
    <name>David Megginson</name>
  </person>
 
  <weblog>
    <title>Quoderat</title>
    <author xlink:href="#dpm"/>
  </weblog>
</data>

This stuff is critical — since REST is all about linking, lack of a standard linking mechanism in content will simply kill it before it can even start.
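As a rough illustration of the vocabulary-independent processing described above, here is a minimal Python sketch using only the standard library. It finds every xlink:href in a document without knowing anything about the element names, and resolves same-document "#id" references through xml:id. The function names (find_links, resolve_fragment) and the sample document are assumptions made for this example, not part of any existing crawler.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes in {namespace-uri}local form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Illustrative sample, modelled on the example above.
SAMPLE = """\
<data xmlns:xlink="http://www.w3.org/1999/xlink">
  <person xml:id="dpm">
    <name>David Megginson</name>
  </person>
  <weblog>
    <title>Quoderat</title>
    <author xlink:href="#dpm"/>
  </weblog>
</data>
"""


def find_links(root):
    """Return (tag, href) for every element carrying xlink:href."""
    return [(el.tag, el.get(XLINK_HREF))
            for el in root.iter() if XLINK_HREF in el.attrib]


def resolve_fragment(root, href):
    """Return the element whose xml:id matches a '#id' reference, or None."""
    if not href.startswith("#"):
        return None  # external link; a real crawler would fetch it instead
    wanted = href[1:]
    for el in root.iter():
        if el.get(XML_ID) == wanted:
            return el
    return None


root = ET.fromstring(SAMPLE)
links = find_links(root)
target = resolve_fragment(root, links[0][1])
```

Nothing in the two functions mentions person, weblog, or author, which is the point: the same code would work on any document type that uses the two attributes.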

xml:base for document identification

Similarly, the xml:base attribute can provide an identifier and locator for an XML data document. An xml:base attribute attached to the root element can give both a base URL for resolving relative links in the document and a global identifier for the document.

<data xml:base="http://www.example.org/data/foo.xml">
  ...
</data>

xsi:type for data typing (?)

Do we need data typing at all in XML? The use of external schemas is generally a bad idea both for performance and security reasons, so if we want typing at all (at least for simple data types), we should do it in the document instance itself, using something similar to the xsi:type attribute. Norman Walsh doesn’t like this approach, but for reasons different from mine: I think that typing information is useful mainly for authoring, not publishing; Norman would prefer to see it offloaded into external schemas. If you want typing at all, I think that something like

<start-date xsi:type="xsd:date">2005-02-23</start-date>

is generally inoffensive, aside from the fact that it uses Namespace prefixes in attribute values (a bit of a nasty kludge). Compared with bolting a whole schema onto our poor little XML data document, however, it’s a lightweight solution, assuming that it actually adds useful information.
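To make the trade-off concrete, here is a small Python sketch of how a consumer might act on an inline xsi:type. The converter table and the typed_value function are hypothetical, and the sketch deliberately matches the literal string "xsd:date", which only works if the document uses the conventional xsd prefix. That shortcut is the prefix-in-attribute-value kludge in action: the standard-library ElementTree API does not keep prefix bindings on parsed elements, so resolving the prefix properly would take extra work.

```python
import xml.etree.ElementTree as ET
from datetime import date

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical converter table; assumes the conventional "xsd" prefix.
CONVERTERS = {
    "xsd:date": date.fromisoformat,
    "xsd:integer": int,
}

SAMPLE = (
    '<start-date xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xmlns:xsd="http://www.w3.org/2001/XMLSchema" '
    'xsi:type="xsd:date">2005-02-23</start-date>'
)


def typed_value(el):
    """Convert element text according to xsi:type, falling back to str."""
    convert = CONVERTERS.get(el.get(XSI_TYPE), str)
    return convert((el.text or "").strip())


value = typed_value(ET.fromstring(SAMPLE))
untyped = typed_value(ET.fromstring("<name>Jane Smith</name>"))
```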

Dublin Core for simple, basic properties (??)

The Dublin Core failed completely in the HTML meta element, and many people don’t think it’s particularly well set up, but somehow those original 15 simple property names still have a lot of popular recognition in the tech community. By far the most useful of the property names is dc:title, which identifies the name of a resource (for display in a pick list, search engine results, and so on).

<city xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>San Diego</dc:title>
  <region>California</region>
  <country>US</country>
  <population>1223400</population>
</city>

Will people go for this, though, or will the Dublin Core fizzle out here as well?
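If dc:title did catch on, a generic tool could label any record without knowing its vocabulary. The following Python sketch shows the idea; display_label and its fallback string are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"

SAMPLE = """\
<city xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>San Diego</dc:title>
  <region>California</region>
  <country>US</country>
  <population>1223400</population>
</city>
"""


def display_label(root, fallback="(untitled)"):
    """Return the dc:title child of the root element, or a fallback."""
    title = root.find(DC_TITLE)
    if title is not None and title.text:
        return title.text.strip()
    return fallback


label = display_label(ET.fromstring(SAMPLE))
no_label = display_label(ET.fromstring("<film><title>Sixteen Candles</title></film>"))
```

The second call falls back because an unqualified title element is not dc:title, which is exactly why a shared property name would matter to a pick list or search engine.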

What else?

What other bits and pieces are out there that people would actually use in XML data files served out by RESTful web applications? I’m not convinced that the xml:space attribute is all that useful for generic XML data files, since it’s about formatting rather than meaning; the xml:lang attribute is useful in XML documents intended for human readers, as I’ve mentioned, but for fielded data, I’d rather see language information in its own proper field, maybe using the Dublin Core’s dc:language element (if the Dublin Core succeeds). Perhaps people will borrow rss:enclosure from RSS 2.0, for lack of any other standard way to indicate an external non-XML resource.

I’d love to hear other suggestions of what might appear in a simple profile for XML data REST content.

REST design question #4: how much normalization?

Tuesday, February 22nd, 2005

[Update: why this has to do with REST] Here is the fourth in a series of REST design questions: how much should the XML data files returned by a REST web application be normalized into separate XML files? For example, if an application is returning information about the film Sixteen Candles, should it try to put most of the relevant information into a single XML file, like this?

<film>
  <title>Sixteen Candles</title>
  <director>John Hughes</director>
  <year>1984</year>
  <production-companies>
    <company>Channel Pictures</company>
    <company>Universal Pictures</company>
  </production-companies>
</film>

Or should it link to separate XML documents containing information about people, companies, and so on, like this?

<film xml:base="http://www.example.org/objects/014002.xml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sixteen Candles</title>
  <director xlink:href="487847.xml"/>
  <year>1984</year>
  <production-companies>
    <company xlink:href="559366.xml"/>
    <company xlink:href="039548.xml"/>
  </production-companies>
</film>

(Of course, you can take this a lot further, making the relationships themselves, like isDirectorOf, into separate XML files, but this is enough to give a good flavour.)

Presumably, the REST server is creating the XML information from a relational database that is normalized, so the regular arguments about maintainability, etc. are not an issue. Still, each example has its disadvantages:

  • In the first example, the client application cannot be certain that two separate records are referring to the same director or production company, or to a different one that happens to have the same name. It will also be hard for the server to handle a PUT request to update the (normalized) database.
  • In the second example, the client application will have to make a ridiculous number of GET requests to assemble enough information for even the most basic application, like a cast list: complete details of the cast, crew, and locations for even a single movie will likely involve retrieving hundreds or thousands of tiny XML files.

Would imitating HTML be the best compromise? HTML links (the a element) typically include both a reference to an external resource and a short, local description of the resource at the other end of the link (i.e. the blue, underlined text). There is no reason that XML data files in a REST application cannot do the same thing, combining the advantages of the normalized and unnormalized approaches, as in this example:

<film xml:base="http://www.example.org/objects/014002.xml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sixteen Candles</title>
  <director xlink:href="487847.xml">John Hughes</director>
  <year>1984</year>
  <production-companies>
    <company xlink:href="559366.xml">Channel Pictures</company>
    <company xlink:href="039548.xml">Universal Pictures</company>
  </production-companies>
</film>

Now, a simple REST client application does not need to retrieve extra data files simply to find the name of the director or production company, but it still knows where to look for more complete information. It can also use the link URLs as identifiers for disambiguating people, companies, and so on. The approach will also be familiar to web developers, the ones who will eventually decide whether to use REST for data retrieval.
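A simple client of this kind might look like the Python sketch below. It collects every labeled link as an (absolute URL, label) pair, using the URL as the identifier and the element text as the display name. The labeled_links function is hypothetical, and for brevity it only honours an xml:base on the root element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

SAMPLE = """\
<film xml:base="http://www.example.org/objects/014002.xml"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sixteen Candles</title>
  <director xlink:href="487847.xml">John Hughes</director>
  <year>1984</year>
  <production-companies>
    <company xlink:href="559366.xml">Channel Pictures</company>
    <company xlink:href="039548.xml">Universal Pictures</company>
  </production-companies>
</film>
"""


def labeled_links(root):
    """Return (absolute URL, local label) for each linking element.

    Simplification: only an xml:base on the root element is considered.
    """
    base = root.get(XML_BASE, "")
    return [(urljoin(base, el.get(XLINK_HREF)), (el.text or "").strip())
            for el in root.iter() if XLINK_HREF in el.attrib]


links = labeled_links(ET.fromstring(SAMPLE))
```

Two films that both list a director at http://www.example.org/objects/487847.xml can be matched on the URL alone, with no extra GET and no guessing from the name.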

Now, what about a REST application that supports not only GET but PUT? What should it do when someone tries to check in this document? I’d suggest that any information under an element with an xlink:href attribute should be considered non-canonical and ignored during the checkin — you don’t want to rename John Hughes on the basis of the description of one of his films — and that the label information inside the link be autogenerated at the next GET (presumably from the resource at http://www.example.org/objects/487847.xml).
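One way a server could apply that rule is to discard everything beneath a linking element before touching the database. The Python sketch below is only an assumption about how such a step might be written; strip_link_labels is a made-up name, and a real check-in would still need validation and the actual database update.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

SAMPLE = """\
<film xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sixteen Candles</title>
  <director xlink:href="487847.xml">Jon Hughs (typo by client)</director>
  <year>1984</year>
</film>
"""


def strip_link_labels(root):
    """Drop text and children under every element that has xlink:href.

    The link target stays; only the non-canonical local label is removed.
    """
    # Collect first so the tree is not modified while it is being walked.
    linked = [el for el in root.iter() if XLINK_HREF in el.attrib]
    for el in linked:
        el.text = None
        for child in list(el):
            el.remove(child)
    return root


root = strip_link_labels(ET.fromstring(SAMPLE))
director = root.find("director")
```

After the call, the misspelled label is gone but the link to 487847.xml and the film's own fields (title, year) are untouched, so the check-in cannot rename the director by accident.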

This particular design question comes from personal experience during the late 1990s — the project involved moving precisely this kind of information in very large quantities to eCommerce customers. In that case, PUT was not an issue, since the customers did not have write access to the provider’s database.

(Josh Sled quite reasonably asks what this question has to do specifically with REST. The main selling point of REST is linking resources together, so I believe that figuring out when to link and when to embed will be critical to making REST-based applications work. Josh also mentions RDF. The project I mentioned actually was trying to use RDF [first the 1.0 WD, then the REC]; unfortunately, RDF makes an example like my third one difficult, since in 1.0 at least, a property had to have either a link or content, but not both; you end up having to create a new, inline resource for every link, which is messy. I’m not too familiar with the newer RDF version, so I don’t know if they’ve fixed that by allowing labeled links.)

REST design question #3: meaning of a link

Friday, February 18th, 2005

This is the third in a series of REST design questions. The first design question asked about keeping track of location and identification information after you have downloaded an XML file; the second design question asked about discovering resources and dealing with long lists of data in a RESTful way.

The very heart of REST, both in its narrow original sense (everything must have a URL) and its broader popular sense (basic HTTP + XML as an alternative to Web Services), is linking. REST insists that any information you can retrieve must have a single, unique address that you can pass around, the same way that you can pass around a phone number or an e-mail address — those addresses make it possible to link resources (HTML pages or, in the future, XML data files) together into a web, so that either people or software agents can discover new pages by following links from existing ones.

Old-School Hypertext

But what does a link mean? That question matters a lot for anyone writing general-purpose REST software, such as search engines, data browsers, or database tools, that is not designed to work with only a single XML markup vocabulary. The pre-HTML Hypertext specialists believed that links could have many different meanings, and typically wanted to provide a way for the author to specify them; hiding in the shadows during the web revolution of the 1990s, the old school managed to keep the fire alive long enough to add the universally-ignored xlink:type attribute to XLink. Do we need xlink:type for generic XML data processing in a REST environment?

I don’t think we do.

In fact, if you take a look closely, linking to an external resource from an HTML document always means the same thing:

Here is a more complete version of what I’m talking about.

It is very hard to think of any exceptions. For example, consider these three links from an HTML document:

<p>During the <a href="http://en.wikipedia.org/wiki/Renaissance">Renaissance</a> ...</p>
<img alt="Illustration of Galileo" src="galileo.jpg"/>
<script src="validate-form.js"/>

In every case, the element containing the link attribute is a placeholder for something somewhere else. Obviously, they cause different browser behaviour — the picture will be inserted into the displayed document automatically, while the Wikipedia Renaissance entry will not — but in all three cases, the thing linked represents something more complete: the Wikipedia Renaissance article is more complete than the phrase “Renaissance”, the image galileo.jpg is more complete than the alternative text “Illustration of Galileo”, and the Javascript code is more complete than the script placeholder.

New-School XML

Exactly the same principle will likely apply to links in XML data files, like this example:

<person xml:base="http://www.example.org/people/e40957.xml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <name>Jane Smith</name>
  <date-of-birth>1970-10-11</date-of-birth>
  <employer xlink:href="http://www.example.org/companies/acme.xml">ACME Widgets, Inc.</employer>
  <country-of-birth xlink:href="http://www.example.org/countries/ca.xml">Canada</country-of-birth>
</person>

All of the information available for the person’s name is the string “Jane Smith”, and all of the information available for the date of birth is the string “1970-10-11”; however, there is more complete information about the employer at http://www.example.org/companies/acme.xml, and there is more complete information about the country of birth at http://www.example.org/countries/ca.xml.

It seems that unidirectional links like those used in the web always lead towards increasingly canonical information. If an XML element has a linking attribute, then, can we assume that the entire XML document subtree starting at that element represents a lesser version of the information available externally at the link target? If so, can we really gain much by adding xlink:role to the mix?

Snags

Is this a safe-enough assumption that we could use it with any RESTful XML data files, and perform some kinds of data processing without having to know about the specific XML vocabulary in use?

I can think of two counter-examples right away, and they both deserve some attention. First, there is one context where HTTP URLs frequently appear as attribute values in XML documents but do not refer to a more complete version of the information inside an element: XML Namespaces. Here’s an example:

<person xmlns="http://www.example.org/ns/">Jane Smith</person>

In this case, there may be no information available at all at the location http://www.example.org/ns/; if there is something there (like an RDDL file), it will most likely be information about the XML markup, not about the person Jane Smith. Of course, this URL does not appear as the value of an xlink:href attribute, so there is no ambiguity. More importantly, the use of URLs for Namespace identifiers (a choice which I supported) has caused an enormous amount of confusion among XML users, who expect the URL to point to something, and specifically to something more complete or authoritative. That very confusion is proof of how ingrained this use of linking is.

The second counter-example is the rise of the rel="nofollow" attribute in HTML links, partly as an attempt to counter spam in weblog comments and wiki sandboxes. If anything, this appears to vindicate the old-school hypertexters. They should be rushing into the street in disheveled clothing with a mad gleam in their eyes, shouting “Look, we were right! It took 15 years, but finally everyone sees that links do need semantic information attached!” and so on. But they’re not, probably because they’re smart enough to realize that this doesn’t, quite, change the primary meaning of a link. The rel="nofollow" attribute says that the author does not endorse the link target, but it still provides a more complete version of the information. For example, someone who strongly dislikes the U.S. Libertarian Party might want to point to their web page without improving their Google search ranking, and thus include something like this:

<p>In contrast to the misinformation coming from the <a href="http://www.lp.org/" rel="nofollow">Libertarian Party</a> ...</p>

The resource at http://www.lp.org/ is still a more complete version of the information in the XML element, even if it is information that the author does not particularly like.

Implications

If linking really can be this simple, then we will be able to do a lot with XML data and REST even if we do not agree on a common content encoding. That could be enormously valuable: the document web was an enormous success precisely because content-encoding was standardized on HTML, so that people could build things like authoring tools and search engines. If we can do a lot of the same thing with an XML document web without forcing everyone to squeeze their data into something like RDF or XTM, we might just be able to get enough people to play along to make it work.

REST design question #2: listing and discovering resources

Wednesday, February 16th, 2005

The second in my series of REST design questions is how to handle listing and paging, or, in fancier jargon, resource discovery. I prefer concrete examples, so I’ll start with one that I know is flawed and then try to find ways to fix it.

Let’s say that I have a large collection of XML data records with URLs like http://www.example.org/airports/cyow.xml and http://www.example.org/airports/cyyz.xml, and so on. Since they all share the same prefix, it would be reasonable to assume that performing an HTTP GET operation on that prefix (http://www.example.org/airports/) would return a list of links to all of the data records (though I acknowledge that URLs are opaque and no one should rely on that, etc. etc.):

<airport-listing xmlns:xlink="http://www.w3.org/1999/xlink" xml:base="http://www.example.org/airports/">
  <airport-ref xlink:href="cyow.xml"/>
  <airport-ref xlink:href="cyyz.xml"/>
  <airport-ref xlink:href="cyxu.xml"/>
  ...
</airport-listing>

This is a wonderfully RESTful example, since it shows how (say) a search-engine spider could eventually find and index every XML resource. However, anyone who’s ever worked on a large, production-grade system can see that there’s a huge scalability problem here (I’m leaving out other possible issues like privacy and security). For a listing of a few dozen resources, this is a great approach. For a listing of a few hundred, it’s manageable. A listing of a few thousand resources will start to consume serious bandwidth every time someone GETs it, and a listing of a few million resources is simply ridiculous.

HTML-based web applications designed for humans typically employ a combination of querying and paging to deal with discovering resources from a large collection. For example, I might start by specifying that I’m interested only in airports with instrument approaches within 500 nautical miles of Toronto; then the application will return a single page of results (say, the first 20 matches), with a link to let me see the next page if I’m interested.

How would this work for a REST-based data application? Clearly, we want to use GET rather than POST requests, since pure queries are side-effect free, so presumably, I’d end up adding some request parameters to limit the results:

http://www.example.org/airports/?ref-point=cyyz&radius=500nm&has-iap=yes

That’s certainly not the kind of pretty REST URL that we see in the examples, but it does look a lot like the ones used in Amazon’s REST web services, so perhaps I’m on the right track. Of course, there will have to be some way for systems to know what the available request parameters are. Now, perhaps, the result will look something like this (assuming 20 results to the page):

<airport-listing xmlns:xlink="http://www.w3.org/1999/xlink"
    xml:base="http://www.example.org/airports/?ref-point=cyyz&amp;radius=500nm&amp;has-iap=yes">
  <airport-ref xlink:href="cyow.xml"/>
  <airport-ref xlink:href="cyyz.xml"/>
  <airport-ref xlink:href="cyxu.xml"/>
  ...
  <next-page-link xlink:href="http://www.example.org/airports/?ref-point=cyyz&amp;radius=500nm&amp;has-iap=yes&amp;start=21"/>
</airport-listing>

As far as I understand, this is good REST, because the XML resource contains its own transition information (i.e. a link to the next page). However, this is pretty unbelievably ugly. Presumably, the same kind of paging could work on the entire collection when there are no query parameters, so that

http://www.example.org/airports/

or

http://www.example.org/airports/?start=1

would return the first 20 airport references, followed by a link to http://www.example.org/airports/?start=21, which will return the next 20 entries, and so on. The potential power of REST and XLink together is clear: it is still possible to start at a single URL with a simple crawler and discover all of the available resources automatically, and unlike WS-*, I did it without having to deal with extra, cumbersome specs like UDDI and WSDL. Still, this looks a bit like an ugly solution to me. I’ll look forward to hearing if anyone can come up with something more elegant.
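The "simple crawler" can be sketched in a few lines of Python. Everything here is an assumption for illustration: the in-memory PAGES table stands in for real HTTP GET requests, the page size is two rather than twenty to keep the sample short, and the element names simply follow the listing examples above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
BASE = "http://www.example.org/airports/"


def listing(refs, next_start=None):
    """Build one hypothetical listing page as an XML string."""
    root = ET.Element("airport-listing")
    for ref in refs:
        ET.SubElement(root, "airport-ref", {XLINK_HREF: ref})
    if next_start is not None:
        ET.SubElement(root, "next-page-link",
                      {XLINK_HREF: f"{BASE}?start={next_start}"})
    return ET.tostring(root, encoding="unicode")


# Stand-in for the server: URL -> response body.
PAGES = {
    BASE: listing(["cyow.xml", "cyyz.xml"], next_start=3),
    f"{BASE}?start=3": listing(["cyxu.xml"]),
}


def fetch(url):
    """Pretend HTTP GET; a real crawler would use an HTTP client here."""
    return PAGES[url]


def crawl(start_url):
    """Follow next-page-link elements and collect every resource URL."""
    found, seen, url = [], set(), start_url
    while url is not None and url not in seen:  # guard against link loops
        seen.add(url)
        root = ET.fromstring(fetch(url))
        for ref in root.iter("airport-ref"):
            found.append(urljoin(url, ref.get(XLINK_HREF)))
        nxt = root.find("next-page-link")
        url = urljoin(url, nxt.get(XLINK_HREF)) if nxt is not None else None
    return found


airports = crawl(BASE)
```

The crawler never needs to know how many pages exist or how the start parameter is computed; it only follows the transition link that each page supplies.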

The best Firefox extension

Tuesday, February 15th, 2005


The Firefox browser has a lot of well-loved extensions like AdBlock and ImageZoom (especially useful for looking at weather maps online), but my personal favourite is a little-known one called Show Anchors.

Anyone writing for the web, and especially a blogger, needs to link to web pages a lot. Often, the web pages contain anchors that would let us link to the exact spot we need rather than to the top of a long document, but unless you can grab them from a table of contents or you are willing to spend a while reading through View Source, those anchors are pretty hard to find. For example, here is a screenshot of Firefox viewing the W3C’s XML Recommendation (click on the thumbnail for full size):


The page is full of anchors, but you cannot find them. With the Show Anchors extension in Firefox, I simply right click on the browser window, select Show Anchors from the pop-up menu, and the display in Firefox changes (again, click on the thumbnail for full size):



Inside the Firefox window, clicking on one of the anchor icons copies a full URL, with fragment identifier, to the clipboard. It’s a real timesaver for writing weblog entries.

xml:lang is an accessibility issue

Tuesday, February 15th, 2005

Charl van Niekerk has an interesting posting on a topic that should have been more obvious to me: that the xml:lang attribute (and HTML lang) are critical for making online information accessible to the visually-impaired. Voice synthesizers that read documents aloud need to know what language they’re reading, and it wouldn’t take much effort for us to tell them.

Obviously, this is a less critical issue for data-oriented XML, but even then, XML data often contains large chunks of prose (like product descriptions) that are, eventually, intended for human consumption. I won’t promise to rush and fix all of this today in my existing XML and HTML, but I’m certainly going to try harder in the future.

REST design question #1: identification

Monday, February 14th, 2005

My first REST design question is about the fact that RESTafarians seem to consider identification and location to be the same thing, and following from that, the question of how to make identification persistent in XML resources. For example, assume that http://www.example.org/airports/ca/cyow.xml is both the unique identifier of an XML data object and the location of that object on the web. That’s the whole point of REST, really. RESTafarians don’t like interfaces where identifiers are hidden inside XML objects returned from POST requests to unrelated URLs, for example (in fact, they get angry in quite an amusing way).

GET and PUT

So, here’s a simple use case. Let’s say that I download the XML data file at http://www.example.org/airports/ca/cyow.xml and it looks like this simple example:

<airport>
 <icao>CYOW</icao>
 <name>Macdonald-Cartier International Airport</name>
 <political>
  <municipality>Ottawa</municipality>
  <region>ON</region>
  <country>CA</country>
 </political>
 <geodetic>
  <latitude-deg>45.322</latitude-deg>
  <longitude-deg>-75.669167</longitude-deg>
  <elevation-msl-m>114</elevation-msl-m>
 </geodetic>
</airport>

I then copy it onto a USB memory stick, bring it home from work, copy it onto my notebook computer, and work on it while offline during a business flight. The file no longer has any direct connection with its URL: it has gone through other transfers since the HTTP GET request I used to download it. How do I know what I’m working on or where I should PUT it when I’m done?

If this information has to be kept out of line, then some of REST’s advantages are evaporating, because now I have to start using custom-designed clients again instead of simply piggybacking on existing web technologies. As an identifier, the URL is clearly part of the resource’s state, and belongs in the XML data file; as a location, however, it is superfluous information and belongs only in the protocol (HTTP) level.

Where does the document identifier go?

Let’s assume that I get over my squeamishness and decide that the URL is a proper identifier and belongs in the XML representation. Now, how do I do that in a fairly generic way? xml:id is out of the question, since it’s designed only to hold an XML name for identifying part of a document, not a URL to identify an entire document. I could use (or abuse) xml:base, like this:

<airport xml:base="http://www.example.org/airports/ca/cyow.xml">
 ...
</airport>

I’m not certain, though, how XLink processors would deal with that. Would the relative URL “cyyz.xml” end up being resolved to http://www.example.org/airports/ca/cyyz.xml or http://www.example.org/airports/ca/cyow.xmlcyyz.xml? There’s also the possibility that some highly-cooked APIs might predigest the xml:base attribute so that application code never sees it. Do the XML standards people believe this kind of an xml:base usage is legit?
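On the narrower question of generic URL resolution, the standard algorithm replaces the last path segment of the base rather than appending to it. Python's urllib.parse.urljoin implements that algorithm, so a quick check is possible; whether any particular XLink processor or API honours xml:base in the same way is a separate matter that this sketch does not settle.

```python
from urllib.parse import urljoin

base = "http://www.example.org/airports/ca/cyow.xml"

# A relative reference replaces the final path segment of the base.
sibling = urljoin(base, "cyyz.xml")

# A bare fragment keeps the whole base and adds the fragment.
fragment = urljoin(base, "#geodetic")

# Dot segments climb the path as they would in a file system.
cousin = urljoin(base, "../us/kjfk.xml")
```

Here sibling comes out as http://www.example.org/airports/ca/cyyz.xml, the first of the two candidates, so an xml:base holding the full document URL still resolves sibling links sensibly. The "#geodetic" fragment and the kjfk.xml path are made up for the example.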

If xml:id is unusable, and xml:base is problematic, it looks like there might be no standard way to identify RESTful XML documents, and each XML document type will need its own ad-hoc solution. Any suggestions? Does the world need one more xml:* attribute (I hope not)?

I’d be interested in hearing how REST developers have dealt with identifier persistence and round-tripping when the identifier is the URL.

REST design questions

Monday, February 14th, 2005

[Update: fifth and final question added] I’ve been thinking a bit about REST recently while working on a new data-oriented application. REST in its now-broadened meaning is easy to explain: pieces of data (likely XML-encoded) sit out there on the web, and you manipulate them using HTTP’s GET, PUT, and DELETE methods (practically CRUD, except that the Create and Update parts are combined into PUT). Try explaining SOAP, much less the essence of the whole WS-* family, in one easy sentence like that, and you’ll see the difference.
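To make the one-sentence description concrete, here is a rough Java sketch of the three operations against a single resource. It assumes the java.net.http API from much later Java releases (11 and up), and the URL is a placeholder, not a real service. The sketch only builds the requests; actually sending them would need an HttpClient and a server that accepts PUT and DELETE.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RestVerbsSketch {
    // Hypothetical resource URL; example.org does not host this file.
    static final URI RESOURCE = URI.create("http://www.example.org/airports/ca/cyow.xml");

    /** Read: fetch the current representation of the resource. */
    static HttpRequest read() {
        return HttpRequest.newBuilder(RESOURCE).GET().build();
    }

    /** Create or update: PUT replaces whatever is stored at the URL with the body. */
    static HttpRequest createOrUpdate(String xml) {
        return HttpRequest.newBuilder(RESOURCE)
                .header("Content-Type", "application/xml")
                .PUT(HttpRequest.BodyPublishers.ofString(xml))
                .build();
    }

    /** Delete: remove the resource at the URL. */
    static HttpRequest delete() {
        return HttpRequest.newBuilder(RESOURCE).DELETE().build();
    }

    public static void main(String[] args) {
        // Print the method and URL of each request without touching the network.
        for (HttpRequest r : new HttpRequest[] { read(), createOrUpdate("<airport/>"), delete() }) {
            System.out.println(r.method() + " " + r.uri());
        }
    }
}
```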

This very simplicity should raise some alarm bells, though. RDF also has an apparently simple data model, but for RDF 1.0, at least, the model turned out to be painfully incomplete, as I found out when I implemented my RDF parsing library. Is REST hiding any of the same traps? RESTafarians point out that REST is the basis of the Web’s success, but that’s really only the GET part (and its cousin, POST). Despite WebDAV, we have very little experience using PUT and DELETE even for regular web pages, much less to maintain a data repository. Even the much-touted RESTful web services from Amazon and eBay are GET-only (and POST, in eBay’s case); in fact, many, if not most, firewalls come preconfigured to block PUT and DELETE, since web admins see them mainly as security holes.

My gut feeling is that REST is, in fact, more manageable than XML-RPC or WS-* for XML on the Web, but that we have a lot of issues we’ll need to work out first. Data management is never really simple, and while WS-* makes it harder than it has to be, even the simplest REST model cannot make it trivial. I’m going to post some of my own questions about REST design from time to time in this weblog, as I think of them, and I’ll look forward to hearing from people who have already dealt with or at least thought about these problems on their own.

Here are my questions so far:

Open Web, Closed Databases?

Monday, February 14th, 2005

Web site developers seem to be getting open specifications: more and more, I’m seeing sites developed for specifications like (X)HTML, CSS2, DOM, etc., not sites developed for applications like MSIE or Firefox or Opera; I’m seeing Java-based web apps that work with any J2EE-enabled web server, instead of apps that work only with Tomcat or WebSphere or WebLogic; and so on.

After all this, then, I’m surprised to see how many open source web apps specifically require MySQL rather than just “a SQL database.” MySQL is a fine database, of course, but here we have an open specification, SQL, that’s been around far longer than most of the web specs, and many open source developers are choosing to lock themselves into a single database anyway.

I wonder what gives. I don’t have a lot of experience with PHP, which is the platform for many simpler web apps (including WordPress, which drives this weblog, though it offers an alternative) — is there no generic SQL database interface for PHP, or do the developers just not care? Are there serious performance issues using generic database interfaces? Or are my observations not representative, and in fact most open source web app developers do avoid locking themselves in to MySQL?

Rumours of xml:id trouble in the W3C

Friday, February 11th, 2005

[Updated: see below] Norman Walsh has just posted an unusual essay. The gist of it seems to be that the W3C (at some level) has decided to modify the xml:id specification (released only days ago as a Candidate Recommendation, as I mentioned here) — there is some other specification (not named) that has a bug, most likely an incorrect closed enumeration of all the possible attributes in the XML namespace. At some level, the W3C has decided that the attribute will be renamed to the unqualified xmlid to avoid upsetting the people who messed up the other spec.

Norm sounds mad, and I don’t blame him. I remember when I was on the original XML working group and we were ordered from above to rewrite the XML Namespaces spec substantially for extremely questionable reasons (mainly the ability to embed XML inside non-XML HTML documents for v3 browsers — seriously).

[Update: Norm has revised the essay, adding enough extra information to let us figure out the problem -- it has to do with the interaction between XML Canonicalization (C14N) and xml: attributes, where C14N mistakenly assumes that all xml: attributes should be automatically inherited. Here's the official request to deal with the issue.

I had already mentioned the incompatibility with C14N in my first posting on xml:id, then forgot it completely when reading Norm's essay. So far, this is just a dispute, not a done decision.]

Hub URLs and feudalism in the blogsphere

Friday, February 11th, 2005

Web pages, and especially weblogs, include apparently unnecessary links all the time. For example, is there really any need to link to Microsoft every time I mention the company’s name? Is anyone reading this posting going to follow the link (and if so, would that person have had trouble finding the site otherwise)?

Hub and Spoke

The best term I can think of to describe these links is hub URLs. They’re very much like airport hubs — connections from many smaller places feed into them, and often the only way to get from one small place to another is by passing through the hub as an intermediate point: for example, if I link to Microsoft and you link to Microsoft, someone can trace a route from my web page to your web page by changing planes, so to speak, at the Microsoft hub. One way to make the trip is to put http://www.microsoft.com/ into Technorati or a similar search engine that can supply ongoing results in an RSS or Atom feed, then read the postings that congregate around this hub URL in the blogsphere. The weblog postings are not linking to Microsoft so that you can find Microsoft; they’re linking to Microsoft so that you can find them. The nature of a hub URL is that the spoke web sites need it more than it needs any one of the spokes.

To take a less hackneyed example, here is a Technorati RSS feed of all weblog postings that link to Roy Fielding’s famous dissertation on web architecture. Granted, that’s not a very active hub URL, but still, all of the postings that link there form a community of interest, and a RESTafarian will almost certainly want to subscribe to such a feed. I expect that, more and more, the blogsphere will start grouping itself around hub URLs at least as much as it groups itself around individual personalities today.

Travel agencies

So far, so good. Search engines, the travel agencies of the web and blogsphere, already know how to take advantage of these hub URLs, as in the Technorati example I just cited above. Unofficial rumour has it that Google, for example (there I go again with a hub URL), makes great use of hub URLs for determining the relevance of search results. In fact, the whole push towards tags and folksonomies by sites like Technorati, Flickr, and del.icio.us is really an attempt to set up their own hub URLs.

In Technorati’s case, the travel agent wants not only to plan trips but to own the airport hub itself: that’s why they’re encouraging bloggers to link to the tags section of their site, making URLs like http://www.technorati.com/tags/web into hub URLs that are entirely under their control; it does not seem likely that their competitors will go along with that idea, though.

Castles and Boroughs

One problem is that the most popular URLs might end up becoming not only hubs but castles. Castles are cute tourist attractions today, associated mainly with pseudo-medieval romantic kitsch like knights and tournaments, but in the Middle Ages they were often instruments of oppression. While free landowners may originally have congregated around them for protection, they often lost their freedom (either by choice or coercion) and became feudal serfs, little more than the property of the powerful thugs who controlled the castles. If we start building our weblogs and sites in clusters around powerful hub URLs the way that free peasants built their huts around castles, are we risking the same fate?

Castles don’t show up automatically whenever people congregate together, of course. The alternative is the borough. Most of us in the developed world crowd together in suburbs, towns or cities, the ideological descendants of the boroughs, so that we can share services like water, electricity, roads, and shopping. While we have to make some compromises to live in close proximity, we do not have to give up fundamental freedoms the way that serfs around a castle did. The reason for that is that most economically-advanced countries have cities that are governed democratically rather than by a single strongman like a feudal lord; even in the Western European Middle Ages, boroughs enjoyed many freedoms and privileges, and were at least partly self-governing. So, getting back to the blogsphere, the question is this: do we want our hub URLs to be more like castles or boroughs?

This is an important question, because it is not farfetched to suggest that the owners of the most popular hub URLs could eventually start limiting the rights of the sites or blog entries linking to them. The entertainment industry has already had great success shutting down BitTorrent trackers, which simply link to files rather than actually hosting them; several courts have issued rulings against deep linking, like this one in Munich in 2002. Even when specific rulings are later overturned, it should be clear that linking is not off limits for legal action, and it is not impossible to imagine a future where someone has to agree to restrictive terms of service or even pay for the right to link to a popular hub URL like a Technorati tag or the Microsoft web site.

Wikipedia-boro

I have already suggested that Wikipedia would be a good source of subject codes, and in essence, that means using Wikipedia URLs as hub URLs. Wikipedia is not the only choice, of course, but it seems to be a particularly good one for a few reasons:

  1. it is a collaborative site where anyone can add new potential hub URLs and modify the information in the pages they point to
  2. our rights to use it now and in the future are guaranteed by the GNU Free Documentation License (though to be strictly pedantic, that applies to the content rather than the URL itself)
  3. linking to the Wikipedia is more likely to give you a fair description of a subject than linking to the subject’s own website (think of the difference between a politician’s own web site and the Wikipedia article on the politician, and you’ll see what I mean)

If enough people start linking to Wikipedia articles in their weblog postings, topic-based RSS or Atom feeds will become very easy: for example, Feedster will happily give you an RSS feed of weblog postings linking to the current U.S. President Bush or a feed with postings that explicitly link to the country Canada. Presumably, these articles are treating these topics as major subjects, rather than just mentioning them in passing, so the contents of the search feeds should be highly relevant (imagine how many false hits you’d get from mailing addresses, etc., just searching for the word “Canada”).

L10N out of control

Wednesday, February 9th, 2005

[update: a mitigating factor] Localization (L10N) is a good thing in general: people like to see the languages, punctuation, and systems of measure that they’re used to. So, hats off to Google’s new beta map service for putting most of the street names in Ottawa’s west end in French.

The only trouble is that the street names are actually English: we have Carling St, Holland Ave, and Island Park Drive, not Rue Carling, Avenue Parkdale, or Promenade Island Park.

What went wrong? My guess is that Google (or their data provider) uses a vector map for L10N, either in real time or (more likely) pregenerated. Ottawa is right on the Quebec border, and the streets might have been misidentified as located in Quebec because the map doesn’t have enough resolution to follow the bends in the provincial border. [Update: to be fair, I should mention that some streets in the west end of Ottawa do have bilingual signs that say both rue and St., for example -- since we're the capital of a bilingual country, the city tries to set an example.]

Over all, Google’s mapping service is very impressive, especially for a beta, and this particular glitch is more funny than disruptive. I’m grateful that they even included Canadian cities in the first release.

xml:id

Tuesday, February 8th, 2005

Anne van Kesteren’s is the first report to reach me that the W3C’s xml:id spec has just moved up the food chain to Candidate Recommendation. I’m usually one of the first people to whine about too many XML-related specs, but I think this is a good one, despite a few minor problems like an incompatibility with XML Canonicalization.

Why does this matter? Any use of XML over the web that requires DTD or schema processing is broken because of all the extra security and availability risks involved in processing external files, especially when they’re hosted at other sites. The xml:id spec gives a quick and dirty way of identifying parts of an XML document without requiring a schema, DTD, or even a namespace declaration (since the xml: prefix is predeclared for XML documents). Basically, you just use something like this inside your XML document:

<employee xml:id="dmeg123">
 <name>David Megginson</name>
 <role>Housekeeping</role>
</employee>

and you’re done. Other XML documents can refer to part of yours using a fragment identifier, as in http://www.example.org/employees.xml#dmeg123, and that’s that — no schemas are harmed in the making of this link. I don’t know if XML data on the web ever will take off, but this small spec is a critical step in the right direction. Congrats to the editors and the working group for pushing it through this far.
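As a rough illustration of how a consumer might follow such a link without any schema or DTD, here is a Java sketch that scans a parsed DOM for a matching xml:id value. The wrapping employees element, the class name, and the method names are assumptions made for the example. A real xml:id processor would also check that the value is a legal name and unique in the document, which this linear scan does not.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlIdLookup {
    /** Parse an XML string into a DOM tree; no DTD or schema is involved. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Return the first element whose xml:id attribute equals id, or null if there is none. */
    static Element findById(Document doc, String id) {
        NodeList all = doc.getElementsByTagName("*"); // every element, in document order
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (id.equals(e.getAttribute("xml:id"))) {
                return e;
            }
        }
        return null;
    }

    /** Resolve the fragment of a URL such as .../employees.xml#dmeg123 in an already-fetched document. */
    static Element resolveFragment(Document doc, String url) {
        String fragment = URI.create(url).getFragment();
        return fragment == null ? null : findById(doc, fragment);
    }

    public static void main(String[] args) {
        Document doc = parse("<employees><employee xml:id=\"dmeg123\">"
                + "<name>David Megginson</name><role>Housekeeping</role></employee></employees>");
        Element hit = resolveFragment(doc, "http://www.example.org/employees.xml#dmeg123");
        System.out.println(hit.getTagName());
    }
}
```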

If only we could make everything in XML this simple.

The complexity of XML parsing APIs

Tuesday, February 8th, 2005

Dare Obasanjo recently posted a message to the xml-dev mailing list as part of the ancient and venerable binary XML permathread (just a bit down the list from attributes vs. elements, DOM vs. SAX, and why use CDATA?). His message included the following:

I don’t understand this obsession with SAX and DOM. As APIs go they both suck[0,1]. Why would anyone come up with a simplified binary format then decide to cruft it up by layering a crufty XML API on it is beyond me.

[0] http://www.megginson.com/blogs/quoderat/archives/2005/01/31/sax-the-bad-the-good-and-the-controversial/

[1] http://www.artima.com/intv/dom.html

I suppose that I should rush to SAX’s defense. I can at least point to my related posting about SAX’s good points, but to be fair, I have to admit that Dare is absolutely right: building complex applications that use SAX and DOM is very difficult and usually results in messy, hard-to-maintain code.

The problem is that I have not yet been able to find an XML API that doesn’t, um, suck. So-called simplified APIs like StAX or JDOM always look easier with the simple examples used in introductions and tutorials, but as soon as you try to use them in a real-world application, their relatively minor advantages disappear in the noise of trying to deal with the complexity of XML structure. For example, late last week I had decided to use StAX instead of SAX for a library I was writing, since it was getting very hard to manage context and flow control in a push parsing environment and my SAX handler had become (predictably) long and messy. After an hour I realized that my StAX handler had become even longer and harder to read than the original SAX-based code, even though StAX lets me use the Java runtime stack to manage context instead of forcing me to do context management on my own. Oh well. StAX looked so much easier in Elliotte Rusty Harold’s excellent tutorial, but as soon as I moved away from toy examples to a real XML data format, everything fell apart.
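For readers who have not used a pull parser, here is a small Java sketch of what “using the runtime stack to manage context” means with StAX: each nested element is handled by a recursive call, and the element path travels down as a method argument instead of living in a hand-maintained stack. The class and method names are invented for illustration, and this is precisely the kind of toy example that looks cleaner than real-world code tends to.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxContextSketch {
    /** Called with the reader on a START_ELEMENT; returns after the matching END_ELEMENT. */
    static void readElement(XMLStreamReader r, String parentPath, List<String> out)
            throws XMLStreamException {
        String path = parentPath + "/" + r.getLocalName(); // context lives in the call stack
        StringBuilder text = new StringBuilder();
        boolean hasChildren = false;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                hasChildren = true;
                readElement(r, path, out); // recursion replaces an explicit context stack
            } else if (event == XMLStreamConstants.CHARACTERS) {
                text.append(r.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (!hasChildren) {
                    out.add(path + "=" + text.toString().trim());
                }
                return;
            }
        }
    }

    /** Collect "path=text" for every element that has no child elements. */
    static List<String> leaves(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            List<String> out = new ArrayList<>();
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT) {
                    readElement(r, "", out);
                }
            }
            return out;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(leaves(
                "<employee><name>David Megginson</name><role>Housekeeping</role></employee>"));
    }
}
```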

My old SGMLSpl library was also hard to use, so we have a long history of awkward APIs in the markup world. Only if you can restrict the kind of XML you’re dealing with somehow — say, by banning mixed content or even using a data metaformat like RDF or XTM (more on these in a later posting) — can the user APIs get a little simpler, because the library can do some preprocessing for you and give you a predigested view of the information.

Blame Larry Wall

Friday, February 4th, 2005

Late yesterday I was working on a mind-numbingly simple XML data library in Java for use with a larger project. I spent about an hour on the first iteration, which could read and write through an event interface and/or into a data tree but used only simple names. After supper, I came back and spent another hour writing a beautifully elegant XMLName class and refactoring the rest of the code to support namespace-qualified names. The class had getters and setters for the namespace URI and local name, equals and hashCode methods, and, at one point, support for the Comparable and Serializable interfaces. It went even further: to support the flyweight design pattern, it was declared final and had a weak-reference lookup table for internalization, like the Java String class. On top of that, it had a static intern method that took two arguments, so that you could create an internalized XMLName directly without having to construct a non-internalized version first:

XMLName name = XMLName.intern("http://www.w3.org/1999/xlink", "href");

In other words, it was pretty cool — fast, memory-efficient, and properly designed. I’m sure that many of the people reading this posting have designed similar classes for XML work and taken similar pride in them. Unfortunately, before I went to bed, I realized I’d have to delete the class when I got up in the morning.
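For anyone who has not built one, a flyweight name class with a weak-reference pool might look roughly like the Java sketch below. This is a guess at the general shape, not the deleted class itself: the name XMLNameSketch, the private constructor, and the choice of WeakHashMap are all assumptions, and for brevity the sketch builds a throwaway candidate to use as the lookup key.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.Objects;
import java.util.WeakHashMap;

final class XMLNameSketch {
    // Weak keys and weak values: an entry can be collected once no caller holds the name.
    private static final Map<XMLNameSketch, WeakReference<XMLNameSketch>> POOL = new WeakHashMap<>();

    private final String namespaceURI; // null means "no namespace"
    private final String localName;

    private XMLNameSketch(String namespaceURI, String localName) {
        this.namespaceURI = namespaceURI;
        this.localName = Objects.requireNonNull(localName);
    }

    /** Return the single shared instance for this (namespace URI, local name) pair. */
    static synchronized XMLNameSketch intern(String namespaceURI, String localName) {
        XMLNameSketch candidate = new XMLNameSketch(namespaceURI, localName);
        WeakReference<XMLNameSketch> ref = POOL.get(candidate);
        XMLNameSketch existing = (ref == null) ? null : ref.get();
        if (existing != null) {
            return existing;
        }
        POOL.put(candidate, new WeakReference<>(candidate));
        return candidate;
    }

    String getNamespaceURI() { return namespaceURI; }

    String getLocalName() { return localName; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof XMLNameSketch)) return false;
        XMLNameSketch other = (XMLNameSketch) o;
        return Objects.equals(namespaceURI, other.namespaceURI) && localName.equals(other.localName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespaceURI, localName);
    }
}
```

Because equal names share one instance while a caller holds them, comparisons can use == and repeated names cost no extra memory, which is the whole appeal of the pattern.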

Why? I blame Larry Wall for all my grief, because it was his voice that started playing in my head, saying “easy things should be easy, and hard things should be possible.”

I messed up because I was focussing on the harder part of the problem. For simple XML configuration files, most people won’t be using namespaces most of the time, so forcing them to write

branch.setName(XMLName.intern(null, "foo"))

instead of

branch.setName("foo")

is a bad idea. Of course, I could hide that behind the scenes by adding extra method calls, say, setNameString and getNameString, but then I end up cluttering up my code (harder to learn, more bugs, trickier maintenance, etc.), again, just in an attempt to make the hard case easier.

The right solution for this particular library is one that James Clark suggested back in 1998 or 1999 when we were first trying to figure out how to get namespace support into SAX, and one that I sometimes wish we had taken up (though it’s not one of my biggest regrets): represent any XML name as a single string, with the namespace URI and the local name merged together. James preferred surrounding the namespace URI in braces, like this: “{http://www.w3.org/1999/xlink}href”; another option is to separate the two with a space, like this: “http://www.w3.org/1999/xlink href”. Of course, any library that does this should provide helper functions for splitting the string into its two parts or recombining them.

So, while I’m still channelling Larry’s voice, let’s see how well this solution fits. First, the easy case:

String name = branch.getName();
branch.setName("foo");

OK, looks good: the easy thing is easy. Now, the hard case:

String name = branch.getName();
String[] parts = Utils.splitName(name);
branch.setName("{http://www.example.org/ns}foo");

The hard thing is not easy, but it’s possible. Perhaps Larry’s voice will leave my head now, and I can get on with life and coding, in that order.
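The post does not show the helper behind Utils.splitName, so here is one possible Java sketch of the split and recombine functions for the brace notation. The class name NameUtils, the choice to return the namespace URI first, and the use of null for “no namespace” are assumptions for illustration only.

```java
class NameUtils {
    /**
     * Split "{uri}local" into { uri, local }.
     * A plain "local" with no braces yields { null, local }.
     */
    static String[] splitName(String name) {
        if (name.startsWith("{")) {
            int close = name.indexOf('}');
            if (close < 0) {
                throw new IllegalArgumentException("Unclosed '{' in name: " + name);
            }
            return new String[] { name.substring(1, close), name.substring(close + 1) };
        }
        return new String[] { null, name };
    }

    /** Recombine a namespace URI and local name; a null or empty URI gives the bare local name. */
    static String joinName(String namespaceURI, String localName) {
        if (namespaceURI == null || namespaceURI.isEmpty()) {
            return localName;
        }
        return "{" + namespaceURI + "}" + localName;
    }

    public static void main(String[] args) {
        String[] parts = splitName("{http://www.w3.org/1999/xlink}href");
        System.out.println(parts[0] + " | " + parts[1]);
        System.out.println(joinName(parts[0], parts[1]));
    }
}
```

The braces work as delimiters because neither character is legal in an XML name, so the closing brace can only mark the end of the URI part.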

Perl XML::Writer has a good home

Thursday, February 3rd, 2005

I just stumbled on this posting, and was happy to see that the Perl version of my XML writer (a library for creating XML) has found a good home. I originally wrote the XML writer in both Java and Perl versions, but the Perl version was always the neglected sibling: I just don’t use Perl that much any more, and wasn’t motivated to fix bugs, add features, etc.

Over the years, several people offered to take over maintenance of the Perl branch, but usually nothing came of it, and I lost track of who, if anyone, was supposed to be managing it. I recently revived the XML-Writer Sourceforge project and have been doing some maintenance on the Java branch, but again, hadn’t looked at the Perl.

So I’ll do some work on the Java, but will leave the Perl in better hands. This is a small but nice example of how open source is supposed to work: the people who care the most are the ones who do the work, and when the original maintainer loses interest, others are ready to step in.

SAX: biggest satisfactions

Wednesday, February 2nd, 2005

Recently, I mentioned my biggest regrets about SAX. When we were building SAX, however, there were an awful lot of things that went right. Here are the three things that I’m happiest about:

SAX was useful right from the start

Not just useful, in fact, but more useful than any alternative at the time. When I wrote the first draft of SAX over Christmas 1997 and put it up on the xml-dev mailing list for discussion and review in January 1998, the package included not only an interface definition but driver/adapters for all four existing Java XML parsers: James Clark’s XP, Tim Bray’s Lark, Microsoft’s MSXML (I don’t think a Java version is still available), and my own AElfred (now maintained by others). That meant that right away, a Java developer would be able to write code that worked with any existing XML parser.

This was an important point because I was afraid that the big computer companies (IBM and Oracle were also working on parsers) were going to try to lock developers into their platforms through proprietary parser interfaces. XML is an open format, but if all your code and all the libraries available to you work only with (say) IBM’s or Microsoft’s parser interface, then you haven’t gained much over using a proprietary format.

Another advantage, that I hadn’t anticipated, was that people started developing large-scale projects with SAX right away, so they shook out bugs and design problems very quickly. Running code is always a good thing, but running code that actually makes developers’ lives easier trumps anything else.

SAX is efficient

There are so many things that we could have done to kill SAX’s efficiency: we could have returned strings for character data instead of arrays (which can be indexed directly into the parser’s buffer); we could have returned elaborate objects for events, managed from some kind of pool; we could have managed a context stack for the user, whether she needed it or not; but we did none of those things. I was tempted, sometimes, but the other volunteers in the project quickly slapped me back into line.

The rationale was simple: it is easy to build all of those things on top of SAX if you need them (and, in fact, Michael Kay’s SAXON started life as a friendly SAX helper library, before it evolved into an XSLT framework), but there is no way to remove them if you don’t need them. As a result, SAX concentrated on standardizing the way that parsers deliver information rather than providing a friendly user experience; once that was standardized, it would be easy to build layers on top that would work with any parser. In short, the motto was do no harm rather than make it fun and simple; it turned out to be a perfect example of worse is better.
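As an illustration of building the conveniences on top rather than baking them in, here is a small Java handler that adds its own context stack and text buffer over the raw SAX callbacks. It is a sketch with invented names (ContextStackHandler, collect), uses the JDK’s default non-namespace-aware parser, and is not taken from any existing helper library.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ContextStackHandler extends DefaultHandler {
    private final Deque<String> context = new ArrayDeque<>(); // the stack SAX does not keep for us
    private final StringBuilder text = new StringBuilder();
    private boolean sawChild;
    final List<String> leaves = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        context.addLast(qName);
        text.setLength(0);
        sawChild = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX passes a slice of an array; only code that wants a String pays for the copy.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!sawChild) {
            leaves.add("/" + String.join("/", context) + "=" + text.toString().trim());
        }
        context.removeLast();
        sawChild = true; // the parent now knows it had at least one child element
        text.setLength(0);
    }

    /** Collect "path=text" for every element that has no child elements. */
    static List<String> collect(String xml) {
        try {
            ContextStackHandler handler = new ContextStackHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.leaves;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect(
                "<employee><name>David Megginson</name><role>Housekeeping</role></employee>"));
    }
}
```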

I had assumed that just about everyone would work through those higher-level libraries, but in the end (to my surprise), lots of developers learned to love the clumsy, low-level SAX interfaces in all their ugly glory. I myself have messed around with writing higher-level libraries on top of SAX, only to go back to the raw ContentHandler and its friends every time. For some reason, hard-core XML developers like to stay close to the metal, no matter how many friendly high-level tools people offer them.

SAX supports filter chains

SAX filter chains may seem obvious now, but I doubt I would ever have been able to think them up. I cannot remember who first suggested using SAX handlers in chains, like a Unix pipeline — perhaps the idea just evolved gradually as a kind-of group think — but it was well established by SAX2 and officially supported by a dedicated interface. We don’t support filters perfectly (error handling is a bit kludgy), but people make beautifully simple yet powerful systems using them.
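A short Java sketch may help show what a chain looks like in practice, using the XMLFilterImpl helper class that ships with SAX2. The two filters here (one lowercases element names, one drops a named element and everything inside it) are invented for the example, as are the class and method names. The sketch uses qualified names and the JDK’s default non-namespace-aware parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class FilterChainSketch {
    /** Stage 1: pass every event along, lowercasing element names on the way. */
    static class LowercaseNames extends XMLFilterImpl {
        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            super.startElement(uri, localName, qName.toLowerCase(Locale.ROOT), atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            super.endElement(uri, localName, qName.toLowerCase(Locale.ROOT));
        }
    }

    /** Stage 2: swallow one named element and all of its content. */
    static class DropElement extends XMLFilterImpl {
        private final String name;
        private int depth = 0; // greater than 0 while inside a dropped element

        DropElement(String name) { this.name = name; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (depth > 0 || qName.equals(name)) { depth++; return; }
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (depth > 0) { depth--; return; }
            super.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (depth == 0) super.characters(ch, start, length);
        }
    }

    /** parser -> LowercaseNames -> DropElement("role") -> collector, like a Unix pipeline. */
    static List<String> elementNames(String xml) {
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            LowercaseNames lower = new LowercaseNames();
            lower.setParent(parser);
            DropElement drop = new DropElement("role");
            drop.setParent(lower);
            List<String> seen = new ArrayList<>();
            drop.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    seen.add(qName);
                }
            });
            drop.parse(new InputSource(new StringReader(xml)));
            return seen;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Each stage knows only about the events it receives and the handler it forwards to, so stages can be reordered or reused without touching the parser or the final handler.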

I don’t think that there will ever be substantial changes to SAX. Now that I’ve resumed maintaining it, I’ll try to fix bugs and keep it up to date with any new XML versions, but otherwise, it is what it is. Perhaps something newer, like StAX or some other pull interface will eventually displace SAX, and that would be fine too. For now, though, it is an essential part of the XML infrastructure, used at tens or hundreds of thousands of sites, and the best thing I can do is keep it stable and make as few changes as possible.

Wikipedia URLs as blog subject codes

Tuesday, February 1st, 2005

[Updated] Over in my aviation weblog, I find myself more and more linking to Wikipedia whenever I’m discussing a concept, person, place, or anything else that doesn’t have its own, canonical home page. If, as I suspect, lots of other bloggers are doing the same, then links to Wikipedia articles may soon be the blogsphere’s answer to subject codes.

News wire services like Reuters or Dow Jones put a lot of time and money into maintaining long lists of subject codes to attach to their news products. Unlike the simple categories used in blogs, subject codes tell you not just that an article is about (say) computer technology, but that it is about specific companies, industries, people, places, and concepts. News customers use the codes to classify stories automatically, routing them to the appropriate editorial sections, displaying them on trading screens, sorting them into categories on web sites, or using them to improve searches. The providers are constantly sending out updated lists, keeping their customers’ technical departments very busy.

Should weblogs be using some kind of subject code (beyond categories)? Some areas already have standard identifiers that we could use, such as ICAO codes for airports, UPCs for retail products, ISBNs for books, CUSIPs for financial instruments, or ISO codes for countries, languages, and currencies. However, each of those requires some surrounding context: you need not only the code, but some indication that it refers to a currency or an airport. They’re also managed by central authorities, making them less attractive to the weblog community.

Enter Wikipedia. If I’m posting about Washington the U.S. state, I can link to the Wikipedia article about the state; if I’m posting about Washington the U.S. president, I can link to the article about the president; if I’m posting about Washington the U.S. capital, I can link to the article about the city; and if I’m using the word Washington by metonymy to refer to the U.S. government, I can link to the article about the government.

Bingo: subject codes, just like the big newswires use, only a lot more useful and totally open. I can link to abstract subjects like love or communism or to time periods like the middle ages just as easily as I can link to concrete people, places, or things; if there’s not already a Wikipedia article on my subject, I can always start a stub. If people keep linking to Wikipedia, search engines like Technorati and aggregators like Bloglines might start taking advantage of those links to do some automatic categorization, right down to offering links to other postings on the same subject (“Click here for other postings about Open Source”). Once people know the search engines are doing that, they’ll be bound to link to Wikipedia even more than they already are, creating a virtuous circle where both Wikipedia and the blogsphere become more valuable.
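The automatic categorization step could start out as simply as the Java sketch below, which pulls the distinct Wikipedia article URLs out of a posting’s HTML and treats them as its subject codes. The URL pattern, the class name, and the use of a regular expression instead of a real HTML parser are all simplifying assumptions; a production aggregator would need something sturdier.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubjectCodes {
    // Assumption: subject links look like href="http(s)://<lang>.wikipedia.org/wiki/<Title>".
    // Fragments and query strings are cut off so that one article gives one code.
    private static final Pattern WIKI_LINK = Pattern.compile(
            "href=\"(https?://[a-z]+\\.wikipedia\\.org/wiki/[^\"#?]+)");

    /** Return the distinct Wikipedia article URLs linked from a posting, in order of appearance. */
    static Set<String> subjectCodes(String html) {
        Set<String> codes = new LinkedHashSet<>();
        Matcher m = WIKI_LINK.matcher(html);
        while (m.find()) {
            codes.add(m.group(1));
        }
        return codes;
    }

    public static void main(String[] args) {
        String post = "<p>A flight to <a href=\"http://en.wikipedia.org/wiki/Ottawa\">Ottawa</a>, "
                + "with a stop at <a href=\"http://www.example.org/\">another site</a>.</p>";
        System.out.println(subjectCodes(post));
    }
}
```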

Of course, like anything that people actually do on the web (as opposed to drawing-board architectures that never get implemented), this approach is far from perfect. Once the search engines are paying attention to Wikipedia links, some people will deliberately include misleading links to have their weblog entries miscategorized, though rankings like Technorati’s should help make sure that the most relevant ones stay near the top of the list. Furthermore, Wikipedia URLs do change, especially for the sake of disambiguation, so the Wikipedia URLs will never be 100% accurate as subject codes. And finally, the Wikipedia project itself could shut down, leaving all of the subject codes orphaned. Still, since linking to Wikipedia is something many of us do anyway, it looks like a good, quick-and-dirty webby alternative to the news industry’s subject codes; it might even work better.

Update: James Tauber posted the same idea with slightly different language back in October, and has just put up a followup.