Quoderat

Archive for November, 2005

White-texting Google

Thursday, November 24th, 2005

[Update: the white text no longer helps Oxcyon — they're not even the top hit for their own company name any more.]

I’ve just stumbled across the most extreme white-text example I’ve ever seen, and it belongs not to a porn site but to an enterprise software vendor. Check it out:

http://www.oxcyon.com/

Scroll to the bottom of the visible text and then start highlighting with your mouse to bring into view screen after screen of white-text key words and phrases custom-designed to get the attention of Google and other search engines. Or, if you prefer, just bring the site up in lynx.

I guess this kind of thing still works. I tried a Google search for “stellent taxonomy support”, and Oxcyon, not Stellent (a competitor), was the first hit.

Thanks, Lauren

Saturday, November 19th, 2005

Lauren Wood (by Tim Bray).

[Update: Lauren has posted her farewell message.]

On Thursday night in front of a packed banquet hall in Atlanta, Lauren Wood announced her retirement as chair of the annual fall XML conference, the world’s largest XML event.

Lauren has done an outstanding job organizing and building up this conference over the past five years, but that’s only one of her many contributions to the XML community. Lauren also chaired the W3C DOM working group from its inception to the release of DOM level 2, and before that, she worked for SoftQuad, producers of one of the leading SGML and XML editors.

Lauren is now at Sun Microsystems and starting to spend some of her time on Liberty Alliance work — given Lauren’s track record so far, I’m suddenly a little more optimistic about the prospects of shared digital identity and single sign-on for the web.

Photo by Tim Bray, copyright (c) Lauren Wood, used under a Creative Commons license.

Must-Ignore and Must-Understand

Wednesday, November 16th, 2005

I was listening to Tim Bray’s excellent talk On Language Creation today at the XML 2005 conference in Atlanta. Tim was talking about creating new XML-based markup languages (summary: “please don’t”), and in passing he mentioned the must-ignore/must-understand design pattern. For the first time, it occurred to me that this pattern has a serious flaw.

The pattern

The pattern works this way: you want to let people extend your XML-based language with new elements, and you want forward compatibility so that systems don’t break when you upgrade the language. It’s usually a good idea, then, to let applications simply ignore what they don’t understand (as is the case with HTML). That’s called must-ignore. For example, if your application sees this XML document

<record>
 <a>xxx</a>
 <b>xxx</b>
 <w>xxx</w>
 <c>xxx</c>
</record>

but it does not understand the w element (maybe you added it to hold extra information for a different application), it will just pretend that the w element wasn’t there, and might process the document as if it read

<record>
 <a>xxx</a>
 <b>xxx</b>
 <c>xxx</c>
</record>
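Here is a minimal sketch of must-ignore in Python, using the standard library's xml.etree.ElementTree. The KNOWN set and the ignore_unknown function are made-up names for illustration, and the sketch only looks at direct children of the record element; a real application would have its own idea of what it understands.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the element names this particular application understands.
KNOWN = {"a", "b", "c"}

def ignore_unknown(xml_text, known=KNOWN):
    """Parse a record and drop any child element the application doesn't know."""
    record = ET.fromstring(xml_text)
    for child in list(record):  # copy the list so we can remove while iterating
        if child.tag not in known:
            record.remove(child)
    return record

doc = "<record><a>xxx</a><b>xxx</b><w>xxx</w><c>xxx</c></record>"
kept = [child.tag for child in ignore_unknown(doc)]
print(kept)  # ['a', 'b', 'c']
```

The application carries on as though w had never been in the document.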

On the other hand, if w contained some kind of crucial information that would change the application’s processing, say by reversing the outcome or specifying an essential prerequisite (“turn off the oxygen first”), it would be better to have the application quit and report an error instead of chugging on ahead. That’s called must-understand. Some specifications, like SOAP, actually specify these rules inside the XML instance on an instance-by-instance basis, but most simply frame them in general terms in the specification.
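A rough sketch of the instance-by-instance flavour might look like the following. The unqualified mustUnderstand attribute here is an assumption made for the example, loosely modelled on the flag SOAP puts on header blocks; it is not SOAP's actual namespaced attribute, and KNOWN, MustUnderstandError and check_record are made-up names.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the element names this particular application understands.
KNOWN = {"a", "b", "c"}

class MustUnderstandError(Exception):
    """Raised when an unknown element is flagged as must-understand."""

def check_record(xml_text, known=KNOWN):
    """Return the tags that will be processed, or fail on a required unknown."""
    record = ET.fromstring(xml_text)
    understood = []
    for child in record:
        if child.tag in known:
            understood.append(child.tag)
        elif child.get("mustUnderstand") == "true":
            # Must-understand: quit and report instead of chugging on ahead.
            raise MustUnderstandError(f"cannot process <{child.tag}>")
        # Otherwise must-ignore: skip the element silently.
    return understood

optional = "<record><a>xxx</a><w>xxx</w><c>xxx</c></record>"
required = '<record><a>xxx</a><w mustUnderstand="true">xxx</w></record>'

print(check_record(optional))  # ['a', 'c']
try:
    check_record(required)
except MustUnderstandError as err:
    print("rejected:", err)  # rejected: cannot process <w>
```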

The problem

I realized today, however, that there’s a huge problem with this approach: must-ignore and must-understand are properties of a processing model, not a markup language. Consider an XML language for a business report: if I designate an element as must-understand, what do I really mean?

  1. An application must understand this element to copy this information into a database?
  2. A search engine must understand this element to index it?
  3. A formatting engine must understand this element to generate a PDF?
  4. An XML editing tool must understand this element to open the document?
  5. An XSLT engine must understand this element to do a transformation?
  6. An archiver must understand this element to save the report for auditing purposes (say, Sarbanes-Oxley requirements)?

Each of these represents a different processing model for the same XML document. The must-understand and must-ignore constraints will likely be different for each one, so they’re obviously not properties of the XML-based markup language. Some XML languages, like SOAP and Atom, are specified explicitly as parts of protocols, so the must-understand/must-ignore constraints are part of the protocol specification, but even then, once you have XML, you never know what clever things people will decide to do with it.
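To make the point concrete, here is a small sketch in which the must-understand rule lives with each processing model rather than with the document. The three model names and their policies are invented for illustration; nothing here comes from a real specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical policies: what each processing model understands, and which
# elements it cannot safely skip. The document itself says nothing about this.
POLICIES = {
    "archiver":        {"understands": set(),           "must_understand": set()},
    "formatter":       {"understands": {"a", "b", "c"}, "must_understand": set()},
    "database_loader": {"understands": {"a", "b", "c"}, "must_understand": {"w"}},
}

def outcome(model, xml_text):
    """Return 'process' or 'reject' for one processing model and one document."""
    policy = POLICIES[model]
    tags = {child.tag for child in ET.fromstring(xml_text)}
    # Reject only if the document contains an element this model must
    # understand but does not.
    blocked = (tags & policy["must_understand"]) - policy["understands"]
    return "reject" if blocked else "process"

doc = "<record><a>xxx</a><b>xxx</b><w>xxx</w><c>xxx</c></record>"
results = {model: outcome(model, doc) for model in POLICIES}
print(results)
# {'archiver': 'process', 'formatter': 'process', 'database_loader': 'reject'}
```

The same record gets three different answers, and none of them can be read off the markup alone.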