I was listening to Tim Bray’s excellent talk On Language Creation today at the XML 2005 conference in Atlanta. Tim was talking about creating new XML-based markup languages (summary: “please don’t”), and in passing he mentioned the must-ignore/must-understand design pattern. For the first time, it occurred to me that this pattern has a serious flaw.
The pattern
The pattern works this way: you want to let people extend your XML-based language with new elements, and you want forward compatibility so that systems don’t break when you upgrade the language. It’s therefore usually a good idea to let applications simply ignore what they don’t understand (as is the case with HTML). That’s called must-ignore. For example, if your application sees this XML document
<record>
<a>xxx</a>
<b>xxx</b>
<w>xxx</w>
<c>xxx</c>
</record>
but it does not understand the w element (maybe you added it to hold extra information for a different application), it will just pretend that the w element wasn’t there, and might process the document as if it read
<record>
<a>xxx</a>
<b>xxx</b>
<c>xxx</c>
</record>
On the other hand, if w contained some kind of crucial information that would change the application’s processing (say, by reversing the outcome or specifying an essential prerequisite such as “turn off the oxygen first”), it would be better to have the application quit and report an error instead of chugging on ahead. That’s called must-understand. Some specifications, like SOAP, actually specify these rules inside the XML instance on an instance-by-instance basis, but most simply frame them in general terms in the specification.
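To make the two rules concrete, here is a minimal Python sketch of a consumer that applies them to the record example above. The names `KNOWN`, `MustUnderstandError`, and `process_record` are made up for illustration, and passing the must-understand set as a parameter is just one assumed way of stating the rule; it is not how SOAP or any particular specification does it.

```python
import xml.etree.ElementTree as ET

RECORD = "<record><a>xxx</a><b>xxx</b><w>xxx</w><c>xxx</c></record>"

# Elements this hypothetical application knows how to process.
KNOWN = {"a", "b", "c"}


class MustUnderstandError(Exception):
    """Raised when an unrecognized element is flagged must-understand."""


def process_record(xml_text, must_understand=frozenset()):
    """Return the known fields of a record.

    Unknown elements are skipped (must-ignore) unless their name is in
    must_understand, in which case processing stops with an error.
    """
    result = {}
    for child in ET.fromstring(xml_text):
        if child.tag in KNOWN:
            result[child.tag] = child.text
        elif child.tag in must_understand:
            raise MustUnderstandError(f"cannot process <{child.tag}>")
        # Otherwise: must-ignore, so pretend the element was not there.
    return result


if __name__ == "__main__":
    # Must-ignore: <w> is silently dropped.
    print(process_record(RECORD))
    # Must-understand: <w> stops processing.
    try:
        process_record(RECORD, must_understand={"w"})
    except MustUnderstandError as err:
        print("error:", err)
```

With the default empty set, the w element is dropped and the result is the same as for the shorter document; with w in the must-understand set, the same input is rejected.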
The problem
I realized today, however, that there’s a huge problem with this approach: must-ignore and must-understand are properties of a processing model, not a markup language. Consider an XML language for a business report: if I designate an element as must-understand, what do I really mean?
- An application must understand this element to copy this information into a database?
- A search engine must understand this element to index it?
- A formatting engine must understand this element to generate a PDF?
- An XML editing tool must understand this element to open the document?
- An XSLT engine must understand this element to do a transformation?
- An archiver must understand this element to save the report for auditing purposes (say, Sarbanes-Oxley requirements)?
Each of these represents a different processing model for the same XML document. The must-understand and must-ignore constraints will likely be different for each one, so they’re obviously not properties of the XML-based markup language. Some XML languages, like SOAP and Atom, are specified explicitly as parts of protocols, so the must-understand/must-ignore constraints are part of the protocol specification, but even then, once you have XML, you never know what clever things people will decide to do with it.
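The point can be sketched in a few lines of Python: the same document passes or fails depending on which consumer is asking. Everything here is hypothetical (the consumer names, the `POLICIES` table, and the `blocking_elements` helper), and the policies are invented only to show that the constraint lives with the processing model rather than with the markup.

```python
import xml.etree.ElementTree as ET

RECORD = "<record><a>xxx</a><b>xxx</b><w>xxx</w><c>xxx</c></record>"

# Hypothetical processing models for the same document. Each one has its
# own idea of what it understands and what it cannot safely skip.
POLICIES = {
    # Copies fields into a database; assume <w> changes how rows are stored.
    "database_loader": {"understands": {"a", "b", "c"}, "must_understand": {"w"}},
    # Indexes text; an unknown element is harmless to it.
    "search_indexer": {"understands": {"a", "b", "c"}, "must_understand": set()},
    # Stores the document verbatim; it needs to understand nothing.
    "archiver": {"understands": set(), "must_understand": set()},
}


def blocking_elements(xml_text, consumer):
    """List the elements that force this consumer to reject the document."""
    policy = POLICIES[consumer]
    tags = {child.tag for child in ET.fromstring(xml_text)}
    unknown = tags - policy["understands"]
    return sorted(unknown & policy["must_understand"])


if __name__ == "__main__":
    for name in POLICIES:
        print(name, blocking_elements(RECORD, name))
```

Only the database loader rejects the record here; the indexer and the archiver accept the identical bytes, which is why a single must-understand flag on the w element cannot be right for all of them.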