Quoderat

Archive for March, 2006

Two small, useful Nautilus shell scripts

Wednesday, March 29th, 2006

If you use a Unix-family operating system with the Gnome desktop and its default Nautilus file browser, you might know that you can extend Nautilus using simple shell scripts. Here are two short and simple scripts.

Terminal window

This script, which I saved in my ~/.gnome2/nautilus-scripts/ directory as Shell, pops up a terminal window already set to the directory you’re browsing. If you want to do anything too complicated for Nautilus (or too tedious to do using a mouse), this is much more convenient than manually opening a shell window and changing to the directory you’re already browsing:

#!/bin/sh

/usr/bin/gnome-terminal

Software build

This script, which I saved in my ~/.gnome2/nautilus-scripts/ directory as Make, builds a Makefile-based application inside GNU Emacs, so that you can easily step through any errors in the source files (it would be easy to modify this to use Apache Ant or something similar):

#!/bin/sh

/usr/bin/emacs --eval '(compile "/usr/bin/make")'

It wouldn’t be too hard to rig up a variant of this to do well-formedness checking and validation of XML documents.
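A rough sketch of what such a variant might look like, assuming that xmllint (from libxml2) is installed and relying on Nautilus passing the selected files to the script as arguments. The function name check_xml_files and the XML_CHECKER override are made up for this example:

```shell
#!/bin/sh
# Sketch of a Nautilus script that reports which selected files are
# well-formed XML.
# Assumption: xmllint (libxml2) is on the PATH. XML_CHECKER is a
# hypothetical override for substituting another checker command.

check_xml_files () {
  for f in "$@"; do
    # The checker is deliberately unquoted so that "xmllint --noout"
    # splits into a command and its option.
    if ${XML_CHECKER:-xmllint --noout} "$f" >/dev/null 2>&1; then
      echo "OK: $f"
    else
      echo "NOT WELL-FORMED: $f"
    fi
  done
}

# Nautilus passes the selected files as arguments; with no selection,
# the loop simply does nothing.
check_xml_files "$@"
```

To see the report from the GUI, the last line could be piped into a dialog tool such as zenity (check_xml_files "$@" | zenity --text-info), if that happens to be installed. For validation rather than just well-formedness checking, xmllint also has a --valid option.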

Simple is beautiful

I wasn’t lying when I wrote that these are short and simple scripts — it’s hard to believe how useful they are until you actually use them for a couple of days. It’s possible to do much more elaborate things with Nautilus and shell scripts, including operating on files selected in the GUI window, but as usual in tech, the biggest benefit comes from the lowest-hanging fruit.

Does anyone else have any nice 1- or 2-liners? I assume that KDE’s file browser has similar functionality, so scripts from there would also be interesting.

The REST schism and the REST contradiction

Saturday, March 25th, 2006

Update: a proposal for a better name.

Don Box got people talking last week in a posting where he distinguishes between two kinds of REST: lo-REST, which uses only HTTP GET and POST, and hi-REST, which also uses HTTP PUT and DELETE.

The schism

If this distinction doesn’t seem very important, don’t worry — it’s not. Tim Bray captured the most important point, that Don Box (who is heavily involved in REST’s nemesis, Web Services) is talking positively about REST at all. For the RESTafarians and some of their friends, however, Box’s heresy was even worse than his former non-belief, because heresy can easily lead the faithful astray: witness strong reactions from Dimitri Glazkov, Jonnay (both via Dare Obasanjo), and Dare Obasanjo himself. There is even a holy scripture, frequently cited to clinch arguments.

The contradiction

I do not yet have a strong opinion on which approach is better, but I do see a contradiction between the two arguments I hear most often from REST supporters:

  1. REST is superior to Web Services/SOAP/SOA because it’s been proven to work on the Web.
  2. Almost nobody on the Web uses REST correctly.

Pick one, and only one, of these arguments, please. As far as I can see, apart from a few rare exceptions (like WebDAV), Don’s lo-REST (HTTP GET and POST only) is what’s been proven on the web. The pure Book of Fielding, hi-REST GET/POST/PUT/DELETE version is every bit as speculative and unproven as Web Services/SOAP/SOA themselves (that’s not to say that it’s wrong; simply that it’s unproven). Some REST supporters, like Ryan Tomayko, acknowledge this contradiction.

(Update) A better name?

Tim Bray proposes throwing out the REST name altogether and talking instead about Web Style. I like that idea, though the REST name may be too sticky to get rid of by now. Dumping the REST dogma along with the name would clear up a lot of confusion: HTTP GET and POST have actually been proven to work and scale across almost unimaginable volumes; on the other hand, like the WS-* stack, using HTTP PUT and DELETE remains a clever design idea that still needs to be proven practical and scalable.

XML 2006: Paper Tracks

Tuesday, March 21st, 2006

For XML 2006, which will be held in Boston from 5-7 December, we’ve decided to introduce four paper tracks. Each track will extend the full three days and will serve as its own mini-conference, concentrating on a specific area of interest (though we hope to see a lot of people moving among tracks):

  1. Enterprise XML Computing: XML in the world of big business and government — legacy system integration, service-oriented architecture, REST and web services, etc.

  2. XML on the Web: XML outside the firewall — AJAX, blogging technologies (RSS and Atom), Web 2.0, Semantic Web, publish/subscribe, tagging, etc.

  3. Documents and Publishing: authoring, managing and publishing information using XML — DITA, Docbook, XSL(T/-FO), XHTML, and much, much more.

  4. Hands-on XML: practical, workshop-oriented sessions, including last year’s popular Masters Series, case studies, tutorials, workshops, and live demos.

The official call for papers will go out at XTech 2006 in Amsterdam on Wednesday 17 May, and I hope to see many of you there. In the meantime, we’re counting on you to keep coming up with papers that educate, dazzle, and challenge, so please start thinking about what you’d like to propose for one or more of these tracks. Comments are, of course, very welcome.

RFC: (Java) SAX exceptions and new minor SAX version

Sunday, March 12th, 2006

(Note that this is not a major API change, and does not affect non-Java versions of SAX.)

Over on the sax-devel mailing list, Norman Walsh, who is involved with JAXP at Sun, has requested a small change to the SAXException class (see the archived thread).

When we were designing SAX quite a few years back, we needed the ability to embed an exception in another exception but Java did not support that, so we designed our own support. Starting with JDK 1.4, Java has supported embedded exceptions through the getCause method. Implementing getCause in SAXException would allow for more accurate stack traces and debugging, among other things.

Unfortunately, there is never such a thing as a perfectly backwards-compatible change. Chris Burdess pointed out that this change will break Java code that was calling initCause manually, and obviously, there will be some other differences in behaviour depending on which version of SAX people use. I believe that bringing SAX in line with modern Java usage (JDK 1.4 has itself been around for a while) is worth the trouble, and that very few applications would experience problems, but I’d like to see some wider discussion before I decide to put out a minor SAX release. Please let me know what you think, either by subscribing to the sax-devel list, posting a comment here, or posting your own blog entry and pinging this one.

Programming languages of distinction

Monday, March 6th, 2006

Via Ongoing, I read some interesting discussions of programming languages — mainly Python vs. Ruby, with most people happily dumping on Java.

Steve Yegge, in particular, argues that language success is based mainly on marketing, and that Python is doomed to obscurity because of the community’s lack of marketing savvy.

The programming language cycle

While I agree that Python probably is doomed to perpetual obscurity at this point, I think that Yegge’s focus on marketing is oversimplistic; instead, I’d argue that there’s a self-perpetuating cycle at work for successful programming languages:

  1. Elite (guru) developers notice too many riff-raff using their current programming language, and start looking for something that will distinguish them better from their mediocre colleagues.
  2. Elite developers take their shopping list of current annoyances and look for a new, little-known language that apparently has fewer of them.
  3. Elite developers start to drive the development of the new language, contributing code, writing libraries, etc., then evangelize the new language.
  4. Sub-elite (senior) developers follow the elite developers to the new language, creating a market for books, training, etc., and also accelerating the development and testing of the language.
  5. Sub-elite developers, who have huge influence (elite developers tend to work in isolation on research projects rather than on production development teams), begin pushing for the new language in the workplace.
  6. The huge mass of regular developers realize that they have to start buying books and taking courses to learn a new language.
  7. Elite developers notice too many riff-raff using their current programming language, and start looking for something that will distinguish them better from their mediocre colleagues.

You’ll notice that there’s no step here called “marketing”; instead, there are several distinct stages of evangelization and community building. Major vendors (other than the language’s owner, if it’s a vendor) will start to notice the language once the second wave (sub-elite) developers arrive, and IT managers will notice it because of books, magazine articles, and pressure from the high-end developers. Some — possibly a lot — of marketing will come out of those steps, but it is as much a result of the language’s success as a cause.

Points of failure

In this cycle, there are a few highly probable points of failure:

  • Timing: A new language might not be at the right stage of development (too raw, or too stale) at the time when elite developers decide to make a mass migration.
  • Features: If the new language’s features don’t answer the elite developers’ annoyance list, not enough of them will migrate to it.
  • Openness: Elite developers are used to having a lot of influence, and if the new language’s development process does not allow them sufficient say in the new language’s evolution, they will leave before they attract enough sub-elite developers.
  • Tools: Sub-elite developers might find the language unsuitable for day-to-day production use, especially if enough basic tools are not available (libraries, testing, debugging, GUI tools, performance measurement, etc.).
  • General acceptance: Regular developers might object to the new language and sabotage projects using it, either by producing poor-quality code or by missing deadlines (and blaming the new language in both cases).

Most programming languages stumble over one or more of these — it’s as much luck as clever design when a language like C++ or Java makes it past the hurdles and into the workplace. Success tends to draw more success, money draws more money, etc.

The final and most important point here is that a programming language’s perceived coolness will always suffer from its success. Java cannot possibly still be cool when there are thousands of regular developers slaving away in the bowels of ACME Widgets using it to write enterprise applications. If, in fact, Ruby displaces Java in the enterprise (which may not happen, since Ruby has no advantage over Java to match Java’s memory-management advantage over C++), it will suffer precisely the same fate, and we can expect Bruce Tate to write a book Beyond Ruby in five years or so.

By that measure, Python’s very failure is a kind of success: as long as it never really takes hold in the workplace, it will always carry a small degree of distinction with it, and at least a few elite developers won’t feel pressured to move on. Like a movie or band that never becomes too popular, Python will hang onto its snob appeal.

PHP, XML, and Unicode

Wednesday, March 1st, 2006

Update: in a comment John Cowan points out the obvious, that a UTF-8 escape sequence can never contain an ASCII character (because the high bit is always set, as I knew but failed to register). As a result, my xml_escape() function is way over-complicated. Thanks, John.

Update #2: in a comment, Jirka Kosek points out that PHP5 is actually using the also-excellent libxml instead of Expat — the PHP developers actually ported the expat-based, low-level interface to libxml so that it wouldn’t break legacy code. In that case, I’m especially impressed that my script produces byte-for-byte identical output with PHP4 and PHP5. I’m still looking for a problem with PHP’s XML+Unicode handling (other than the inconvenience of working with UTF-8 on the byte level).

Update #3: here’s a good summary of XML support in PHP5

A couple of weeks ago, Tim Bray posted about PHP and received a firestorm of comments, just as I did when I posted about PHP and Ruby on Rails almost a year ago. PHP generates a lot of passion, for good or for ill: my posting still gets a new comment every week or two.

As Tim updated his posting with comments, he linked to a two-year-old posting by Steve Minutillo about PHP4’s inability to detect character encodings in XML files and other Unicode bugs. That caught me by surprise — after all, PHP uses the venerable Expat as its XML parsing engine (the same engine used in most programming environments other than Java), and if Expat wasn’t getting things right, then the PHP people must have gone way out of their way to misconfigure it.

Testing Unicode support

To test XML character-encoding support in PHP, I used two PHP versions: 4.4.0, and 5.0.5 (which happen to be the current PHP4 and PHP5 heads in Ubuntu). I wrote a simple identity transform script (available for download at http://www.megginson.com/Software/xml-identity-transform.php — please consider it Public Domain) to read an XML file and write a simplified version of it back out again (I forgot to include processing instructions — sorry. I’ll fix that later.) The script always produces UTF-8 output, regardless of the input encoding. I ran it under both PHP4 and PHP5 against two XML source files with accented characters: one encoded in UTF-8, and the other encoded in ISO-8859-1 (with a suitable XML declaration). The script produces identical and correct UTF-8 output under both PHP4 and PHP5 (at least, the versions I tested). There is no conditional code based on the PHP version, but I did have to set a couple of options carefully.
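For anyone who wants to repeat the test, here is a sketch of how a pair of source files like the ones described above could be generated from the shell. The function name make_test_files, the file names, and the sample word are all invented for this example; they are not the files actually used in the test:

```shell
#!/bin/sh
# Write the same tiny XML document twice: once encoded in UTF-8 and
# once in ISO-8859-1, each with a matching XML declaration.
# (Hypothetical helper and file names, for illustration only.)

make_test_files () {
  dir=$1
  # \303\251 is the two-byte UTF-8 encoding of e-acute (U+00E9).
  printf '<?xml version="1.0" encoding="UTF-8"?>\n<doc>caf\303\251</doc>\n' \
    > "$dir/utf8.xml"
  # \351 is the single ISO-8859-1 byte for the same character.
  printf '<?xml version="1.0" encoding="ISO-8859-1"?>\n<doc>caf\351</doc>\n' \
    > "$dir/latin1.xml"
}

# Only write files when a target directory is given on the command line.
if [ "$#" -gt 0 ]; then
  make_test_files "$1"
fi
```

Since the transform always writes UTF-8 and drops the XML declaration's encoding along the way, running it over both files and comparing the two results with cmp should show no differences if the parser is handling the input encodings properly.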

Setting up a PHP XML parser

Here’s how I set up my XML parser in PHP:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

The first line creates the parser (I’m not using Namespaces for this example, or it would look a little different.) The second line requests that the parser report element names, attribute names and values, content, and everything else to my application using UTF-8, no matter what the input encoding was. The final option undoes a mind-numbingly stupid default in PHP, where all element and attribute names are converted to upper case before being passed on.

Next, I register my event handlers with the parser (this step should be familiar to anyone who has ever programmed with Expat or SAX):

xml_set_element_handler($parser, 'start_element', 'end_element');
xml_set_character_data_handler($parser, 'character_data');

The handlers themselves are naively simple, attempting to recreate the XML markup reported to them:

function start_element ($parser, $name, $atts)
{
  echo("<$name");
  foreach ($atts as $aname => $avalue) {
    echo " $aname=\"" . xml_escape($avalue) . '"';
  }
  echo(">");
}

function end_element ($parser, $name)
{
  echo("</$name>");
}

function character_data ($parser, $data)
{
  echo(xml_escape($data));
}

The only complicated bit happens in the xml_escape function. Unfortunately, since I’m dealing with raw UTF-8, I have to know a bit about UTF-8 encoding to do the escaping; otherwise, my code might mistake part of a multi-byte escape sequence for an ampersand and replace it with an entity reference (note: this is all unnecessary; see John Cowan’s comment):

function xml_escape ($s)
{
  // Square-bracket string offsets work in PHP4 and PHP5 alike;
  // the older curly-brace form was removed in PHP 8.
  $result = '';
  $len = strlen($s);
  for ($i = 0; $i < $len; $i++) {
    if ($s[$i] == '&') {
      $result .= '&amp;';
    } else if ($s[$i] == '<') {
      $result .= '&lt;';
    } else if ($s[$i] == '>') {
      $result .= '&gt;';
    } else if ($s[$i] == '\'') {
      $result .= '&apos;';
    } else if ($s[$i] == '"') {
      $result .= '&quot;';
    } else if (ord($s[$i]) > 127) {
      // skipping UTF-8 escape sequences requires a bit of work
      if ((ord($s[$i]) & 0xf0) == 0xf0) {
        // lead byte 1111xxxx: copy four bytes unchanged
        $result .= $s[$i++];
        $result .= $s[$i++];
        $result .= $s[$i++];
        $result .= $s[$i];
      } else if ((ord($s[$i]) & 0xe0) == 0xe0) {
        // lead byte 1110xxxx: copy three bytes unchanged
        $result .= $s[$i++];
        $result .= $s[$i++];
        $result .= $s[$i];
      } else if ((ord($s[$i]) & 0xc0) == 0xc0) {
        // lead byte 110xxxxx: copy two bytes unchanged
        $result .= $s[$i++];
        $result .= $s[$i];
      }
    } else {
      $result .= $s[$i];
    }
  }
  return $result;
}
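John Cowan’s point (that no byte of a multi-byte UTF-8 sequence can be mistaken for an ASCII character such as an ampersand) is easy to check from the shell. This is only an illustrative sketch, with a made-up helper name, all_bytes_high; it dumps the bytes with od and tests that each one is 128 or higher:

```shell
#!/bin/sh
# Print "yes" if every byte on standard input has its high bit set
# (value >= 128), and "no" as soon as a plain ASCII byte turns up.
# (all_bytes_high is a hypothetical name used only for this demo.)

all_bytes_high () {
  # od -An -v -tu1 prints each input byte as an unsigned decimal number.
  for b in $(od -An -v -tu1); do
    [ "$b" -ge 128 ] || { echo no; return; }
  done
  echo yes
}

printf '\303\251' | all_bytes_high          # U+00E9, 2 bytes: prints yes
printf '\342\202\254' | all_bytes_high      # U+20AC, 3 bytes: prints yes
printf '\360\235\204\236' | all_bytes_high  # U+1D11E, 4 bytes: prints yes
printf '&' | all_bytes_high                 # plain ASCII: prints no
```

Because the bytes of a multi-byte sequence never fall in the ASCII range, an escaping routine can safely compare each byte against the five special characters and copy everything else through untouched.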

The rest of my code is just the normal Expat parsing loop: open a file (or URL), feed it to Expat in buffered chunks, and then report that the input is finished.

So where are the problems?

  • There may be huge problems that I somehow missed in my brief test.
  • The PHP XML documentation is not entirely clear about input and output character encodings, probably because the documentation writers were themselves a bit confused about this stuff.
  • It is possible (even likely) that bugs existed in both the PHP4 and PHP5 codebases two years ago when Steve wrote his piece, but have since been fixed.
  • It is a bit tricky working with UTF-8, since you have to remember to detect escape sequences. A PHP library would be nice. Or better yet, hide it completely, like Java does. Still, it’s only a nuisance, not a show-stopper.
  • Steve referred to the PHP XML parser’s mangling of numeric character references. Expat doesn’t do that. However, it is possible that people think numeric character references refer to their current encoding, rather than to the abstract Unicode character set, and that will get them into serious trouble.
  • Expat does not support all character encodings out of the box. In fact, XML parsers are required to support only UTF-8 and UTF-16 — use any other encoding (even ISO-8859-1) at your peril, since there’s no guarantee that other XML software will be able to read it.
  • People often forget to declare what encoding they’re using.
  • Anyone who serves XML documents as text/xml is going to get in trouble no matter what language people use, because of the reencoding that might take place.

Most of these problems are not unique to PHP — XML is hard and confusing, Unicode is hard and confusing, and when you put the two together, there’s lots of opportunity for human error.
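One of the problems listed above, a forgotten encoding declaration, is at least easy to spot. Here is a rough sketch with a made-up function name, declared_encoding, which assumes the XML declaration sits on the first line of an ASCII-compatible file (so it will not work for UTF-16 input):

```shell
#!/bin/sh
# Print the encoding named in a file's XML declaration, or nothing if
# the first line has no encoding pseudo-attribute.
# (declared_encoding is a hypothetical helper; this is a first-line
# heuristic, not a real XML parser.)

declared_encoding () {
  head -n 1 "$1" |
    sed -n 's/^<?xml[^>]*encoding=["'\'']\([^"'\'']*\)["'\''].*/\1/p'
}

# Report on any files given on the command line.
for f in "$@"; do
  enc=$(declared_encoding "$f")
  echo "$f: ${enc:-no encoding declared}"
done
```

When nothing is declared and there is no byte-order mark or external information such as an HTTP header, an XML parser will treat the document as UTF-8, so a file that is really ISO-8859-1 will fail as soon as it contains an accented character.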

I’d be interested in the URLs of well-formed XML documents in supported encodings (UTF-8, UTF-16, US-ASCII, or ISO-8859-1, I think) that do not work properly in recent versions of PHP4 or PHP5 with the simple identity-transformation script I posted. If there are deep problems with PHP, XML, and Unicode, rather than just user confusion, I’d like to know about them.