Quoderat

Archive for the 'tricks and tips' Category

Ready for Prime Time?

Saturday, May 10th, 2008

I bought a cheap HP C4280 printer-scanner-copier today, since my old HP 1210 finally gave up the ghost.

Installing the printer in Windows Vista

Installing the printer in Windows Vista wasn’t too difficult. I followed the instruction not to plug in the USB cable until asked, then inserted the supplied CD-ROM and authorized Vista to run the setup.exe program. I had to click through a few screens, then I plugged in the USB cable, let it autodetect the printer, and left it running over supper. The clicking-through took less than 15 minutes. When I came back in, it was finished, and I just had to dodge the ads attached to the end of the installation program. I think my non-computer-literate older relatives could have managed fine without any help from me.

Installing the printer in Ubuntu Linux

I turned on the computer. The HP C4280 appeared in the printer list.

Prime time

So who’s not ready for Prime Time on the desktop? No TV show on Prime Time is without flaws, and no OS is without flaws — Ubuntu still has trouble with some wireless networking cards, and pretty-much 100% of the tech support calls we made at XML 2007 were for Mac notebooks (Windows and Linux notebooks just worked, every time) — but Ubuntu makes it hard to argue that somehow Windows and Mac are good enough for the desktop, while Linux isn’t.

Strange web exploit attempt (?)

Monday, February 4th, 2008

In the search logs for OurAirports, I noticed a series of searches for URLs:

http://www.feliciano.de/Webgalerie/bilder/Italy/une/yiwul/
http://www.unduetretoccaate.it/codice/aseje/wocobo/
http://www.altaiseer-eg.com/ar/articles/jed/umut/

At first, I thought they might be a kind of link spam — some sites display recent searches — but when I checked one of the URLs, I found something totally unexpected:

<?php echo md5("just_a_test");?>

They’re all the same. This is almost certainly related to passwords: is there a known flaw in a PHP content-management system like Drupal, or in the PHP API for a search engine like Lucene, where this would do some damage, or is it just a test probing for weaknesses? Is the PHP code supposed to be served up literally like that, or should I be seeing the MD5 instead?
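One plausible reading (my assumption, nothing in the logs confirms it) is that this is a probe for remote file inclusion: if a vulnerable PHP app fetches and executes the remote file, its response will contain the MD5 digest instead of the literal PHP source, and the prober knows the site is exploitable. Here is a minimal Python sketch of the check such a scanner might run; the function name looks_executed is hypothetical.

```python
import hashlib

# The string hashed by the probe file found at the searched-for URLs.
PROBE_STRING = "just_a_test"
EXPECTED_DIGEST = hashlib.md5(PROBE_STRING.encode("ascii")).hexdigest()


def looks_executed(response_body: str) -> bool:
    """Guess whether a target ran the probe instead of echoing it.

    Hypothetical helper: returns True when the body contains the MD5
    digest and does not contain the raw PHP opening tag.
    """
    return EXPECTED_DIGEST in response_body and "<?php" not in response_body
```

If that reading is right, the hosting site is supposed to serve the PHP literally, and it is the victim that would end up printing the digest.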

LAMP stack stability

Thursday, January 10th, 2008

I’m using a single dedicated server to host ourairports.com, megginson.com, and a couple of minor domains. OurAirports is a database-heavy application using (currently) a MySQL v.5 database hosted on the same server. I’ll offload the database to a separate server if traffic keeps increasing, but as long as I’m getting compliments from tech people for my fast response times (mainly thanks to MySQL’s built-in query caching), there’s no point paying for extra hardware.

Uptime

My ISP set up the server for me last summer with a bare-bones Ubuntu distro, then I installed the extra packages I needed using aptitude over ssh. Since then, I’ve done many Ubuntu in-place upgrades, rolled out hundreds of changes and upgrades to the web apps and dozens to the database schema (some very significant), and upgraded WordPress n-teen times. Check this out:

$ uptime
 13:08:31 up 175 days, 10:02,  1 user,  load average: 0.23, 0.06, 0.02

That’s right — since my ISP first set up the server with a basic Ubuntu system, I’ve never had to restart it. In fact, if Apache and mod_php (PHP5) had ‘uptime’ commands, they’d show almost the same amount of time, since I restarted them only to make configuration changes in the first few days of setting up the server (unless apt stopped them to install a newer version during one of my upgrades). I’ve restarted MySQL more recently, but again, only to experiment with configuration changes (especially for fulltext).

-1 for being cool, +10 for having a life

Using reliable old technologies like Linux, Apache, MySQL, and PHP doesn’t win any cool points, but it certainly makes maintaining a web server and its applications easy. I can go on vacation, for example, without worrying about being able to get online to fix or restart my server every couple of days. I don’t have to stay up until 3:00 am on Sunday night so that I can take the server offline to roll out new software versions or bug fixes (aptitude installs any security fixes in place). I spend lots of time with my family. I go to my kids’ school concerts. I learned banjo and mandolin (why not, since I have the free time?).

It’s the developer, not the language

And yes, my PHP web app is easy to maintain and extend, because I designed it to be that way (I can often implement, test and roll out new features in a matter of minutes, even when they require database schema changes) — it’s the developer, not the programming language, that determines the quality and maintainability of an app. A lot of newbies use PHP, so there’s a lot of bad PHP out there, but the same can be said for any language, even Ruby.

Two problems with Google Maps for aviation

Wednesday, August 29th, 2007

I love Google Maps and their API, and am using it extensively in my new web site OurAirports. However, there are two problems that keep coming up for using Google Maps with an aviation application:

[Diagram of Mercator projection]

  1. Google Maps uses a Mercator Projection, grossly distorting the northern and southern parts of the world, and cutting off the area near the poles so that a few of the Antarctic airports don’t show up on my maps at all. I can understand the reasons for their choice, with simple panning and tile paging and a rectangular area, but it can make things look pretty silly sometimes (such as Greenland and Africa appearing the same size).

  2. Google Maps does not provide an API call to draw a great-circle path. This seems to me to be almost a no-brainer, and it’s especially important in a Mercator projection, where the apparently straight paths drawn by the API are anything but (especially east-west). After messing with some out-of-date third-party libraries, I finally found some JavaScript at one site that does a good job on efficient, approximate great-circle paths, and am waiting to hear from the author about terms for reuse. Google might want to just go ahead and add this, though.
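For the second problem, the usual workaround is to compute intermediate points along the great circle and draw short straight segments between them. The sketch below is not the JavaScript I mentioned; it is a minimal Python version of the standard spherical intermediate-point formula, assuming a perfectly spherical Earth, with a hypothetical function name and parameters.

```python
import math


def great_circle_points(lat1, lon1, lat2, lon2, segments=32):
    """Return segments + 1 (lat, lon) pairs, in degrees, along the great circle.

    Assumes a spherical Earth. The path is undefined for antipodal
    endpoints (sin(d) is zero there), so callers should avoid that case.
    """
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))

    # Angular distance between the endpoints (haversine formula).
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    d = 2 * math.asin(math.sqrt(min(1.0, h)))
    if d == 0:
        return [(lat1, lon1)] * (segments + 1)

    points = []
    for i in range(segments + 1):
        f = i / segments  # fraction of the way from start to end
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        x = a * math.cos(p1) * math.cos(l1) + b * math.cos(p2) * math.cos(l2)
        y = a * math.cos(p1) * math.sin(l1) + b * math.cos(p2) * math.sin(l2)
        z = a * math.sin(p1) + b * math.sin(p2)
        lat = math.atan2(z, math.hypot(x, y))
        lon = math.atan2(y, x)
        points.append((math.degrees(lat), math.degrees(lon)))
    return points
```

More segments give a smoother curve at the cost of more points to draw; a few dozen is usually plenty at world scale.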

Aviation charts mostly use a Lambert conformal conic projection, which preserves angles and keeps the scale very nearly constant close to its standard parallels (so any two points the same distance apart on the chart are very nearly the same distance apart in the real world); however, the distortion grows quickly away from those parallels, so a usable chart generally shows much less than half the world at once, and it wouldn’t work for something like Google Maps.

[not] Protecting web sites and services from DNS rebinding attacks

Wednesday, August 1st, 2007

Update: Nope, my solution won’t work. As Christian Matthies points out in the comments, it is possible to spoof the HTTP Host header as well (his link in the comment is broken because of an extra comma, but this one works). As a kludge, browsers could be modified to prevent Host header spoofing, but (a) it would take a long time to deploy to the world at large, and (b) it would be only a bandaid for a much bigger problem.

Summary: While there’s no way to protect browsers against the DNS rebinding attack, you can protect web sites and web services by forcing them to check the HTTP Host header with every request. This is easy to do for RESTful services going through a regular web server like Apache — you get it by default with virtual hosts — but might be trickier for WS-* services.

If you or your company is using HTTP-based web services (either WS-* or REST), you might be in trouble — a new exploit allows a web site from outside your firewall to use a web browser as a proxy to read any web site or service inside your firewall.

Artur Bergman at O’Reilly has a posting on the DNS rebinding (aka anti-DNS-pinning) attack that works against all major browsers, including all versions of Firefox and MSIE. There’s no obvious general fix for this, though there’s a Firefox extension that helps a tiny bit.

The attack

In a DNS-rebinding attack, the attacker is able to force your browser to read data from any IP address that your browser has access to, even if you’re behind a router/firewall, by changing the IP address associated with a domain name you’ve connected to. That means that given an IP address, an outside attacker can read your local website (at 127.0.0.1), anything behind your corporate firewall (such as an Intranet accounting page or a web service), or, I think (haven’t tested yet), a website that you’re logged into using a cookie (HTTP authentication will force a popup, since the browser will see a different domain name, even if you’re logged into the site in another tab/window). If you run a local web server on your computer (say, at 127.0.0.1), you can go to http://www.jumperz.net/index.php?i=2&a=1&b=7, type in the local address, and see jumperz.net use the exploit to display the source of your home page.

The defence

There’s no way to protect the browser yet, but you can protect your HTTP-based sites and services from this attack very easily — in fact, many sites on the web are already unknowingly protected, though I don’t know if most enterprise web services are.

The trick is in the HTTP Host header. While the DNS rebinding attack can associate a new IP address with a hostname, it cannot change the hostname itself, so the browser will still send the original hostname to the new host. Nearly all shared-hosting servers (and many servers at dedicated hosts as well) will check the Host header to decide what pages to serve out. As long as the site does something harmless when it gets an unrecognized hostname (such as returning a “501 Not implemented” HTTP status code), the site will be safe from the attack. In Apache, for example, you use the ServerName directive for each virtual host, and just make sure that there’s a default virtual host that returns an error or at least does nothing harmful.
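If your service isn’t sitting behind Apache virtual hosts, the same check can be done in the application. Here is a minimal sketch as Python WSGI middleware, subject to the caveat in the update at the top of this post; the name require_known_host and its parameters are hypothetical, and the default status simply mirrors the 501 suggested above.

```python
def require_known_host(app, allowed_hosts, status="501 Not Implemented"):
    """Wrap a WSGI app so requests with an unrecognized Host get an error.

    Hypothetical helper. The port is stripped before comparison, so
    bracketed IPv6 literals never match and are always rejected.
    """
    allowed = {host.lower() for host in allowed_hosts}

    def middleware(environ, start_response):
        host = environ.get("HTTP_HOST", "").split(":")[0].lower()
        if host not in allowed:
            body = b"Unknown host\n"
            start_response(status, [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Recognized hostname: hand the request to the real application.
        return app(environ, start_response)

    return middleware
```

The important design choice is that a raw IP address or a missing Host header counts as unrecognized, which is exactly the case a rebinding attacker would produce.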

For Web Services, the same thing applies. It’s often tempting to use IP addresses instead of hostnames for web services (including RESTful services), especially during development, but doing so opens you right up to a DNS-rebinding attack, which could be very harmful if you’re using real data for development and testing. To protect your HTTP-based services from this attack, you need to make sure that every web service is accessed via a hostname rather than a raw IP address, and that every service checks its hostname. For RESTful services, this is trivially easy (since you’re probably going through Apache or something similar anyway, just as with a web site); for WS-* services, I don’t know the implementations well enough to be sure, but it should be possible to force them to check the Host header somehow.

Even if you’re not building web services, managing an enterprise intranet, or running a public web site, don’t forget to protect the web server on your local computer, if you have one.

Three simple tips for LAMP web site developers

Saturday, July 21st, 2007

You’ve learned to write some basic HTML, CSS, PHP/Python/Perl and SQL, found a hosting service, and are ready to create your first LAMP web application. You’ve already read a bit about security (you know always to escape user-supplied parameters, etc.). Here are three very simple tips that will help you along right at the start, without getting caught up in religious wars about frameworks, MVC, REST, abstraction, object orientation, etc.:

  1. Keep all the database code together. Put all your database calls into a single source file if you can — functions like mysqli_query (PHP) should never appear anywhere else but in this file — and create neutral functions like get_member() or delete_cart() for the rest of your code to call. The reason for this is not so that you can switch databases in the future (that’s easy enough to fix), but so that you can easily do a search/replace when you rename or modify tables. If all your database code is in the same place, your application will be orders of magnitude easier to maintain and upgrade a few months from now. Seriously.

  2. Make an extra database for junk. If your hosting account allows more than one database, create at least two, say “foo” and “foo_cache” — put all the tables you need to back up into the first one, and all the stuff you don’t need to back up (views, caching tables, session states, etc.) into the second. Write a SQL script to automatically regenerate any required tables in “foo_cache” when you restore. That way, you won’t waste time and bandwidth every day backing up megabytes or gigabytes of stuff you don’t need and can easily regenerate.

  3. Make GET harmless. If you use HTTP GET (e.g. $_GET in PHP) to do things like deleting or modifying records, bad things will happen to your application — search engines will start randomly changing your database by following links (robots.txt might not be enough to protect you), browsers will delete records by trying to precache pages, etc. Always use POST (normally from a form button) for anything that can make a change. More here.
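To make the first tip concrete, here is a minimal sketch of a single database module in Python. It uses the standard-library sqlite3 as a stand-in for MySQL, and the table layout, open_db and add_member are assumptions made up for illustration; only get_member and delete_cart come from the tip itself.

```python
import sqlite3

# All SQL lives in this one module; the rest of the application only
# calls the neutral functions below and never sees a query string.


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) tables if needed."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute("CREATE TABLE IF NOT EXISTS member "
                 "(id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("CREATE TABLE IF NOT EXISTS cart "
                 "(member_id INTEGER PRIMARY KEY, items TEXT NOT NULL)")
    return conn


def add_member(conn, name):
    """Insert a member and return the new id."""
    cur = conn.execute("INSERT INTO member (name) VALUES (?)", (name,))
    conn.commit()
    return cur.lastrowid


def get_member(conn, member_id):
    """Return the member as a dict, or None if there is no such member."""
    row = conn.execute("SELECT id, name FROM member WHERE id = ?",
                       (member_id,)).fetchone()
    return dict(row) if row else None


def delete_cart(conn, member_id):
    """Remove a member's cart; harmless if the cart does not exist."""
    conn.execute("DELETE FROM cart WHERE member_id = ?", (member_id,))
    conn.commit()
```

Renaming the member table now means editing one file, and the placeholders (the question marks) take care of escaping user-supplied values.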

Coding lessons from university

Wednesday, June 27th, 2007

Dare Obasanjo, smart code guy and occasional punching bag for the anti-Microsoft people, is collecting lists of Three Things I Learned About Software In College. I posted mine in a comment on his blog, but decided to reproduce them here. Note that these are not lessons you learned 10 or 20 years later, but what you discovered back then.

I coded a lot in university — some of it for pay — but fortunately, I didn’t study computer science or engineering. Here are my major lessons:

  1. Readable code goes further and survives longer than optimized code, especially once you’re no longer the one maintaining it (or if you have to come back to it two years later).

  2. If you write code that makes you feel like a genius, throw it out — you’ll realize later that it’s crap. If you write code that makes you feel like a competent tradesman, you’re on the right track.

  3. No matter how smart you are, everyone — even the most incompetent loser of a coder — knows at least one thing you don’t. It’s a good idea to listen.

Note: If you want to record your own list of three things, please leave it as a comment to Dare’s original posting, not here.

My biggest problem with Wikipedia

Friday, June 22nd, 2007


Summary: You can’t partition a web site’s users into discrete groups by language.

I don’t worry much about Wikipedia’s objectivity or reliability (no sources, especially not newspapers or Britannica, are objective or reliable, and at least Wikipedia preserves its conflicts and controversies in comments and edit history), but I do have one big problem with the project: WHY THE *^%*& DON’T THEY HAVE SINGLE SIGN-ON?

I usually edit in English, but I can also make at least minor contributions to Wikipedia in French, German, Spanish, Italian, and Latin, and sometimes also contribute to Wikimedia. Every one of those requires me to create a separate account! It is absurd that my username and password for en.wikipedia.org won’t work for fr.wikipedia.org.

Don’t make this mistake with your own webapps, kids. Lots of people in the world are comfortable working in more than one language, even if they’re not fluent in all. It’s good to make a site available in more than one language, but don’t expect language to partition your users into discrete groups. Don’t lock them into a single language with a cookie, or limit their accounts to one language domain — multilingualism is extremely common around the world, even in the U.S. (how many American users would want to be able to use a site in English and Spanish if given the opportunity?)

REST, the Lost Update Problem, and the Sneakernet Test

Saturday, June 9th, 2007

Dare Obasanjo is giving a bit of pushback on the Atom Publishing Protocol, but the part that caught my attention was the section on the Lost Update Problem. This doesn’t have to do with REST per se as much as with the choice not to use resource locking, but since REST people tend to like their protocols lightweight, the odds are that you won’t see exclusive locks on RESTful resources all that often (it also applies to some kinds of POST updates as well as PUT).

How to lose a REST update

  • I check out a resource about “John Smith” (as a web form or an XML document, for example), and correct the first name field to “Jon”.
  • You check out the same resource, and correct the last name field to “Smyth”.
  • I check in my changes.
  • You check in your changes.

You have corrected the last name to “Smyth”, but have inadvertently overwritten my correction of the first name with the old value “John”, because you never saw my update.

Detection, not avoidance

Without exclusive locks, there’s no way to avoid this problem, but it is possible to detect it. What happens after detection depends on the application: if it’s interactive, for example, you might redisplay the form with both versions side by side. I don’t mean to diminish the difficulty of dealing with check-in conflicts and merges (it’s a brutally hard problem), but it’s one that you’ll have whenever you choose not to use exclusive resource locks (and even with resource locks, the problem still comes up if someone’s lock expires or is overridden). Managing multi-user resource locks properly can require a lot of extra infrastructure, and they have all kinds of other problems (ask an enterprise developer about the stale lock problem), so there are often good reasons to avoid them.

State goes in the resource, not the HTTP header

Dare points to an old W3C doc that talks about doing lost-update detection using all kinds of HTTP-header magic, requiring built-in support in the client (such as a web browser). That doesn’t make sense to me. A better alternative is to include version information directly in the resource itself. For example, if I check out the record as XML, why not just send me something like this?

<record version="18">
  <given-name>John</given-name>
  <family-name>Smith</family-name>
</record>

If I check it out as an HTML form, my browser should get something like this:

<form method="post" action="/actions/update">
  <div>
    <input type="hidden" name="version" value="18" />
    Given name: <input name="given-name" value="John" />
    Family name: <input name="family-name" value="Smith" />
    <button>Save changes</button>
  </div>
</form>

When you check out the resource, you’ll also get version 18. However, when I check in my changes (using PUT or POST), the server will bump the resource version to 19. When you try to check in your copy (still at version 18), the server will detect the conflict and reject the check-in. Again, what happens after that depends on your application.
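The server side of that check is only a few lines. Here is a minimal in-memory sketch in Python; the class and method names (RecordStore, check_out, check_in, VersionConflict) are hypothetical, and a real application would keep the version in the database row rather than in a dict.

```python
class VersionConflict(Exception):
    """Raised when a check-in is based on a stale version."""


class RecordStore:
    """Toy store that detects lost updates with a version in the resource."""

    def __init__(self):
        self._records = {}  # record id -> (version, fields)

    def create(self, record_id, fields, version=1):
        self._records[record_id] = (version, dict(fields))

    def check_out(self, record_id):
        """Return a copy of the record with its version embedded."""
        version, fields = self._records[record_id]
        return {"version": version, **fields}

    def check_in(self, record_id, submitted):
        """Accept the update only if it was based on the current version."""
        current_version, _ = self._records[record_id]
        fields = dict(submitted)
        if fields.pop("version", None) != current_version:
            raise VersionConflict(
                f"record {record_id!r} is now at version {current_version}")
        self._records[record_id] = (current_version + 1, fields)
        return current_version + 1
```

Replaying the John Smith example: both of us check out version 18, my check-in succeeds and bumps the record to 19, and your check-in raises VersionConflict instead of silently restoring “John”.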

The Sneakernet Test

I think that this is far better than the old W3C solution, because (1) it’s already compatible with existing browsers, and (2) it passes what I call the Sneakernet Test: I can take a copy of the XML (or JSON, or CSV, or whatever) version of the resource to a machine that’s not connected to the net, edit it (say, on the plane), then check it back in from a different computer. I can copy it onto a USB stick, take it to the beach, edit it on my laptop, then take it back to work and check it back in. All the state is in the resource, not hidden away in cryptic HTTP headers.

By the way, if you don’t trust programmers to be honest when designing their clients, you can use a non-serial, pseudo-random version so that they can’t just guess the next version and avoid the merge problem, but serial version numbers should be fine most of the time.
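A non-guessable version can be as simple as a random token that the server replaces on every successful check-in. This is a minimal Python sketch using the standard secrets module; the function names are hypothetical.

```python
import secrets


def new_version_token(nbytes=16):
    """Return a fresh, unpredictable version token (hex string)."""
    return secrets.token_hex(nbytes)


def versions_match(submitted, current):
    """Compare the submitted token with the stored one in constant time."""
    return secrets.compare_digest(submitted, current)
```

The trade-off is that a random token no longer tells you how many revisions apart two copies are, only whether they differ.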

Anonymity and freedom

Monday, April 9th, 2007

Elliotte Rusty Harold is right that anonymity goes together with freedom, and I was happy to read his excellent posting How to Blog Anonymously. Rusty distinguishes three different kinds of anonymity — roughly “I don’t want to be embarrassed”, “I don’t want to be fired”, and “I don’t want to be hauled out of my bed by the secret police and shot” — and talks about the steps necessary to achieve each one.

Granted, anonymity has its ugly sides, like the disgusting online threats against Kathy Sierra and online abuse of Maryam Scoble, but it’s also sometimes the only conduit around the abusive authority of a government, employer, or even one’s peer group. As even Western democratic governments have become more authoritarian since 9/11, keeping these conduits open is more important than ever.

Granted, 99% or more of anonymous information is simply stupid or malicious, but if that’s the cost of freedom, it’s a relatively small cost to pay compared to the sacrifices our ancestors made to win us the freedoms in the first place.

REST: the quick pitch

Thursday, February 15th, 2007

Now that the Java world is noticing REST, the low-pain alternative to RPC standards like WS-*, people are starting to blog about it again. Gossip with other IT folks also tells me that people’s customers are actually asking for REST explicitly (rather than having to be convinced to use it). With that in mind, I’m going to try to explain what I think matters about REST, and what you can safely ignore.

The elevator pitch

With REST, every piece of information has its own URL.

If you just do that and nothing else, you’ve got 90%+ of REST’s benefits right off the bat. You can cache, bookmark, index, and link your information into a giant, well, web. It works — you’re reading this, after all, aren’t you? Betcha got here by following a link somewhere, not by parsing a WSDL to find what ports and services were available.

Real best practices

If you want to do REST well (rather than just doing REST), you can spend 2-3 minutes after your elevator ride learning a few very simple best practices to get most of the remaining 10% of REST’s benefits:

Use HTTP POST to update information. Here’s the simple rule: GET to read, POST to change. That way, nobody deletes or modifies something by accident when trying to read it.

Make sure your information contains links (URLs) for retrieving related information. That’s how search engines index the web, and it can work for other kinds of information (XML, PDF, JSON, etc.) as well. Once you have one thing, you can follow links to find just about everything else (assuming that you understand the file format).

Try to avoid request parameters (the stuff after the question mark). It’s much better to have a URL like

http://www.example.org/systems/foo/components/bar/

than

http://www.example.org/get-component.asp?system=foo&component=bar

Search engines are more likely to index it, you’re less likely to end up with duplicates in caches and hash tables (e.g. if someone lists the request parameters in a different order), URLs won’t change when you refactor your code or switch to a different web framework, and you can always switch to static, pregenerated files for efficiency if you want to. Exceptions: searches (http://www.example.org/search?q=foo) and paging through long lists (http://www.example.org/systems/?start=1000&max=200) — in both of these cases, it’s really OK to use the request parameters instead of tying yourself in a knot trying to avoid them.
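The duplicate problem is easy to demonstrate. In this minimal Python sketch, a hypothetical cache_key function has to sort the request parameters before two equivalent URLs are recognized as the same thing; a path-style URL needs no such step.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_key(url):
    """Normalize a URL so that request-parameter order does not matter."""
    parts = urlsplit(url)
    # Sort the (name, value) pairs; keep parameters with empty values.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, query, ""))


a = "http://www.example.org/get-component.asp?system=foo&component=bar"
b = "http://www.example.org/get-component.asp?component=bar&system=foo"
print(a == b)                        # False: the raw strings differ
print(cache_key(a) == cache_key(b))  # True once the parameters are sorted
```

Every cache, index and hash table between you and the reader would need to do this normalization, and most of them won’t.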

Avoid scripting-language file extensions. If your URLs end with “.php”, “.asp”, “.jsp”, “.pl”, “.py”, etc., (a) you’re telling every cracker in the world what exploits to use against you, and (b) the URLs will change when your code does. Use Apache mod_rewrite or equivalent to make your resources look like static files, ending in “.html”, “.xml”, etc.

Avoid cookies and URL rewriting. Well, maybe you can’t, but the idea of REST is that the state is in the thing the server has returned to you (an HTML or XML file, for example) rather than in a session object on the server. This can be tricky with authentication, so you won’t always pull it off, but HTTP authentication (which doesn’t require cookies or session IDs tacked onto URLs) will work surprisingly often. Do what you have to do to make your app work, but don’t use sessions just because your web framework tells you to (they also tie up a lot of resources on your server).

Speculative stuff (skip this)

The strength of REST is that it’s been proven through almost two decades of use on the Web, but not everything that some of the hard-core RESTafarians (and others) try to make us do has been part of that trial. Stop reading now if you just want to go ahead and do something useful with REST. Really, stop! Some of this stuff is moderately interesting, but it won’t really help you, and will probably just mess up your project, or at least make it slower and more expensive.

[maybe some day] Use HTTP PUT to create a resource, and DELETE to get rid of one. These sound like great ideas, and they add a nice symmetry to REST, but they’re just not used enough for us to know if they’d really work on a web scale, and firewalls often block them anyway. In real-life REST applications, rightly or wrongly, people just use POST for creation, modification, and deletion. It’s not as elegant, but we know it works.

[don't bother] Use URLs to point to resources rather than representations. Huh? OK, a resource is a sort-of Platonic ideal of something (e.g. “a picture of Cairo”), while a representation is the resource’s physical manifestation (e.g. “an 800×600 24-bit RGB picture of Cairo in JPEG format”). Yes, as you’d guess, it was people with or working on Ph.D.’s who thought of that. For a long time, the W3C pushed the idea of URLs like “http://www.example.org/pics/cairo” instead of “http://www.example.org/pics/cairo.jpg”, under the assumption that web clients and servers could use content negotiation to decide on the best format to deliver. I guess that people hated the fact that HTTP was so simple, and wanted to find ways to make it more complicated. Fortunately, there were very few nibbles, and this is not a common practice on the web. Screw Plato! Viva materialism! Go ahead and put “.xml” at the end of your URLs.

[blech] Use URNs instead of URLs. I think even the hard-core URN lovers have given up on this now. It’s precisely the kind of excessive abstraction that sent people running screaming from WS-* into REST’s arms in the first place (see also “content negotiation”, above), and it would be a shame to scare them away from REST as well. URLs are fine, as long as you make some minor efforts to ensure that they don’t change.

[n/a] REST needs security, reliable messaging, etc. The RESTafarians don’t say this, but I’m worried that the JSR (the Java REST group) will. We already have a secure version of HTTP (HTTPS, running over TLS/SSL), and it works fine for hundreds of thousands or millions of web sites. Reliable messaging can be handled fine in the application layer, since everyone’s requirements are different anyway, or maybe we want a reliable-messaging spec for HTTP in general. In either case, please don’t pile this stuff on REST.

So to sum up, just give every piece of information its own URL, then have fun.

XML 2006 pickled and preserved

Friday, January 19th, 2007

The XML 2006 site is now pickled and preserved for long-term storage. Almost all of the presenters got their papers or slides in for the proceedings, if not on time, at least in time. Unfortunately, if you want to see a paper or slides from one of the few who didn’t send us anything, you’ll now have to pester them directly.

Recipe for pickling a web site

The original site was a hand-rolled LAMP implementation, but it was designed from the start to be amenable to a static copy. To pickle it, I started by doing a recursive slurp of the live site using wget (with the -m option) — that generated permanent, static HTML copies of the dynamic, database-driven pages on the site. At that point, I had an almost, but not quite perfect static copy of the site, because there were two things that wget missed:

  1. Images referred to only in CSS stylesheets (such as the banner).
  2. CSS stylesheets referred to by other CSS stylesheets.

It took only a few minutes to add all of that by hand, and the site was ready to go.
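For a bigger site, the by-hand step could be scripted. This is a minimal Python sketch, not what I actually ran, that lists the files a stylesheet refers to so they can be fetched separately; the function name and the regular expressions are my own and only cover the common url(...) and @import "..." forms.

```python
import re

# url(...) with optional quotes, and @import "file.css" without url().
URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""", re.IGNORECASE)
IMPORT_RE = re.compile(r"""@import\s+['"]([^'"]+)['"]""", re.IGNORECASE)


def css_references(css_text):
    """Return the distinct files referenced by a stylesheet.

    Results are grouped by pattern (url(...) matches first, then bare
    @import strings), not in source order.
    """
    found = []
    for ref in URL_RE.findall(css_text) + IMPORT_RE.findall(css_text):
        ref = ref.strip()
        if ref and ref not in found:
            found.append(ref)
    return found
```

Run it over each stylesheet the crawl did pick up, then over any stylesheets it turns up, until nothing new appears.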

Why it worked

This will be old news to a lot of people reading, but a few simple advance steps (during site design) made later static preservation easy. Here’s what I did:

  • Every page has its own URL, period, end of discussion. No AJAX, no POST.
  • Every page (or at least, every page that we want to archive) is reachable, directly or indirectly, from the home page.
  • Script names are not shown to the public, so there are no URLs ending in “php” (hint: exposed script extensions like “php”, “asp”, or “jsp” are signs of gross incompetence in web design).
  • No web pages rely on exposed GET request parameters: for example, the URLs looked like /programme/presentations/123.html, not /programme/presentation?code=123, or even worse, /show-presentation.php?code=123.

And that’s it. Of course, if the site had included live forms, I would have had to remove those as well (and any links to them), but that wouldn’t have been much extra work.

On a final note, while the live site was hosted on an Apache server (the “A” in “LAMP”), the pickled site is hosted on a Microsoft IIS server. It made no difference at all — that’s the way Web standards are supposed to work.

Templating languages and XML

Saturday, December 23rd, 2006

Erich Schubert is talking about web templating languages. He’s looking for a pure-XML templating solution, but that might not be necessary for simple web-page design, where we don’t need all the extra benefits of heavy-duty transformation standards like XSLT.

Keeping it simple

For PHP-driven web sites, I’m a big fan of Smarty, which uses braces (“{” and “}”) to delimit template constructions. Braces have no special meaning to XML parsers (they’re just character data), so it’s possible to put a template expression inside an attribute value (for example), while keeping the template itself as well-formed XML and not requiring the elaborate paraphrastic expressions you need to set up attribute values in XSLT:

<p id="x-{$myvalue|escape}">Hello, world!</p>

Concurrent markup resurrected

Really, Smarty adds a second set of concurrent markup on top of the XHTML. Smarty constructs don’t have to balance with XML element boundaries, and with only a little care, I’ve never ended up with a Smarty template that wasn’t well-formed. JSP’s mistake was using something that looks like XML but isn’t quite, messing up parsers. Even the old SGML CONCUR feature would not have allowed markup inside attribute values. Sometimes there’s something to be said for using two different syntaxes when you’re trying to represent two different things.
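You can check the well-formedness claim with any XML parser. This minimal Python sketch feeds the Smarty line from above, plus a JSP-style equivalent that I made up for comparison, to the standard-library parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# The Smarty example from this post: braces are ordinary characters.
SMARTY_STYLE = '<p id="x-{$myvalue|escape}">Hello, world!</p>'

# Hypothetical JSP-style equivalent: "<" is not allowed in an attribute value.
JSP_STYLE = '<p id="x-<%= myvalue %>">Hello, world!</p>'

print(is_well_formed(SMARTY_STYLE))  # True
print(is_well_formed(JSP_STYLE))     # False
```

That means the Smarty template can go through XML tools (validators, editors, even XSLT) before it is ever filled in.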

How not to suck at your presentation

Tuesday, November 21st, 2006

So you’re going to speak at a conference. Congratulations!

I cannot help you much with making your presentation interesting, but at a minimum, you want it not to suck — “suck” is what happens when you annoy dozens or hundreds of people by making them wait 15 minutes while you deal with easily-avoidable technical problems. Incompatibilities between laptop computers and projectors are still common with all types of hardware and operating systems, so it is never safe to assume that your computer will work this time, even if it has in the past: I’ve seen Linux, Windows, and Mac users, in roughly equal proportions, all fall flat on their posteriors muttering phrases like “but it’s always worked before…” Things will go wrong, but you can minimize the damage by following some simple guidelines:

  • Carry an extra copy of your presentation on a CD-ROM or USB memory stick. If you have a last-minute technical problem, you can always borrow another computer and finish your presentation after only a very brief delay. Mailing a copy to yourself at a webmail address is also a good idea.
  • A screen resolution of 1024×768 is usually safe. Higher resolutions may or may not work, depending on the projector, so know how to change your resolution quickly if you need to. Seriously: practice changing your resolution at home.
  • Disabling screensavers and screen blanking will improve your chances of a successful presentation.
  • Have a backup plan if network connectivity slows down or fails (e.g. a local demo) — even if it tests OK beforehand, it might not work when there are 100 other people in the same room, using the same hub, during your presentation.
  • Start all programs (web browsers, editors, live demos, etc.) and open all windows you need before you start, and then switch to them as you need them. Murphy’s law clearly states that trying to launch a program during your presentation will fail in the worst possible way.
  • If you are using programs other than a slide presentation (such as a text editor with source code, or a web browser), set the fonts to a much larger size than normal so that the audience can read them. 18 point text is the absolute minimum, and 24 point is generally better; 12 point text or smaller is completely unreadable, especially for audience members near the back.
  • Make sure that your battery is fully charged, even if you plan to plug in your notebook during the presentation.
  • Create a separate profile or account on your computer for presentations, so that all your regular icons, bookmarks, etc. are not sitting on the screen in public view, an IM window doesn’t pop open in the middle of the presentation, etc.

I know that this is obvious, but almost every failed presentation I’ve seen failed because the presenter didn’t follow one of these steps. Go figure.

Gap buffers

Wednesday, June 7th, 2006

Tim Bray updated an old piece on binary search this morning — I missed it the first time around, so I was glad that it popped up in my blog reader. Tim’s taking some flak about data abstraction from people who don’t have his experience in high-performance environments, but what got my attention most was his mention of using gaps in a long array to provide efficient updates.

It turns out that this technique, called a gap buffer [wikipedia], is one of the cornerstones of text editors like Gnu Emacs. I’ve been using Emacs for 20 years and have contributed to the main distribution (see derived.el), but never bothered to look at the C code long enough to discover this particular technique. There’s surprisingly little information online — if anyone’s ever bothered to do testing for the optimum gap size, etc., it’s not showing up in Google — but it’s still nice to experience the joy and excitement of a new (to me), simple algorithm that solves a common problem well.
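For anyone else meeting the idea for the first time, here is a minimal sketch of a gap buffer in Python. It is not how Emacs implements it (that is C, and far more careful); the class name, method names, and the default gap size of 16 are all made up for illustration. The point is only that the free space travels to wherever the next edit happens, so a run of edits in one place costs almost nothing.

```python
class GapBuffer:
    """Toy gap buffer: a list of characters with a movable block of free slots.

    Illustrative only; names and the default gap size are assumptions.
    """

    def __init__(self, text="", gap_size=16):
        self._gap_size = max(1, gap_size)
        self._buf = list(text) + [None] * self._gap_size
        self._gap_start = len(text)      # first free slot
        self._gap_end = len(self._buf)   # first used slot after the gap

    def __len__(self):
        return len(self._buf) - (self._gap_end - self._gap_start)

    def _move_gap(self, pos):
        """Slide the gap so that it starts at logical position pos."""
        if not 0 <= pos <= len(self):
            raise IndexError("position out of range")
        while self._gap_start > pos:     # move gap left, one character at a time
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:     # move gap right
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def _grow(self):
        """Open a fresh gap at the current position once the old one is full."""
        self._buf[self._gap_start:self._gap_start] = [None] * self._gap_size
        self._gap_end += self._gap_size

    def insert(self, pos, text):
        self._move_gap(pos)
        for ch in text:
            if self._gap_start == self._gap_end:
                self._grow()
            self._buf[self._gap_start] = ch
            self._gap_start += 1

    def delete(self, pos, count=1):
        """Delete up to count characters starting at pos by widening the gap."""
        self._move_gap(pos)
        count = min(count, len(self._buf) - self._gap_end)
        self._gap_end += count

    def text(self):
        return "".join(self._buf[:self._gap_start] + self._buf[self._gap_end:])
```

Inserting at the same spot repeatedly only fills slots in the gap; the expensive part is moving the gap a long way, or growing it when it runs out, which is exactly why the choice of gap size looks like an interesting thing to measure.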

Does anyone have pointers to more detailed research on gap buffers? It seems to me that they’d have applications far beyond text editing, including (perhaps) storing compiled tree data (aka binary xml) on disk.

How many environments?

Tuesday, May 2nd, 2006

Assume that you are a lone developer, maintaining a small web site in a shared hosting account. How many software environments do you need from development to production?

One environment

On the simplest level, you could develop directly in your ISP account, loading and saving files remotely via SFTP, WebDav, etc. — in other words, your development and production environment would be the same. For anything non-trivial, that’s a pretty hairy way to work, since you have no way to test changes before they’re rolled out into the world.

Two environments

I normally use the two-environment approach that (I suspect) is the most common one for single-developer LAMP sites: I maintain a development environment on my notebook, and periodically upload changes to the production environment at the ISP. I try to run roughly the same version of Apache, PHP, MySQL, etc. as my ISP, but otherwise, I take no special steps to replicate the production environment. On my notebook, I set up the development directory as its own virtual host (e.g. http://localhost:8001/, etc.) so that I can test changes literally as I type.

Three environments?

Even though there are no other developers working with me right now, I sometimes wonder if it would make sense to start using a third environment between development and production (a separate directory and virtual host on my notebook). A third environment would allow me to run major experiments and restructuring in the development code, while still making small bug fixes, typo corrections, etc. in the stable code before uploading them to my ISP production environment.

While this sounds like a good idea initially, there is a major coordination problem involved in backporting fixes from the middle environment to the development environment, and the middle environment will still become unstable while new changes are rolled out into it, tempting me to create another environment — complexity is, sadly, highly contagious. Have any other lone developers had success (or failure) with this approach?

More

Big organizations use an enormous number of environments to build and roll out a system:

  • Each developer’s desktop, where code generally lives for a few hours.
  • The development server, typically a single server running database, application server, etc. as well as version control and unit/regression tests.
  • One or more test environments, covering integration testing, system testing, user-acceptance testing, etc. (these can range from single servers to small clusters to near-duplicates of the full production environment).
  • The staging environment, which is typically very similar or identical to the production environment.
  • The production environment, where the system runs.

I’m still undecided about whether enterprises help or hurt themselves by making things so complicated: coordinating a lot of people on a big system is hard, but it’s even harder to imagine an agile process functioning under so many layers of pain.

Two small, useful Nautilus shell scripts

Wednesday, March 29th, 2006

If you use a Unix-family operating system with the Gnome desktop and its default Nautilus file browser, you might know that you can extend Nautilus using simple shell scripts. Here are two short and simple scripts.

Terminal window

This script, which I saved in my ~/.gnome2/nautilus-scripts/ directory as Shell, pops up a terminal window already set to the directory you’re browsing. If you want to do anything too complicated for Nautilus (or too tedious to do using a mouse), this is much more convenient than manually opening a shell window and changing the directory you’re already browsing:

#!/bin/sh

/usr/bin/gnome-terminal

Software build

This script, which I saved in my ~/.gnome2/nautilus-scripts/ directory as Make, builds a Makefile-based application inside Gnu Emacs, so that you can easily step through any errors in the source files (it would be easy to modify this to use Apache Ant or something similar):

#!/bin/sh

/usr/bin/emacs --eval '(compile "/usr/bin/make")'

It wouldn’t be too hard to rig up a variant of this to do well-formedness checking and validation of XML documents.
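As a rough idea of what the well-formedness half of that variant could look like, here is a sketch written as a Python script rather than a shell one-liner (Nautilus will run any executable placed in the scripts directory). It relies on the NAUTILUS_SCRIPT_SELECTED_FILE_PATHS environment variable, which Nautilus sets to a newline-separated list of the selected local files; the function names are hypothetical, and it does no validation against a DTD or schema.

```python
#!/usr/bin/env python3
"""Sketch: report well-formedness errors for files selected in Nautilus."""
import os
import xml.parsers.expat


def well_formedness_error(path):
    """Return None if the file is well-formed XML, else a one-line message."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        with open(path, "rb") as handle:
            parser.ParseFile(handle)
    except xml.parsers.expat.ExpatError as err:
        reason = xml.parsers.expat.ErrorString(err.code)
        return f"{path}: line {err.lineno}, column {err.offset}: {reason}"
    return None


def selected_paths(environ=None):
    """Split the newline-separated list that Nautilus passes to scripts."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NAUTILUS_SCRIPT_SELECTED_FILE_PATHS", "")
    return [line for line in raw.splitlines() if line]


if __name__ == "__main__":
    # Nautilus does not display stdout, so a real script would probably
    # hand these messages to a dialog or to an Emacs compilation buffer.
    for path in selected_paths():
        message = well_formedness_error(path)
        print(message if message else f"{path}: well-formed")
```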

Simple is beautiful

I wasn’t lying when I wrote that these are short and simple scripts — it’s hard to believe how useful they are until you actually use them for a couple of days. It’s possible to do much more elaborate things with Nautilus and shell scripts, including operating on files selected in the GUI window, but as usual in tech, the biggest benefit comes from the lowest-hanging fruit.

Does anyone else have any nice 1- or 2-liners? I assume that KDE’s file browser has similar functionality, so scripts from there would also be interesting.

PHP, XML, and Unicode

Wednesday, March 1st, 2006

Update: in a comment John Cowan points out the obvious, that a UTF-8 escape sequence can never contain an ASCII character (because the high bit is always set, as I knew but failed to register). As a result, my xml_escape() function is way over-complicated. Thanks, John.

Update #2: in a comment, Jirka Kosek points out that PHP5 is actually using the also-excellent libxml instead of Expat — the PHP developers actually ported the expat-based, low-level interface to libxml so that it wouldn’t break legacy code. In that case, I’m especially impressed that my script produces byte-for-byte identical output with PHP4 and PHP5. I’m still looking for a problem with PHP’s XML+Unicode handling (other than the inconvenience of working with UTF-8 on the byte level).

Update #3: here’s a good summary of XML support in PHP5

A couple of weeks ago, Tim Bray posted about PHP and received a firestorm of comments, just as I did when I posted about PHP and Ruby on Rails almost a year ago. PHP generates a lot of passion, for good or for ill: my posting still gets a new comment every week or two.

As Tim updated his posting with comments, he linked to a two-year-old posting by Steve Minutillo about PHP4’s inability to detect character encodings in XML files and other Unicode bugs. That caught me by surprise — after all, PHP uses the venerable Expat as its XML parsing engine (the same engine used in most programming environments other than Java), and if Expat wasn’t getting things right, then the PHP people must have gone way out of their way to misconfigure it.

Testing Unicode support

To test XML character-encoding support in PHP, I used two PHP versions: 4.4.0, and 5.0.5 (which happen to be the current PHP4 and PHP5 heads in Ubuntu). I wrote a simple identity transform script (available for download at http://www.megginson.com/Software/xml-identity-transform.php — please consider it Public Domain) to read an XML file and write a simplified version of it back out again (I forgot to include processing instructions — sorry. I’ll fix that later.) The script always produces UTF-8 output, regardless of the input encoding. I ran it under both PHP4 and PHP5 against two XML source files with accented characters: one encoded in UTF-8, and the other encoded in ISO-8859-1 (with a suitable XML declaration). The script produces identical and correct UTF-8 output under both PHP4 and PHP5 (at least, the versions I tested). There is no conditional code based on the PHP version, but I did have to set a couple of options carefully.

Setting up a PHP XML parser

Here’s how I set up my XML parser in PHP:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

The first line creates the parser (I’m not using Namespaces for this example, or it would look a little different.) The second line requests that the parser report element names, attribute names and values, content, and everything else to my application using UTF-8, no matter what the input encoding was. The final option undoes a mind-numbingly stupid default in PHP, where all element and attribute names are converted to upper case before being passed on.

Next, I register my event handlers with the parser (this step should be familiar to anyone who has ever programmed with Expat or SAX):

xml_set_element_handler($parser, 'start_element', 'end_element');
xml_set_character_data_handler($parser, 'character_data');

The handlers themselves are naively simple, attempting to recreate the XML markup reported to them:

function start_element ($parser, $name, $atts)
{
  echo("<$name");
  foreach ($atts as $aname => $avalue) {
    echo " $aname=\"" . xml_escape($avalue) . '"';
  }
  echo(">");
}

function end_element ($parser, $name)
{
  echo("</$name>");
}

function character_data ($parser, $data)
{
  echo(xml_escape($data));
}

The only complicated bit happens in the xml_escape function. Unfortunately, since I’m dealing with raw UTF-8, I have to know a bit about UTF-8 encoding to do the escaping; otherwise, my code might mistake part of a multi-byte escape sequence for an ampersand and replace it with an entity reference (note: this is all unnecessary; see John Cowan’s comment):

function xml_escape ($s)
{
  // Square-bracket string offsets are used here because the older
  // curly-brace form ($s{$i}) is rejected by recent PHP versions.
  $result = '';
  $len = strlen($s);
  for ($i = 0; $i < $len; $i++) {
    if ($s[$i] == '&') {
      $result .= '&amp;';
    } else if ($s[$i] == '<') {
      $result .= '&lt;';
    } else if ($s[$i] == '>') {
      $result .= '&gt;';
    } else if ($s[$i] == '\'') {
      $result .= '&apos;';
    } else if ($s[$i] == '"') {
      $result .= '&quot;';
    } else if (ord($s[$i]) > 127) {
      // skipping UTF-8 escape sequences requires a bit of work
      if ((ord($s[$i]) & 0xf0) == 0xf0) {
        // lead byte of a four-byte sequence
        $result .= $s[$i++];
        $result .= $s[$i++];
        $result .= $s[$i++];
        $result .= $s[$i];
      } else if ((ord($s[$i]) & 0xe0) == 0xe0) {
        // lead byte of a three-byte sequence
        $result .= $s[$i++];
        $result .= $s[$i++];
        $result .= $s[$i];
      } else if ((ord($s[$i]) & 0xc0) == 0xc0) {
        // lead byte of a two-byte sequence
        $result .= $s[$i++];
        $result .= $s[$i];
      }
    } else {
      $result .= $s[$i];
    }
  }
  return $result;
}
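John Cowan’s point from the update at the top is easy to check: every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can be mistaken for “&”, “<”, or a quote, and plain byte-level replacement is safe. Here is a small sketch of that simpler approach, in Python rather than PHP purely so that it runs standalone; the names ESCAPES, xml_escape_utf8, and multibyte_bytes_have_high_bit are invented for the example.

```python
# Byte-level XML escaping of UTF-8 data, with no knowledge of
# multi-byte sequences. "&" has to be handled first so that the
# entities produced by the later replacements are not escaped again.
ESCAPES = (
    (b"&", b"&amp;"),
    (b"<", b"&lt;"),
    (b">", b"&gt;"),
    (b"'", b"&apos;"),
    (b'"', b"&quot;"),
)


def xml_escape_utf8(data: bytes) -> bytes:
    """Escape XML-special characters in a UTF-8 encoded byte string."""
    for raw, entity in ESCAPES:
        data = data.replace(raw, entity)
    return data


def multibyte_bytes_have_high_bit(text: str) -> bool:
    """Check that every byte of every non-ASCII character is >= 0x80."""
    return all(
        byte & 0x80
        for ch in text
        if ord(ch) > 127
        for byte in ch.encode("utf-8")
    )
```

Non-ASCII characters pass through untouched because the replacement never matches any of their bytes, which is the whole reason the longer function above is over-complicated.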

The rest of my code is just the normal Expat parsing loop: open a file (or URL), feed it to Expat in buffered chunks, and then report that the input is finished.
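For readers who have not written that loop before, here is a sketch of the same shape using Python’s binding to Expat (xml.parsers.expat) instead of the PHP one. It is not my PHP script: identity_transform and the 4096-byte chunk size are made up for the example, and Python hands the handlers decoded strings, so the output is encoded as UTF-8 only at the very end.

```python
import io
import xml.parsers.expat
from xml.sax.saxutils import escape, quoteattr


def identity_transform(source, chunk_size=4096):
    """Read XML from a binary file-like object; return simplified UTF-8 XML."""
    out = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # coalesce adjacent character data

    def start_element(name, attrs):
        atts = "".join(f" {key}={quoteattr(value)}" for key, value in attrs.items())
        out.append(f"<{name}{atts}>")

    def end_element(name):
        out.append(f"</{name}>")

    def character_data(data):
        out.append(escape(data))

    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = character_data

    # The usual Expat loop: feed buffered chunks, then signal end of input.
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)
    parser.Parse(b"", True)

    return "".join(out).encode("utf-8")
```

Feeding it the same document once in UTF-8 and once in ISO-8859-1 (with a matching XML declaration) should give byte-for-byte identical output, which is the same comparison I ran against PHP4 and PHP5.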

So where are the problems?

  • There may be huge problems that I somehow missed in my brief test.
  • The PHP XML documentation is not entirely clear about input and output character encodings, probably because the documentation writers were themselves a bit confused about this stuff.
  • It is possible (even likely) that bugs existed in both the PHP4 and PHP5 codebases two years ago when Steve wrote his piece, but have since been fixed.
  • It is a bit tricky working with UTF-8, since you have to remember to detect escape sequences. A PHP library would be nice. Or better yet, hide it completely, like Java does. Still, it’s only a nuisance, not a show-stopper.
  • Steve referred to the PHP XML parser’s mangling numeric character references. Expat doesn’t do that. However, it is possible that people think numerical character references refer to their current encoding, rather than to the abstract Unicode character set, and that will get them into serious trouble.
  • Expat does not support all character encodings out of the box. In fact, XML parsers are required to support only UTF-8 and UTF-16 — use any other encoding (even ISO-8859-1) at your peril, since there’s no guarantee that other XML software will be able to read it.
  • People often forget to declare what encoding they’re using.
  • Anyone who serves XML documents as text/xml is going to get in trouble no matter what language people use, because of the reencoding that might take place.

Most of these problems are not unique to PHP — XML is hard and confusing, Unicode is hard and confusing, and when you put the two together, there’s lots of opportunity for human error.

I’d be interested in the URLs of well-formed XML documents in supported encodings (UTF-8, UTF-16, US-ASCII, or ISO-8859-1, I think) that do not work properly in recent versions of PHP4 or PHP5 with the simple identity-transformation script I posted. If there are deep problems with PHP, XML, and Unicode, rather than just user confusion, I’d like to know about them.

Scanning to PDF in Linux

Thursday, January 26th, 2006

I scan documents for two main reasons:

  1. to have backup copies of my airplane’s technical logs (a plane can lose tens of thousands of dollars of value if the logs are lost); and
  2. to allow me to submit expense claims to customers by e-mail, using scanned receipts.

It’s very easy to scan individual pages to just about any format in Linux using graphical frontends like XSane or The Gimp, but when there’s more than one page, nothing beats PDF for ease of use at the receiver’s end (especially when you’ll be sending the file to an admin assistant running Windows and reading e-mail in Outlook). After a bit of experimentation, I found a few steps that actually work:

  • In the XSane preview window, preset the area to Letter size, choosing any resolution you want (150 or 300 dpi are probably the best choices).
  • Save your scans in the format of your choice.
  • Use the convert utility from ImageMagick to merge all of the scanned pages into a Postscript file. It is critical to use the -density option with your scan DPI so that the pages come out the right size, e.g. “convert -density 150 *.tiff output.ps”.
  • Use the ps2pdf utility from Ghostscript to convert the Postscript file to PDF, e.g. “ps2pdf output.ps output.pdf”.
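The last two steps are easy to wrap in a small script so that the density never gets forgotten. Here is a sketch in Python that just assembles and runs the same two commands; build_commands and scans_to_pdf are hypothetical names, and it assumes the scans are saved as *.tiff files whose names sort into page order.

```python
import subprocess
from pathlib import Path


def build_commands(pages, dpi=150, ps_name="output.ps", pdf_name="output.pdf"):
    """Return the convert and ps2pdf command lines for the given page files."""
    convert_cmd = ["convert", "-density", str(dpi), *[str(p) for p in pages], ps_name]
    ps2pdf_cmd = ["ps2pdf", ps_name, pdf_name]
    return [convert_cmd, ps2pdf_cmd]


def scans_to_pdf(scan_dir=".", pattern="*.tiff", dpi=150):
    """Merge every scan in scan_dir into output.pdf (requires both tools)."""
    pages = sorted(p.name for p in Path(scan_dir).glob(pattern))
    if not pages:
        raise FileNotFoundError(f"no files matching {pattern} in {scan_dir}")
    for command in build_commands(pages, dpi=dpi):
        # check=True stops at the first failing step
        subprocess.run(command, check=True, cwd=scan_dir)


# Example: scans_to_pdf("/path/to/scans", dpi=300)
```

The dpi argument has to match the resolution chosen in XSane, for the same reason the -density option matters on the command line.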

I’ve tried many other approaches (including using the libtiff utilities with all compression options, and using convert to go straight to PDF), and they all result in either huge or malformed PDF files. This is the one approach that works for me.

There must be a tool out there, GUI or command line, that will allow me to batch scan multipage documents straight into PDF without all this messing around. I haven’t found it, but I’ll be happy to hear about such a tool in comments.