Crow by Patrick Wilken

Character encoding confusion

Crafted in California by Tim Trueman   (Mon Sep 15 00:00:00 -0700 2008)

As you can see above, my photography skills have gone up, although that seems to be inversely correlated with my programming skills. And that's what this is about: my inexperience with the clusterfuck that is character encoding (sadly, we haven't come a long way since ASCII was invented in 1963).

Alright, so there I was at Open Hack 2008. I work for Yahoo! so I wasn't allowed to compete, but I figured I'd keep everyone company and work on a personal project. It's a simple service I will be releasing shortly (I hope). I copied the code I already had over to my laptop using my Linux desktop's HTTP server.

I started coding and after a few minutes I went to run my project. I was met with a friendly unexpected T_CONSTANT_ENCAPSED_STRING. In the dozen lines of code I had, I couldn't see the issue. I figured I was just missing something stupid so I posted to Twitter asking for help. I got a quick reply and headed downstairs to take a look at the issue. We tried to isolate the issue on the line throwing the error:

$ php -r "$subject = $argv[1];"
Parse error: syntax error, unexpected '=' on line 1

What the hell? I'll give you a million dollars* if you can spot the issue (ignore $argv[1] being undefined, that's not it). About an hour and a dozen engineers later (including Rasmus) we solved the issue. Here's what the code really looked like after I opened it in TextEdit.

$ php -r "$subject¬=¬$argv[1];"
Parse error: syntax error, unexpected '=' on line 1

Note the negation symbol before and after the equals sign. What amazes me is that none of the editors I had used before (TextMate, vim, nano) had caught the invisible gremlins.
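A minimal sketch of how gremlins like these can be flushed out without trusting an editor to display them. It is written in Python rather than PHP, the function name find_non_ascii is made up for illustration, and the sample line assumes the invisible characters were non-breaking spaces (U+00A0), which is a guess rather than something confirmed above.

```python
import unicodedata


def find_non_ascii(source):
    """Return (line, column, code point, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col_no, f"U+{ord(ch):04X}", name))
    return hits


# Hypothetical sample: the "spaces" around "=" are assumed to be U+00A0, not U+0020.
sample = "$subject\u00a0=\u00a0$argv[1];"

for hit in find_non_ascii(sample):
    print(hit)
# (1, 9, 'U+00A0', 'NO-BREAK SPACE')
# (1, 11, 'U+00A0', 'NO-BREAK SPACE')
```

Run against a clean line with ordinary spaces, the same function returns an empty list, which makes it a cheap check before blaming the parser.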

Here's what I think happened (I'm still not 100% sure). I wrote the original code in gedit, which I verified saves as UTF-8 by default (as it should). I then transferred the code to my laptop via an HTTP server on my desktop. Since I didn't set a charset, Safari chose one for me: Latin-1 or, to use its catchier name, ISO-8859-1. Beyond the first 128 character values (0 through 127, the ASCII range), most charsets differ, and this conversion produced gremlins that I could see when I pasted my code into a new document and saved it as UTF-8. Anyone who's had to work with languages that need multibyte encodings, such as Chinese, is already aware of this. They've probably seen more mojibake than they ever wanted to.
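The mechanism is easy to reproduce. The sketch below uses Python and a made-up sample string (not the original code) to show UTF-8 bytes being misread as Latin-1, the damage being baked in by re-saving as UTF-8, and the usual repair when nothing has been lost along the way.

```python
# Hypothetical sample text; any non-ASCII character will do.
original = "caf\u00e9"  # café

# Saved as UTF-8: the é becomes two bytes, 0xC3 0xA9.
utf8_bytes = original.encode("utf-8")

# Read back as Latin-1: every byte becomes its own character.
misread = utf8_bytes.decode("latin-1")
print(misread)  # cafÃ©

# Re-saving the misread text as UTF-8 makes the gremlins permanent.
baked_in = misread.encode("utf-8")

# Repair: undo the wrong decode, then apply the right one.
repaired = baked_in.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)  # True

# Plain ASCII is byte-for-byte identical in both encodings,
# which is why the rest of a mostly-ASCII file looks fine.
print("$subject".encode("utf-8") == "$subject".encode("latin-1"))  # True
```

The repair only works because Latin-1 maps every byte to some character; a charset with undefined byte values could lose information on the way through.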

The core problem

Back in the day I read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". I was young and naive; I promptly brushed it off as if it would never apply to me.

My mistake…

The root of the problem is this: most web developers are unaware of character encoding issues, so they forget to specify an encoding. Then it's up to whatever software is on the receiving end to pick one for them.

For email, most clients finally support UTF-8, and if they don't, they should be taken out back and shot. Then stabbed with a spork. If I don't see this in the header I'm sure somewhere in the world a tiny, cute, fuzzy kitten dies:

Content-Type: text/plain; charset="UTF-8"
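For mail generated by a script, the header does not have to be typed by hand. A minimal sketch using Python's standard email package; the addresses, subject, and body are placeholders, not anything from a real system.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "me@example.com"    # placeholder address
msg["To"] = "you@example.com"     # placeholder address
msg["Subject"] = "Kittens"

# Declaring the charset here sets the Content-Type header on the message.
msg.set_content("Na\u00efve caf\u00e9", charset="utf-8")

print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # utf-8
```

Printing the whole message with print(msg) shows the generated headers, including the Content-Type line, ahead of the encoded body.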

On the web, you should always specify your charset, like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
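The meta tag only helps for HTML. For plain text files served over HTTP, like the code copied between machines earlier, the charset belongs in the Content-Type response header instead. The following is a sketch of one way to do that with Python's built-in http.server, not the setup actually used on the desktop in this story; with_charset, Utf8Handler, and serve are illustrative names, and it assumes every text file in the served directory really is UTF-8.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


def with_charset(content_type, charset="utf-8"):
    """Append a charset parameter to text/* types that don't already have one."""
    if content_type.startswith("text/") and "charset=" not in content_type:
        return f"{content_type}; charset={charset}"
    return content_type


class Utf8Handler(SimpleHTTPRequestHandler):
    """Serve files from the current directory, labelling text files as UTF-8."""

    def guess_type(self, path):
        # The parent class guesses the MIME type from the file extension;
        # this only adds the charset parameter on top of that guess.
        return with_charset(super().guess_type(path))


def serve(port=8000):
    """Blocks until interrupted; the port number is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), Utf8Handler).serve_forever()
```

Calling serve() from the directory to be shared starts the server. Files whose extension the standard library does not recognise as text are left untouched, so the browser may still have to guess for those.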

Use UTF-8 if you want to save cute kittens. I'm sure that a few years ago most documents that didn't specify a charset were Latin-1, but today I hope that's not the case. Here's how to change the default charset in your browser:

(Mac) Safari: Preferences > Appearance > Default Encoding: Western (ISO Latin 1) => Unicode (UTF-8)

(Windows/Linux) Firefox: Edit > Preferences > Content > Fonts & Colors > Advanced > Default Character Encoding: Western (ISO-8859-1) => Unicode (UTF-8)

I still can't believe how much time charsets can cause you to lose. There should be some sort of awareness campaign with charity runs and those plastic yellow bracelets. At least I had a great time at Hack Day and I learned a ton of awesome command line and vim tricks (thanks James).

*Did I say a million dollars? I meant grains of sand.

fin
