메아리 저널

Tag: english

Do Not Fall Back to UTF-8

Everyone knows that UTF-8 is now universal; it is so universal that by 2010 half of all web pages collected by Google were in UTF-8. This success is certainly backed by the ASCII compatibility of UTF-8 (and thus its ability to encode many Latin letters without further overhead), which UTF-16 certainly does not have. But that does not mean you can use UTF-8 for everything.

Case study: You are writing an IRC client. IRC (Internet Relay Chat) is a very ancient chat protocol that is still kicking around. While IRC was not designed with Unicode in mind[1], it is a text-based protocol, and no byte of a UTF-8 multibyte character overlaps with ASCII characters like @ or ! which are special in IRC. As a result, UTF-8 in IRC is quite common nowadays, and even if the default encoding is not UTF-8, it seems reasonable to use UTF-8 for every character not in that default encoding. Right?

No. You are dead wrong. It may make sense when you are limited to Western charsets like ISO 8859-1. For example, you can be quite sure that a two-byte sequence that looks like UTF-8 really is UTF-8 even when it is found in ISO 8859-1 text: the first byte of such a sequence would be an accented uppercase Latin letter, and the second byte would be a special character, an unused byte, or one of a handful of accented letters added by Windows-1252. But this assumption breaks down when most of the characters used do not fall in the ASCII-compatible region. Let’s see how devastating this problem can be.

In Korea, EUC-KR was the dominant legacy encoding before UTF-8, and until very recently the major Korean IRC networks supported only EUC-KR. The Hangul syllables, which make up most of the multibyte characters used in Korean, are encoded from B0 A1 to C8 FE. As two-byte UTF-8 sequences run from C2 80 to DF BF, roughly 9% of EUC-KR Hangul syllables and 11% of two-byte UTF-8 sequences overlap with each other! As a result, many Korean words can be incorrectly guessed to be meaningless UTF-8 text; some common examples include “킁킁”, “특징”, “치킨”, “친척”, “황홀” and so on. This is clearly unacceptable for Korean users, as they cannot recognize the resulting UTF-8 text “ůů”, “Ư¡”, “ġŲ”, “ģô” and “ȲȦ” at all.
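The overlap figures above can be reproduced with a little arithmetic. The following is only a sketch in Python, not code from any real IRC client: the function names `overlap_stats` and `naive_decode` are made up for illustration, and `naive_decode` models the flawed "try UTF-8 first, otherwise use the legacy encoding" guess discussed here. It assumes Python's standard `euc_kr` codec.

```python
def overlap_stats():
    """Count byte pairs that are both an EUC-KR Hangul syllable and a
    valid two-byte UTF-8 sequence (ranges taken from the text above)."""
    euc_kr_hangul = {(a, b) for a in range(0xB0, 0xC9) for b in range(0xA1, 0xFF)}
    utf8_two_byte = {(a, b) for a in range(0xC2, 0xE0) for b in range(0x80, 0xC0)}
    common = euc_kr_hangul & utf8_two_byte
    return len(euc_kr_hangul), len(utf8_two_byte), len(common)


def naive_decode(raw, legacy="euc_kr"):
    """Hypothetical fallback guess: treat the bytes as UTF-8 if they happen
    to be valid UTF-8, otherwise decode them with the legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)


if __name__ == "__main__":
    hangul, two_byte, common = overlap_stats()
    print(f"{common} of {hangul} Hangul pairs ({common / hangul:.1%})")
    print(f"{common} of {two_byte} two-byte UTF-8 pairs ({common / two_byte:.1%})")
    # A word whose EUC-KR bytes are not valid UTF-8 survives the guess...
    print(naive_decode("한글".encode("euc_kr")))
    # ...but a word inside the overlapping region is silently misread.
    print(naive_decode("킁킁".encode("euc_kr")))
```

The overlap is 217 byte pairs out of 2350 Hangul syllables and 1920 two-byte UTF-8 sequences, which is where the rough 9% and 11% come from. Note that the guess fails without raising any error, so the client has no way to notice.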

In fact, this kind of problem can appear in any EUC encoding: EUC-KR (commonly used as a form of Code Page 949), EUC-JP (long-time standard for Unices but not others), EUC-CN (commonly used as a form of GBK) and EUC-TW (rarely used). Fortunately (or unfortunately), Japanese users gave up on using multibyte encodings in IRC much earlier and used ISO-2022-JP instead, though UTF-8 is now becoming increasingly popular. But I guess Chinese IRC networks have exactly the same problem.[2] Unfortunately, there are still many IRC clients that have this problem.

The moral of this story is that you cannot harmonize UTF-8 with other legacy encodings. If you ever have to use a legacy encoding, use that encoding and nothing else (including UTF-8). Hence the title: Do Not Fall Back to UTF-8. Ever.


  1. IRC is as old as the Unicode standard; both emerged in the late 1980s. And Unicode was not quite stable until version 2.0 (1996), when all existing characters were guaranteed not to change. That may be why Windows 95 (1995) chose to use Unicode 2.0, even though it was still in development at that time. 

  2. I don’t really know about the encoding usage in Chinese IRC networks, though I do know that ISO-2022-CN is not widespread. 

Spelt Number to Decimal (Updated)

This post was last updated on 2011-07-11 13:20 UTC.

I certainly like code obfuscation and golfing; recent examples include this and this. Today’s project is more like the former, where very short code takes much time to explain. Without further ado, here it is: (Gist)

This little program reads a spelt number from the standard input and writes the corresponding number to the standard output. It supports numbers up to 10^15-1 and still weighs only 243 bytes of C (256 bytes before the update), if you ignore the backslash at the end of a line, which just wraps the line. (There is also a shorter version that can handle up to 19,999,999 and does not use long long.) The input should be correct, although the program will handle “a” and “and” correctly and ignore some invalid words.

It assumes the ASCII character set and two’s complement representation, and requires int and long long to be at least 32 and 64 bits long respectively, but that’s all it expects from the compiler. For example, it does not matter whether char is signed or unsigned, EOF does not have to be -1, and so on. (Yes, I did keep it in mind while writing this program.)
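For readers who want to see the behavior described above without the golfing, here is a plain, readable sketch in Python. It is not a port of the C program and makes no attempt to mirror its tricks; the function name `spelt_to_int` and the word tables are my own assumptions. It treats “a” as one, and skips “and” along with any word it does not know.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

SMALL = {word: value for value, word in enumerate(ONES)}
SMALL.update({word: 20 + 10 * i for i, word in enumerate(TENS)})
SMALL["a"] = 1  # "a hundred", "a thousand"

SCALES = {"thousand": 10**3, "million": 10**6,
          "billion": 10**9, "trillion": 10**12}


def spelt_to_int(text):
    """Convert a spelt-out English number to an int (sketch only)."""
    total = 0  # value of the completed thousand/million/... groups
    group = 0  # value of the current group, always below 1000 for valid input
    for word in text.lower().replace("-", " ").split():
        if word in SMALL:
            group += SMALL[word]
        elif word == "hundred":
            group *= 100
        elif word in SCALES:
            total += group * SCALES[word]
            group = 0
        # "and" and unknown words are ignored
    return total + group


if __name__ == "__main__":
    print(spelt_to_int("one hundred and twenty-three"))
    print(spelt_to_int("a thousand"))
```

Like the original, this sketch expects correct input; something like “hundred hundred” is accepted and produces a meaningless value.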

While I left the explanation of the BF interpreter as an exercise for the reader, this time I’ll give a detailed explanation of the program. Keep in mind that there are two versions of the program; the explanation is primarily for the longer version.

Object Oriented CSS

I have already mentioned that CSS being declarative is a good thing, but I cannot understand why CSS does not have a sensible module mechanism. For example, you often have to use -moz-border-radius, -webkit-border-radius and -o-border-radius in addition to border-radius to ensure compatibility. How about background gradients? Both WebKit and Gecko support them, but with different syntaxes. Besides the irregularities among web browsers, you also want to reuse the same style definitions in several other classes without duplicating them, only to find that you cannot do so without updating the HTML.

There are several attempts to solve this annoying problem, including the well-known Sass, but what we really want is a solution that is built into CSS and accepted by web browsers. At first glance, Object Oriented CSS seems like total crap that forces you to write HTML in very verbose and hard-to-maintain ways, but it is in fact the best solution if you don’t want to use any server-side compilers. It is a tragedy, indeed.

ere2pcre

Yesterday someone asked me for a reasonable way to convert ereg* functions (of which he had a lot) to preg_* functions; while the ereg* functions have been a part of PHP for a long time, they are finally deprecated as of PHP 5.3.0 and it is time to move on. Unfortunately a quick search didn’t find any converter, so I made my own: ere2pcre.

ere2pcre is a drop-in replacement for ereg* functions. My sample implementation contains myereg* functions that can readily replace ereg* functions (to the extent that they emulate the exact interface of ereg* functions). Alternatively you can simply use the ere2pcre function to convert the ERE to the PCRE and do the remaining job yourself. Which way is better depends on your situation. It doesn’t support collating symbols ([.ch.]) or equivalence classes ([=ch=]), as PCRE doesn’t support them either, but who cares? ;)
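To give a feel for why the conversion is not just a matter of wrapping the pattern in delimiters, here is a partial sketch in Python. It is not the ere2pcre source and covers only two differences: the backslash (assuming the POSIX reading in which it makes the next character literal outside a bracket expression and is an ordinary character inside one) and the escaping of the PCRE delimiter. The name `ere_to_pcre_sketch` and its helper are hypothetical.

```python
def _bracket_end(ere, start):
    """Return the index just past the bracket expression starting at `start`."""
    i = start + 1
    if i < len(ere) and ere[i] == "^":
        i += 1
    if i < len(ere) and ere[i] == "]":
        i += 1  # a leading ] is a literal member of the set
    while i < len(ere):
        if ere.startswith(("[.", "[="), i):
            raise ValueError("collating symbols and equivalence classes are unsupported")
        if ere.startswith("[:", i):
            close = ere.find(":]", i + 2)
            if close < 0:
                raise ValueError("unterminated character class")
            i = close + 2
        elif ere[i] == "]":
            return i + 1
        else:
            i += 1
    raise ValueError("unterminated bracket expression")


def ere_to_pcre_sketch(ere, delimiter="/"):
    """Convert a POSIX ERE to a delimited PCRE pattern (partial sketch)."""
    out = []
    i = 0
    while i < len(ere):
        ch = ere[i]
        if ch == "\\":
            if i + 1 >= len(ere):
                raise ValueError("trailing backslash")
            literal = ere[i + 1]
            # In PCRE, backslash + letter or digit has a special meaning (\d, \1),
            # so alphanumerics are emitted bare and everything else is escaped.
            bare = literal.isalnum() or literal == "_"
            out.append(literal if bare else "\\" + literal)
            i += 2
        elif ch == "[":
            end = _bracket_end(ere, i)
            body = ere[i:end].replace("\\", "\\\\")  # backslash is literal here
            out.append(body.replace(delimiter, "\\" + delimiter))
            i = end
        elif ch == delimiter:
            out.append("\\" + delimiter)
            i += 1
        else:
            out.append(ch)
            i += 1
    return delimiter + "".join(out) + delimiter


if __name__ == "__main__":
    print(ere_to_pcre_sketch(r"[\d]+"))    # a set of backslash and "d", not digits
    print(ere_to_pcre_sketch(r"a/b\.c"))
```

For instance, the ERE `[\d]+` matches runs of backslashes and the letter d, so passing it to PCRE unchanged would quietly turn it into a digit matcher; the sketch doubles the backslash to keep the original meaning.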

The source code can be downloaded from Gist. The full source code follows:

Brainfuck interpreter in 2 lines of C

I already knew about twIP, a very small IP stack that can only respond to pings and also fits in a tweet, but recently I tried to act on the idea of making small yet still usable things like that.

Well, I ended up with a program slightly larger than a tweet, but it fits in two lines, i.e. 160 bytes. Let me introduce the world’s smallest Brainfuck interpreter[1] in C: (Gist)

Test it yourself! For example:

$ cc bf.c -o bf
$ ./bf '++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.'
Hello World!
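If you want something readable to check the output against, here is a straightforward reference interpreter sketched in Python. It has nothing to do with how the two-line C version works. The function name `run_bf` is made up, and a few behaviors are my own assumptions rather than anything the C program promises: cells are 8-bit and wrap around, the tape has 30000 cells and the pointer wraps at both ends, and reading past the end of the input stores 0.

```python
def run_bf(program, data=b"", tape_size=30000):
    """Run a Brainfuck program and return its output as bytes (sketch)."""
    # Pre-compute the matching position for every bracket.
    jump = {}
    stack = []
    for pos, op in enumerate(program):
        if op == "[":
            stack.append(pos)
        elif op == "]":
            if not stack:
                raise ValueError("unmatched ]")
            start = stack.pop()
            jump[start], jump[pos] = pos, start
    if stack:
        raise ValueError("unmatched [")

    tape = bytearray(tape_size)
    out = bytearray()
    ptr = pc = consumed = 0
    while pc < len(program):
        op = program[pc]
        if op == ">":
            ptr = (ptr + 1) % tape_size
        elif op == "<":
            ptr = (ptr - 1) % tape_size
        elif op == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif op == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif op == ".":
            out.append(tape[ptr])
        elif op == ",":
            if consumed < len(data):
                tape[ptr] = data[consumed]
                consumed += 1
            else:
                tape[ptr] = 0  # assumption: end of input reads as 0
        elif op == "[" and tape[ptr] == 0:
            pc = jump[pc]  # skip the loop body
        elif op == "]" and tape[ptr] != 0:
            pc = jump[pc]  # go back to the start of the loop
        pc += 1  # any other character is a comment
    return bytes(out)


if __name__ == "__main__":
    hello = ("++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++."
             ">++.<<+++++++++++++++.>.+++.------.--------.>+.>.")
    print(run_bf(hello).decode("ascii"), end="")
```

Running it on the program from the shell session above gives the same Hello World! line.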
