October 23, 2011 Do Not Fall Back to UTF-8
Everyone knows that UTF-8 is now universal; it is so universal that about half of all webpages collected by Google were in UTF-8 by 2010. This success is certainly backed by the ASCII compatibility of UTF-8 (and thus its ability to encode many Latin letters without further overhead), which UTF-16 does not have. But that does not mean you can use UTF-8 for everything.
Case study: You are writing an IRC client. IRC (Internet Relay Chat) is a very old chat protocol that is still kicking around. While IRC was not designed with Unicode in mind [1], it is a text-based protocol, and the bytes of UTF-8 multibyte characters never overlap with ASCII characters like @ or !, which are special in IRC. As a result, UTF-8 is quite common in IRC nowadays, and even if the default encoding is not UTF-8, it seems reasonable to use UTF-8 for every character that is not in that default encoding. Right?
No. You are dead wrong. It may make sense when you are limited to Western charsets like ISO 8859-1. There, you can be quite sure that a two-byte sequence that looks like UTF-8 really is UTF-8, even when it is found in otherwise ISO 8859-1 text: the first byte of such a sequence would have to be an accented uppercase Latin letter, and the second byte would be a special character, an unused byte, or one of a handful of accented letters added by Windows-1252. Such a pair rarely occurs in real text. But this assumption breaks down when most of the characters in use fall outside the ASCII-compatible region. Let’s see how devastating this problem can be.
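Before moving on, the ISO 8859-1 claim can be checked with a short sketch. It is only an illustration: it assumes Python's built-in `latin-1` and `cp1252` codecs as stand-ins for ISO 8859-1 and Windows-1252, and the variable names are made up for this example.

```python
# Bytes that can start or continue a two-byte UTF-8 sequence.
LEAD_BYTES = range(0xC2, 0xE0)   # C2..DF
TRAIL_BYTES = range(0x80, 0xC0)  # 80..BF

# What the lead bytes mean when read as ISO 8859-1.
lead_chars = [bytes([b]).decode("latin-1") for b in LEAD_BYTES]
not_uppercase = [c for c in lead_chars if not c.isupper()]

# What the trail bytes mean when read as Windows-1252.
# Some of these bytes are unassigned, so decoding them fails.
trail_chars = []
unassigned = []
for b in TRAIL_BYTES:
    try:
        trail_chars.append(bytes([b]).decode("cp1252"))
    except UnicodeDecodeError:
        unassigned.append(b)
trail_letters = [c for c in trail_chars if c.isalpha()]

print("lead bytes:   ", "".join(lead_chars))
print("not uppercase:", not_uppercase)
print("trail letters:", "".join(trail_letters))
print("unassigned:   ", [f"{b:02X}" for b in unassigned])
```

Every lead byte turns out to be an uppercase letter except × and ß, and only a small minority of the trail bytes are letters at all, so an uppercase accented letter followed by one of these trail characters is an unusual thing to see in Western text.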
In Korea, EUC-KR was the dominant legacy encoding before UTF-8, and until very recently the major Korean IRC networks supported only EUC-KR. The Hangul syllables, which make up most of the multibyte characters used in Korean, are encoded from B0 A1 to C8 FE. As two-byte UTF-8 sequences run from C2 80 to DF BF, roughly 9% of EUC-KR Hangul syllables and 11% of two-byte UTF-8 sequences overlap with each other! As a result, many Korean words can be incorrectly guessed to be meaningless UTF-8 text: some common examples include “킁킁”, “특징”, “치킨”, “친척”, “황홀” and so on. This is clearly unacceptable for Korean users, as they cannot recognize the resulting UTF-8 text “ůů”, “Ư¡”, “ġŲ”, “ģô” and “ȲȦ” at all.
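The percentages above follow from simple counting, and the garbled examples can be reproduced directly. The following is a sketch that assumes Python's built-in `euc_kr` codec; it counts only the Hangul syllable block of EUC-KR (B0 A1 to C8 FE) against two-byte UTF-8 sequences, and the names are invented for illustration.

```python
# EUC-KR Hangul syllables: lead byte B0..C8, trail byte A1..FE.
hangul = {(lead, trail)
          for lead in range(0xB0, 0xC9)
          for trail in range(0xA1, 0xFF)}

# Two-byte UTF-8 sequences: lead byte C2..DF, trail byte 80..BF.
utf8_two_byte = {(lead, trail)
                 for lead in range(0xC2, 0xE0)
                 for trail in range(0x80, 0xC0)}

# Byte pairs that are valid in both encodings.
overlap = hangul & utf8_two_byte

print(len(hangul), len(utf8_two_byte), len(overlap))
print(f"{len(overlap) / len(hangul):.1%} of Hangul syllables")
print(f"{len(overlap) / len(utf8_two_byte):.1%} of two-byte UTF-8 sequences")

# Encode each word as EUC-KR, then misread the same bytes as UTF-8.
words = ["킁킁", "특징", "치킨", "친척", "황홀"]
garbled = [w.encode("euc_kr").decode("utf-8") for w in words]
print(garbled)
```

The overlap is the rectangle of lead bytes C2..C8 and trail bytes A1..BF, that is 7 × 31 = 217 byte pairs out of 2350 Hangul syllables and 1920 two-byte UTF-8 sequences.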
In fact, this kind of problem can appear with any EUC encoding: EUC-KR (commonly used in the form of Code Page 949), EUC-JP (the long-time standard on Unices but not elsewhere), EUC-CN (commonly used in the form of GBK) and EUC-TW (rarely used). Fortunately (or unfortunately), the Japanese gave up on using multibyte encodings in IRC much earlier and used ISO-2022-JP instead, though UTF-8 is now becoming increasingly popular. But I guess Chinese IRC networks have exactly the same problem. [2] Unfortunately, there are still many IRC clients that have this problem.
The moral of this story is that you cannot harmonize UTF-8 with other legacy encodings. If you ever do have to use a legacy encoding, use that encoding and nothing else (including UTF-8). Hence the title: Do Not Fall Back to UTF-8. Ever.
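In code, the difference between the two policies is small. The sketch below is hypothetical: `decode_with_utf8_guess` and `decode_configured` are names invented here, not functions from any real IRC client, and the guessing order (try UTF-8, then the configured encoding) is just one common way such a heuristic might be written.

```python
def decode_with_utf8_guess(raw: bytes, configured: str) -> str:
    """The heuristic argued against: accept anything that parses as UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(configured, errors="replace")


def decode_configured(raw: bytes, configured: str) -> str:
    """Use the configured legacy encoding and nothing else."""
    return raw.decode(configured, errors="replace")


# A short, perfectly ordinary chat line sent on an EUC-KR network.
message = "치킨".encode("euc_kr")

print(decode_with_utf8_guess(message, "euc_kr"))  # garbled
print(decode_configured(message, "euc_kr"))       # the original word
```

Note that the guessing version only misfires when every multibyte character in the line happens to fall in the overlapping range; a longer line such as "치킨 먹자" fails UTF-8 validation and is decoded correctly. That makes the bug intermittent, and short messages are exactly where it shows up.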
[1] IRC is as old as the Unicode standard; both emerged in the late 1980s. And Unicode was not quite stable until version 2.0 (1996), from which point all existing characters are guaranteed not to change. That’s maybe why Windows 95 (1995) chose to use Unicode 2.0, even though it was still in development at that time.
[2] I don’t really know about the encoding usage in Chinese IRC networks, though I do know that ISO-2022-CN is not widespread.
