메아리 저널

Spelt Number to Decimal (Updated)​

This post is last updated in 2011-07-11 13:20 UTC.

EDIT: This turns out to be my winning IOCCC 2012 entry! See the full remarks here.

I certainly like a code obfuscation and golfing: the recent example includes this and this. The today’s project is more like the former, where very short code takes much time to explain. Without further ado, here it is: (Gist)

This little program reads a spelt number from the standard input and writes the corresponding number to the standard output. It supports numbers up to 1015-1 and still weighs only 256 243 bytes of C, if you ignore the backslash at the end of line which just wraps the line. (There is also a shorter version that can handle up to 19,999,999 and does not use long long.) The input should be correct, although it will handle “a” and “and” correctly and ignore some invalid words.

It assumes the ASCII character set and 2’s complement representation, and requires int and long long to be at least 32 and 64 bits long, but that’s all they expect from the compiler. For example, it does not matter whether char is signed or unsigned, EOF does not have to be -1, and so on. (Yes, I did keep it in mind while writing this program.)

While I left the explanation of the BF interpreter as a reader’s exercise, in this time I’ll give a detailed explanation of the program. Keep in mind that there are two versions of the program; the explanation is primarily for the longer version.

Overview

The program implements a simple stemmer specific to words used in the spelt number. The full list of recognized words is as follows:

zero     one      tw-      th(i)r-  fo(u)r-  fi-      six-
seven-   eigh-    nin-     ten      eleven   twelve
hundred  thousand million  billion  trillion
a        and      -teen    -ty

The stemmer only recognizes a set of letters from the full alphabet: a, b (only for the longer version), f, g, l, n, r, s, t, y. This is sufficient for the above set of words, and also simplifies the parser a bit as two suffixes reduce to exactly two recognized letters (-tn and -ty).

The program is roughly divided in three passes: a word reader, stemmer and accumulator. A word reader converts each word to easy-to-handle numbers, a stemmer recognizes the actual word and converts to the internal tokens, and the accumulator is updated as tokens are processed. I have reconstructed the code flow since the initial posting so that it is much harder to recognize these passes, but the following annotated code should be helpful.

long long n,u,m,b;
main(e,r){
    // This loop belongs to the first pass.
    for(;n++||(e=getchar()|32)>=0;b="ynwtsflrabg"[n%=11]-e?b:b*8+n)
        // This loop belongs to the second pass, except for "e<47" check
        // which is a part of the first loop.
        for(r=b%64-25;e<47&&b;b/=8)
            // This loop belongs to the second pass...
            for(n=19;n;"1+DIY/.K430x9G(kC["[n]-42&255^b||
                // ...except for this part which belongs to the third pass.
                (m+=n>15?n:n>9?m%u*~-u:~r?n+!r*16:n*16,b=0))
                    // So does this loop body.
                    u=1ll<<6177%n--*4;
    printf("%llx",m);
}

Note the arrangement of global and local variables, which spell… the “number”. Also each pass runs separately from other passes (contrary to the appearance) as each pass resets certain variables to zero as expected by other passes; the second pass, for example, always resets n to zero so the first pass can safely assume that n is zero when resumed.

First Pass: Word Reader

In the first pass, the program reads one word (separated by whitespaces, comma, period or hyphen), filters out the unrecognized letters and finally stores the result in the variable b. The word is case-insensitive.

The actual word is stored as octal or nonary number depending on the version. But how can 11 or 10 letters fit in 8 or 9 digits? The secret is that some letters are stored as two overlapping letters. In the longer version, for example, “a” maps to “ny”, “b” maps to “nn” and “g” maps to “nw”. This won’t make the stemmer confuse because:

  • "a" only appear as a prefix or within "thousand"; the latter can be stored as "tfyn", that should be equivalent to "tsan".
  • "b" only appear as a prefix of "billion", and has an additional "-lln" appended so won’t conflict with "nine" ("nn").
  • "g" only appear as a prefix of "eight".

The letter order and digit mapping is carefully chosen to avoid any problem. The shorter version uses nonary instead of octal but the driving principle is same.

Second Pass: Stemmer

In the second pass, the program searches the word in the word table. The table consists of 19 or 17 entries where the last entry is denoted as a trailing null character of the string. If the word is not found, then the program drops one character and continues to search that word, effectively stemming the word. This approach also means that we don’t have to specially handle plurals like “millions”.

Each entry contains an one-, two- or three-letter prefix of the corresponding word; the long word is truncated as much as possible while ensuring no ambiguity, so “thousand” (“tsan”, that is, “tfyn”) is stored as “tf”. The full list of table entries is as follows:

INDEX   WORD        STORED AS   PARSED AS
0       zero        r           0
1       one         n           1
2       tw-         tw          2
3       th(i)r-     tr          3
4       fo(u)r-     fr          4
5       fi-         f           5
6       six-        s           6
7       seven-      sn          7
8       eigh-       nw (g)      8
9       nin-        nn          9
10      million     l           (6)
11      billion     nnl (bl)    (9)
12      hundred     nr          (2)
13      thousand    tf (ts[a])  (3)
14      trillion    trl         (12)
15      and         nyn (an)    (0)
16      ten         tn          10
17      eleven      ln          11
18      twelve      twl         12

The stemmer also remembers the two-letter suffix r of the current word, so “-teen” (“-tn”) and “-ty” is properly handled in the third pass. They alters the tokenized number, so “thirteen” should be tokenized as 13.

The word for “a” does not appear in the table, but the equivalent word “ny” is stemmed as “n” just like “one”. But since “and”, or equivalently “nyn” shares a same prefix and still has to be ignored, there is an entry for “and” that is specially recognized as a no-op.

Third Pass: Accumulator

In the third pass, the program modifies the accumulator m according to tokenized words. There are two kinds of numbers:

  • Non-parenthesized number k adds k to the accumulator, so “twenty-three” (tokenized as 20 3) sums to 23.
  • Parenthesized number (k) lifts the lowermost k digits of the accumulator to the next positions. For example, “one thousand, two hundred, thirty-four” roughly tokenizes into 1 (3) 2 (2) 3 4. At the time (2) is handled the accumulator is 1002; the last two digits, 02, is erased and added to the next two positions (i.e. hundreds) so the accumulator becomes 1200.

The accumulator is hexadecimal (although digits A to F are unused). This makes a code for the latter very trivial; (3) would be implemented as follows:

m = m / 0x1000 * 0x1000 + m % 0x1000 * 0x1000;

The equivalent expression m += m % 0x1000 * 0xfff; is used in the program, where 0x1000 is easily constructed by the bitwise shift. It also ensures a token for “and”, (0), does nothing. In the other hands, the last three entries seems out of place but since they really maps to 0x10, 0x11 and 0x12 they are actually in place. (The shorter version does not have the analogue.)

The amount of bitwise shifts is stored as a single modulo operation. It makes use of the Chinese remainder theorem (CRT), so that for the selected indices the remainder of a constant divided by the index computes to a desired number. (In reality, the index plus 1 is used as we cannot divide by zero.) The order of table entries for parenthesized numbers is permuted so that the constant is minimized. Also note that the entry for “and” does not have to be (0); since “and” can only appear after other parenthesized numbers, it is sure that the last two digits are always zero. Therefore “and” can also map to (1) or (2), which is exploited in order to minimize the constant further.

When the program reaches EOF, it prints the accumulator as a hexadecimal number and terminates.

Conclusion

While this program is severely restricted both in the size and the feature, it features lots of things within that restriction. One has to run through several alternatives in order to write this monster; I tested at least ten other ways to reduce the shirt amount table, for example, until the CRT is proved very helpful (it is the latest addition to the current code).

The stemming approach of this program is heavily inspired by Doug McIlroy’s legendary spell checker1. While my program does much less in much less size, the principle that the careful thinking and design makes the program bright does not change. Although mine is more clearly directed to the fun. :p

Edit: FAQ

Q: Why is it a spoken number rather than a spelt number?
A: My bad, I never realized that a spelt number is the correct term; I’ve fixed them since then. Please note that I’m not a native English speaker (or writer, if you prefer) and notable for occasionally making critical mistakes.
Q: It’s a cruder version of this Perl Golf problem, isn’t it?

A: I knew its existence, though I never consulted it during the development of this program. Of course I also knew the awesomeness of Ton Hospel’s solution to that problem. However there are fundamental differences between them:

  • C does not have helpful utilities like Perl’s y/...// or crypt. Unless you sacrifice the portability.
  • My program does not allow for non-printable characters in the code. If you allow them you can shorten it a lot, but I think it is unfair for C golfing.
  • My program takes some care of invalid words. While the pre-filtering and word table seems to be inefficient and generate many false positives, other approaches (using the second letter, for example) generate much more false positives while being longer than the pre-filtering. This was the best trade-off I could come up with.
Q: I think the program can be shortened further.
A: Yes, I received this very question immediately after the posting. I originally refused the change, but just minutes of testing showed the huge improvement so I’ve updated the program. Thanks for @geniusleonid for the initial two-byte improvement! You are still welcome to fork the gist and optimize further; I’m looking forward to it.

  1. M. D. McIlroy, Development of a spelling list, IEEE Trans. on Communications 30 (1982) 91-99. Also see the sidebar 13.8 of Programming Pearls


노트들

  1. berkus reblogged this from arachneng and added:
    Yippie!
  2. shibats reblogged this from arachneng
  3. feralstructures reblogged this from arachneng
  4. skillrex reblogged this from arachneng and added:
    can’t even begin...hundred trillions. Amazing.
  5. arachneng posted this
텀블러를 씁니다.