Lispology

Designing an electronic word game

7th March 2016

A while ago I designed an electronic word game, which I called the Conundrometer ^[1]. The idea was to display an anagram of a nine-letter word, and you were given 30 seconds to solve the anagram. It was inspired by the final Conundrum round of the popular UK TV programme, Countdown.

I was going to base the game on an ATmega328, as used in the popular Arduino single-board computer, which has 32Kbytes of memory available for program and data storage. I expected the program to be about 4Kbytes, leaving about 28Kbytes for the word data.

Here I describe my attempts to get the word data to fit in the available space.

Uncompressed

My starting point was a word list of 8678 nine-letter words. For example, the first eight words are:

"aardvarks" "abandoned" "abasement" "abatement" "abattoirs" "abdicated" "abdicates" "abdominal"

You can download them here: WordList9

Here's a program to read them into a list:

(defparameter
    *allwords9*
  (with-open-file
      (str "WordList9")
    (let (result)
      (loop
       (let ((word (read-line str nil nil)))
         (when (null word) (return result))
         (push word result))))))

A quick calculation shows that storing these as standard C strings would require at least 8678×9 bytes, or about 78Kbytes, well over the 28Kbytes available. It was clear that some sort of compression method was needed to fit them in.

Packing

Since the words only require the letters A-Z there is no need to use a full byte to store each letter; you can specify 26 letters with a 5-bit number. You can therefore pack a nine-letter word into 5×9 or 45 bits, which is 6 bytes.

Here's a routine that converts a nine-letter string into a 45-bit integer:

(defun pack-string (string)
  (let ((result 0))
    (dotimes (x 9)
      (setq result (* result 32))
      (incf result (- (char-code (char string x)) (char-code #\a))))
    result))

So, for example:

CL-USER > (pack-string "aaaaaaaaa")
0

CL-USER > (pack-string "aardvarks")
18376312146

CL-USER > (pack-string "zucchinis")
28177207670034

This reduces the requirements to 8678×6 bytes, or about 52Kbytes. Unfortunately this is still too much.

Huffman coding

Some letters occur in English words much more frequently than others, and we can take advantage of this to use a variable number of bits to encode each letter, with short bit sequences used for the more common letters, such as 'e' and 's', and longer sequences for the rarer letters, such as 'z'. We make sure that every sequence can be uniquely identified, using a system called Huffman coding, after its inventor ^[2].

First this routine make-frequencies calculates the number of occurrences of each letter in the list:

(defun make-frequencies (list)
  (let (frequencies)
    (map nil 
         #'(lambda (word) 
             (map nil 
                  #'(lambda (c)
                      (let ((item (rassoc c frequencies)))
                        (cond
                         (item (incf (car item)))
                         (t (setq frequencies (push (cons 1 c) frequencies))))))
                  word))
         list)
    frequencies))

To execute this run:

(make-frequencies *all-words9*)

This gives:

((163 . #\q) (121 . #\j) (830 . #\v) (212 . #\x) (2076 . #\m) (1385 . #\b)
 (696 . #\w) (1038 . #\y) (1091 . #\f) (3235 . #\d) (5736 . #\a) (5626 . #\r)
 (2138 . #\p) (9310 . #\e) (693 . #\k) (5334 . #\t) (2538 . #\g) (3976 . #\l)
 (4667 . #\o) (7116 . #\s) (5522 . #\n) (6905 . #\i) (1747 . #\h) (3147 . #\c)
 (2532 . #\u) (268 . #\z))

Unsurprisingly the most common letter is 'e' with 9310 occurrences.

To convert this into a Huffman tree is simple in Lisp:

Sort the list of frequencies with the smallest at the front.
Remove the two lowest-frequency letters from the list.
Replace them by a single node containing the sum of the frequencies, followed by both letters.
Repeat until the list has been converted into a tree containing a single root node.

Here's the program:

(defun huffman-tree (list)
  (loop
   (sort list #'< :key #'car)
   (let* ((a (first list))
        (b (second list))
        (item (cons (+ (car a) (car b)) (list a b)))) 
   (setq list (push item (cddr list))))
   (when (null (cdr list)) (return (car list)))))

Finally we flatten the tree to generate the code for each letter:

(defun flatten-tree (tree &optional code)
  (cond
   ((atom (cdr tree)) (list (cons (cdr tree) (reverse code))))
   (t
    (append
     (flatten-tree (second tree) (cons 0 code))
     (flatten-tree (third tree) (cons 1 code))))))

Running these routines using:

(flatten-tree (huffman-tree (make-frequencies *allwords9*)))

gives:

((#\s 0 0 0) (#\j 0 0 1 0 0 0 0 0) (#\q 0 0 1 0 0 0 0 1) (#\x 0 0 1 0 0 0 1 0)
 (#\z 0 0 1 0 0 0 1 1) (#\v 0 0 1 0 0 1) (#\h 0 0 1 0 1) (#\l 0 0 1 1)
 (#\m 0 1 0 0 0) (#\y 0 1 0 0 1 0) (#\f 0 1 0 0 1 1) (#\o 0 1 0 1) (#\e 0 1 1)
 (#\p 1 0 0 0 0) (#\u 1 0 0 0 1) (#\g 1 0 0 1 0) (#\b 1 0 0 1 1 0)
 (#\k 1 0 0 1 1 1 0) (#\w 1 0 0 1 1 1 1) (#\t 1 0 1 0) (#\n 1 0 1 1)
 (#\r 1 1 0 0) (#\a 1 1 0 1) (#\c 1 1 1 0 0) (#\d 1 1 1 0 1) (#\i 1 1 1 1))

where each letter is followed by the variable-length binary code used to encode it.

For example, the two most common letters 'e' and 's' are represented as "011" and "000" respectively, and the two least common letters, 'j' and 'q', are represented as "00100000" and "00100001".

Finally, we can calculate how much storage the wordlist will require:

(defun evaluate (list)
  (let* ((frequencies (make-frequencies list))
         (tree (huffman-tree frequencies))
         (codes (flatten-tree tree)))
    (reduce #'+ (map 'list #'(lambda (code) 
                               (let* ((c (car code))
                                      (bits (length (cdr code)))
                                      (count (car (rassoc c frequencies))))
                                 (* bits count)))
                     codes))))

To run it, do:

CL-USER > (evaluate *allwords9*)
329306

We will need 329306 bits for the entire Huffman-encoded word list, or just over 41Kbytes.

Unfortunately, as I expected, the benefit from Huffman encoding isn't worth the effort of programming it, but it was useful to see how effective it could be.

Storing differences

The breakthrough came from the idea of storing the differences between successive words. To illustrate how this works consider the first two words in the wordlist, "aardvarks" and "abandoned". Coding each of these as five bits per character we end up with the decimal numbers:

18376312146 and 34799563907.

The difference between these is 16423251761 which can be fitted in 34 bits. To find any word we start at the beginning of the list, and sum successive differences until we get to the word we want.

The differences vary wildly; from 1 ("addressed" to "addressee"), to 943455207794 ("pyromania" to "quadrants"). I therefore used a variable number of bytes to store the differences. Each difference was divided into 7-bit chunks; this was output as a sequence of bytes, where the top bit was 0 to indicate that there were more chunks, or a 1 to indicate that this was the last chunk:

(defun encode-diff (d)
  (let (list)
    (push (logior (logand d #x7F) #x80) list)
    (loop
     (setf d (ash d -7))
     (when (zerop d) (return list))
     (push (logand d #x7F) list))))

For example:

CL-USER > (encode-diff (- (pack-string "aardvarks") (pack-string "aaaaaaaaa")))
(68 58 65 10 210)

I tested how much memory this would require with test-size:

(defun test-size (words)
  (let* ((start "aaaaaaaaa")
         (size 0))
    (dolist (w words)
      (let* ((gap (- (pack-string w) (pack-string start)))
             (data (encode-difference gap)))
        (incf size (length data))
        (setq start w)))
    size))

Run this with:

CL-USER > (test-size *allwords9*)
30312

The result is just over 30Kbytes, within the 32Kbyte limit, but without enough room to spare for the program.

Finally, on a hunch I tried the same technique with each word reversed, so for example, the first eight words in the wordlist are now:

"acinomrah" "adalihcne" "adnaromem" "adnerefer" "aducarrab" "aegnardyh" "aehrronog" "aeohrraid"

These are the words "harmonica", "enchilada", etc. Here's the test:

CL-USER > (test-size (sort (map 'list #'reverse *allwords9*) #'string<))
27210

This reduced the memory requirement by about 2Kbytes! Presumably this is because the differences are more evenly distributed. Finally I was on target!

^ Conundrometer Game on Technoblogy.
^ Huffman coding on Wikipedia.

Next: Flexibility in Lisp

Previous: Calculating a resistor network

Most popular posts

About me

Feeds

Designing an electronic word game

Uncompressed

Packing

Huffman coding

Storing differences