Lispology

Finding duplicates in a list

Tue, 26 Dec 2023 00:00:00 GMT

Apparently a popular programming interview question is "how would you find all the duplicate items in a list?". I thought it would be interesting to code the alternative approaches in Common Lisp, using recursive functions for elegance, and compare their performance on a list of integers.

1. Simple comparison

The most obvious approach is to compare each item in the list with every other item in the list, and if there's a match add the item to a list of duplicates.

In fact it's clear that we don't need to compare every item with every other item, but only with the subsequent items in the list, because if the duplicate is earlier in the list we will have already found it when searching with its twin.

Here's a recursive definition that prints the duplicates:

(defun print-duplicates1 (lst)
  (cond
   ((null lst) nil)
   (t (when (member (car lst) (cdr lst)) (print (car lst)))
      (print-duplicates1 (cdr lst)))))

If your Lisp has Tail Call Optimisation this will be as fast as an iterative solution.

We can modify this slightly to return the duplicates as a list as follows:

(defun duplicates1 (lst)
  (let (result)
    (labels ((dup (lst)
             (cond
              ((null lst) nil)
              (t (when (member (car lst) (cdr lst)) (push (car lst) result))
                 (dup (cdr lst))))))
      (dup lst)
      result)))

For example:

> (duplicates1 '(3 1 4 1 5 9 2 6 5 3 5 8))
(5 5 1 3)

The worst time complexity of the simple comparison version is O(n²).

2. Sort and find pairs

A better approach is to sort the list into order, and then scan the list to check for pairs of duplicates. Since the list is now sorted the duplicates will be adjacent, and only one pass through is necessary.

Here's a recursive function to print all the adjacent pairs in a list:

(defun print-pairs (lst)
  (cond
   ((null (cdr lst)) nil)
   (t (when (eq (first lst) (second lst)) (print (first lst)))
      (print-pairs (cdr lst)))))

Again, if your Lisp has Tail Call Optimisation this will be as fast as an iterative solution.

We can use this to print all the duplicates in a list with:

(defun print-duplicates2 (lst)
  (print-pairs (sort (copy-list lst) #'<)))

Alternatively we can write print-pairs more succintly by taking advantage of mapl:

(defun print-pairs (lst)
  (mapl #'(lambda (l) (when (eq (first l) (second l)) (print (first l)))) lst))

The final version to return a list of the duplicates is:

(defun duplicates2 (lst)
  (let (result)
    (mapl #'(lambda (l) (when (eq (first l) (second l)) (push (first l) result)))
          (sort (copy-list lst) #'<))
    result))

For example:

> (duplicates2 '(3 1 4 1 5 9 2 6 5 3 5 8))
(5 5 3 1)

The worst time complexity of the sort and find pairs version is O(n log n).

3. Use a hash table to count occurrences

The third approach is to use a hash table to count the number of occurrences of each item in the list. This version is straightforward:

(defun duplicates3 (lst)
  (let ((h (make-hash-table))
        result)
    (mapc #'(lambda (x)
              (when (gethash x h) (push x result))
              (incf (gethash x h 0))) lst)
    result))

For example:

> (duplicates3 '(3 1 4 1 5 9 2 6 5 3 5 8))
(5 3 5 1)

The worst time complexity of the hash table version is O(n).

Performance

Finally, I tested the performance of the three versions using LispWorks. I generated a list of a 100000 random numbers using:

(defparameter a 
  (let (result) 
    (dotimes (x 100000 result) (push (random most-positive-fixnum) result))))

I then timed each of the functions on this list:

Version	Time	Theoretical
1. Simple comparison	16.4 s	3000 s
2. Sort and find pairs	0.1 s	0.15 s
3. Hash table	0.03 s	0.03 s

The theoretical times are the predicted times for versions 1 and 2 based on the time for version 3. I'm not sure why the simple comparison is so much faster than the time complexity predicts.

A minimal GIF decoder

Fri, 16 Dec 2022 00:00:00 GMT

I recently published a small GIF decoder designed to run on AVR microcontrollers ^[1].

When I'm writing a tricky algorithm in C I often write it in Lisp first, because it's far easier to get it right, and to debug if it doesn't work. So here is the Lisp prototype of my GIF decoder, in case it's useful.

LZW compression

The GIF format treats the image as a linear stream of pixel colour values. The LZW compression it uses works by finding a sequence of pixels it has already encountered, and it encodes each sequence as a variable-length code of from 3 to 12 bits. These codes are then output as a continuous stream of bits.

Decoding a GIF requires the decoder to divide the input stream into codes containing the correct number of bits. These codes are then looked up in a 4096-element table, to translate them into the appropriate sequence of one or more pixels represented by that code.

It would be very memory intensive if we had to store the complete pixel sequence corresponding to every code in the table. Fortunately, every sequence is equivalent to an earlier sequence, extended by the addition of one pixel. We can therefore represent the sequence as a pointer to the earlier sequence in the table, plus the additional pixel.

The clever thing about LZW compression is that the lookup table doesn't need to be included with the GIF image, because it can be reconstructed dynamically as the code values are read in from the input stream. For a full description of how the GIF format works see the Wikipedia page ^[2]. I also got a lot of help from articles by Joshua Davies ^[3] ^[4] and Larry Bank ^[5].

Colour table

A GIF image can contain a colour table of up to 256 colours, so there also needs to be a 256-byte array to store this colour table. Each colour is defined by three bytes, giving the colour's R, G, and B components. Since this GIF decoder is intended for displaying images on TFT displays with a 5-6-5 colour space, we can reduce each entry in the colour table to 16 bits.

Restrictions

This decoder doesn't support: the older GIF87a format, animated GIFs, local colour tables, transparency, interlaced GIFs, or other extensions.

The compression table

Each element in the compression table used by the GIF decoder is defined by a two-dimensional array, table:

(let ((table (make-array '(4096 2))))

In each table entry (aref table n 0) is a pointer to an earlier entry, and (aref table n 1) is the pixel colour of the last pixel in the sequence encoded by this table entry.

A second array colour-table stores the GIF colour table.

Reading the bit stream

The problem of reading variable-length codes from the input file is handled by the routine get-n-bits:

(defun get-n-bits (n str)
  (loop
   (unless (< *nbuf* n) (return))
   ;; Enough left in block?
   (when (zerop *block*)
     (setq *block* (read-byte str)))
   (setq *buf* (logior (ash (read-byte str) *nbuf*) *buf*))
   (decf *block*)
   (incf *nbuf* 8))
  (let ((result (logand *buf* (1- (ash 1 n)))))
    (setq *buf* (ash *buf* (- n)))
    (decf *nbuf* n)
    result))

A call to (get-n-bits n) returns the next n-bit code from the file. This maintains a bit buffer in the global variable *buf*, and *nbuf* specifies how many significant bits are still available in *buf*. If there are fewer than n bits in the buffer, read-byte is called to read in another 8 bits and append them to the end of the buffer.

A further complication is that the bit stream is divided into blocks of up to 255 bytes, with each block preceded by a length byte. The global variable *block* stores the number of bytes remaining in the current block. If this reaches zero, it is reset by reading in another length byte.

Getting the first character of a sequence

Given a compression table element c, first-pixel follows the pointers to find the first pixel colour in the sequence it represents:

(defun first-pixel (table c)
  (let ((rest (aref table c 0))
        (this (aref table c 1)))
    (cond
     ((= -1 rest) this)
     (t (first-pixel table rest)))))

Outputting a sequence

The routine plot-sequence plots the sequence of pixels represented by the compression table element c.

(defun plot-sequence (table c)
  (let ((rest (aref table c 0))
        (this (aref table c 1)))
    (unless (= -1 rest) (plot-sequence table rest))
    (plot this)))

This effectively finds the first pixel in the sequence, by following the pointers back until a -1 pointer is reached, then finds the next pixel in the sequence, and so on, ending with the last pixel:

(aref table c 1)

Displaying the output

For debugging purposes this version of the GIF decoder prints the output as a series of digits to the Listener:

(defun plot (c)
  (format t "~a " c)
  (incf *x*)
  (when (= *x* *width*) (setq *x* 0) (incf *y*) (format t "~%")))

It could easily be modified to plot points in the appropriate colours from colour-table.

Skipping bytes

The routine skip-n-bytes is called to skip a series of unneeded bytes in the input file:

(defun skip-n-bytes (n str)
  (dotimes (x n) (read-byte str)))

Power of two test

The routine power-2-p checks that its argument is a power of two. It is used to increase the bit length of codes being read when the size of the compression table reaches a power of two:

(defun power-2-p (x)
  (zerop (logand x (1- x))))

Displaying the GIF image

Finally, here's the definition of parse-gif:

(defun parse-gif ()
  (let ((file (capi:prompt-for-file "GIF File" :pathname *last-file*)))
    (setq *last-file* file)
    (with-open-file (str file :element-type '(unsigned-byte 8))
      (skip-n-bytes 6 str) ; id and version
      (let* ((width (+ (read-byte str) (ash (read-byte str) 8)))
             (height (+ (read-byte str) (ash (read-byte str) 8)))
             (fld (read-byte str))
             (bk (read-byte str))
             (aspect (read-byte str))
             (colbits (max (1+ (logand fld 7)) 2))
             (colours (expt 2 colbits))
             (colour-table (make-array colours))
             (clr colours)
             (end (1+ colours))
             (free (1+ end))
             (bits (1+ colbits)))
        (format t "width: ~a height: ~a fld: ~b bk: ~a aspect: ~a colours: ~a~%"
                width height fld bk aspect colours)
        (setq *width* width)
        ;; Parse colour table
        (dotimes (colour colours)
          (setf (aref colour-table colour)
                (rgb (read-byte str) (read-byte str) (read-byte str))))
        ;; Make and initialise the compression table
        (let ((table (make-array '(4096 2))))
          (dotimes (n colours) (setf (aref table n 0) -1 (aref table n 1) n))
          ;; Parse blocks
          (loop
           (let ((header (read-byte str)))
             (cond
              ((eq header #x2c)
               ;; Parse image descriptor
               (skip-n-bytes 10 str)
               (setq *nbuf* 0)
               (setq *buf* 0)
               (setq *block* 0)
               (setq *x* 0)
               (let (code last)
                 (loop                  
                  (setq last code)
                  (setq code (get-n-bits bits str))
                  (cond
                   ((= code clr)
                    (setq free (1+ end))
                    (setq bits (1+ colbits))
                    (setq code nil))
                   ((= code end) (return))
                   ((null last)
                    (plot-sequence table code))
                   ((< code free) ; Found in table              
                    (setf (aref table free 0) last
                          (aref table free 1) (first-pixel table code))
                    (plot-sequence table code)
                    (incf free)
                    (when (power-2-p free) (incf bits)))
                   ((= code free) ; Not found in table
                    (setf (aref table free 0) last
                          (aref table free 1) (first-pixel table last))
                    (plot-sequence table free)
                    (incf free)
                    (when (power-2-p free) (incf bits)))
                   (t nil)))
                 (skip-n-bytes 1 str)
                 (format t "Free: ~a~%" free)))
              ((eq header #x21)
               ;; Skip extension block
               (let ((gce (read-byte str))
                     (length (read-byte str)))
                 (skip-n-bytes (1+ length) str)))
              ((eq header #x3b)
               (return))
              ;; Unknown header
              (t nil)))))))))

The routine parse-gif first reads in the GIF header, which specifies the width and height of the image, and the global colour table. If the image has c colours it then initialises the first c entries in the compression table, table, with the corresponding colour index, c. A further two entries are reserved for two special codes: clear, which resets the compression table if it is full, and end which indicates the end of the image block. The first free entry in the table is therefore at c+2.

Next parse-gif checks for three types of block: an image block, prefixed by #x2C, an extension block prefixed by #x21 which is ignored, and a terminating byte #x3B that ends the file.

In the image block the most important variables are free, that specifies the index of the next free element in the compression table, and bits, that specifies the current size of the variable-length codes.

The compression table is extended as follows, based on each code that is read in from the input file:

If code < free the code is already in the table. A new entry is added to the table consisting of a pointer to the last sequence, followed by the first pixel in the current sequence.
If code = free the code is not already in the table. A new entry is added to the table consisting of a pointer to the last sequence, followed by the first pixel in the last sequence.

In each case the sequence corresponding to code is output.

This continues until either a clear code is encountered, in which case the compression table is cleared and processing continues, or an end code is encountered, in which case the image is finished.

Running the program

I tested the program with a 48 x 48 pixel 8-colour GIF:

Here's the program output, with each colour printed as a different digit in the Listener:

Here's the complete program: Minimal GIF decoder program.

^ Minimal GIF Decoder on Technoblogy.
^ GIF on Wikipedia.
^ How LZW (GIF) Compression Works on Command Line Fanatic.
^ Inside the GIF File Format on Command Line Fanatic.
^ A Low Memory GIF Decoder on Follow me down the optimization rabbit hole.

Testing for rotations

Mon, 14 Mar 2022 00:00:00 GMT

I was recently browsing through some programming interview questions, and found one that I thought was quite interesting:

Write a function that takes two lists, and checks whether one list is a rotated version of the other.

For example:

(rotation-p '(a b c d e f) '(e f a b c d))

should return t, whereas:

(rotation-p '(a b c d e f) '(f e a b c d))

should return nil.

You can assume that the two lists are the same length.

Have a go at the problem before reading my four attempts at solving it below.

Check each rotation

Perhaps the most obvious solution is to generate each of the n rotations of the test list, where n is the length of each list, and then compare each of them with the target list:

(defun rotation-p (x y)
  (dotimes (n (length x) nil)
    (when (equal x (append (subseq y n) (subseq y 0 n))) (return t))))

The good thing about this solution is that it's easy to understand, but it's quite inefficient, and does a lot of appending of lists. Is there a more efficient way?

Find how far each item has moved

A more efficient solution is to do the following: for each item in the test list, find how far its position has moved to its new position in the target list, taking the distance modulo the length of the lists to allow for wrapping round.

If and only if all the distances are the same, the test list is a rotation of the target list:

(defun rotation-p (x y)
  (let ((len (length x))
        (d (position (first x) y))
        (i -1))
    (every #'(lambda (e) (= d (mod (- (position e y) (incf i)) len))) x)))

Here d is the distance that each item has moved as a result of the rotation.

Note that this solution assumes that there are no repeated items in the lists. It is relative easy to extend it to allow for that: check the position of each matching item in the target list and make sure that at least one has moved distance d.

Check if it is a subsequence

A very simple solution, if you're allowed to use search, is to check whether the test list occurs in the target list appended to a copy of itself:

(defun rotation-p (x y)
  (search x (append y y)))

Recursive version

Finally, for many problems it's possible to write a very neat recursive solution, but in this case I struggled a bit to find one. Here's the best I could do. The logic is as follows:

If the lists are length 1 they are trivially rotations of each other.
Find the positions of the first two items from the test list in the target list. If they don't differ by 1, modulo the length of the lists, the test fails.
Otherwise call rotation-p on the test list and the target list with the first item of the test list removed from each.

Here's the definition:

(defun rotation-p (x y)
  (cond
   ((= (length x) 1) t)
   ((/= (mod 
         (- (position (second x) y) (position (first x) y)) 
         (length x)) 
        1) 
    nil)
   (t (rotation-p (cdr x) (remove (first x) y)))))

If you can do better let me know!