## Finding duplicates in a list

26th December 2023

Apparently a popular programming interview question is "how would you find all the duplicate items in a list?". I thought it would be interesting to code the alternative approaches in Common Lisp, using recursive functions for elegance, and compare their performance on a list of integers.

### 1. Simple comparison

The most obvious approach is to compare each item in the list with every other item in the list, and if there's a match add the item to a list of duplicates.

In fact it's clear that we don't need to compare every item with *every* other item, but only with the subsequent items in the list, because if the duplicate is earlier in the list we will have already found it when searching with its twin.

Here's a recursive definition that prints the duplicates:

(defun print-duplicates1 (lst) (cond ((null lst) nil) (t (when (member (car lst) (cdr lst)) (print (car lst))) (print-duplicates1 (cdr lst)))))

If your Lisp has Tail Call Optimisation this will be as fast as an iterative solution.

We can modify this slightly to return the duplicates as a list as follows:

(defun duplicates1 (lst) (let (result) (labels ((dup (lst) (cond ((null lst) nil) (t (when (member (car lst) (cdr lst)) (push (car lst) result)) (dup (cdr lst)))))) (dup lst) result)))

For example:

> (duplicates1 '(3 1 4 1 5 9 2 6 5 3 5 8)) (5 5 1 3)

The worst time complexity of the simple comparison version is O(n^{2}).

### 2. Sort and find pairs

A better approach is to sort the list into order, and then scan the list to check for pairs of duplicates. Since the list is now sorted the duplicates will be adjacent, and only one pass through is necessary.

Here's a recursive function to print all the adjacent pairs in a list:

(defun print-pairs (lst) (cond ((null (cdr lst)) nil) (t (when (eq (first lst) (second lst)) (print (first lst))) (print-pairs (cdr lst)))))

Again, if your Lisp has Tail Call Optimisation this will be as fast as an iterative solution.

We can use this to print all the duplicates in a list with:

(defun print-duplicates2 (lst) (print-pairs (sort (copy-list lst) #'<)))

Alternatively we can write **print-pairs** more succintly by taking advantage of **mapl**:

(defun print-pairs (lst) (mapl #'(lambda (l) (when (eq (first l) (second l)) (print (first l)))) lst))

The final version to return a list of the duplicates is:

(defun duplicates2 (lst) (let (result) (mapl #'(lambda (l) (when (eq (first l) (second l)) (push (first l) result))) (sort (copy-list lst) #'<)) result))

For example:

> (duplicates2 '(3 1 4 1 5 9 2 6 5 3 5 8)) (5 5 3 1)

The worst time complexity of the sort and find pairs version is O(n log n).

### 3. Use a hash table to count occurrences

The third approach is to use a hash table to count the number of occurrences of each item in the list. This version is straightforward:

(defun duplicates3 (lst) (let ((h (make-hash-table)) result) (mapc #'(lambda (x) (when (gethash x h) (push x result)) (incf (gethash x h 0))) lst) result))

For example:

> (duplicates3 '(3 1 4 1 5 9 2 6 5 3 5 8)) (5 3 5 1)

The worst time complexity of the hash table version is O(n).

### Performance

Finally, I tested the performance of the three versions using LispWorks. I generated a list of a 100000 random numbers using:

(defparameter a (let (result) (dotimes (x 100000 result) (push (random most-positive-fixnum) result))))

I then timed each of the functions on this list:

Version |
Time |
Theoretical |

1. Simple comparison | 16.4 s | 3000 s |

2. Sort and find pairs | 0.1 s | 0.15 s |

3. Hash table | 0.03 s | 0.03 s |

The theoretical times are the predicted times for versions 1 and 2 based on the time for version 3. I'm not sure why the simple comparison is so much faster than the time complexity predicts.

### Update

18th June 2024: After writing the article What use is 'mapcon'? I realised that **duplicates2** can be written more succinctly using **mapcon**:

(defun duplicates2 (lst) (mapcon #'(lambda (l) (when (eq (first l) (second l)) (list (first l)))) (sort (copy-list lst) #'<))) > (duplicates2 '(3 1 4 1 5 9 2 6 5 3 5 8)) (1 3 5 5)

The time taken is virtually the same as the original version.

blog comments powered by Disqus