## Finding the n best items in a list

3rd March 2015

For several of my web sites I wanted to show a small number of the most popular items in a particular category; for example, the top ten search terms, or the top 5 products. This let me to thinking about how to do this as efficiently as possible.

In other words, suppose you've got a list of items, each of which is associated with a popularity score; for example:

(("Greyhound" . 10) ("Wolfhound" . 12) ("Labrador" . 23) ("Poodle" . 7) ...)

What's the most efficient way of getting the best n items?

### Simple solution

If we want the top n elements of the list, by score, the obvious simple-minded solution is to sort the list first, according to the scores, and then find the top five elements:

We first define a procedure to give the first n elements of a list as follows:

(defun first-n (n list) (if (> (length list) n) (subseq list 0 n) list))

The solution is then achieved by the following procedure **best-n***:

(defun best-n* (n list predicate &key key) (first-n n (sort (copy-seq list) predicate :key key)))

However, we're sorting the whole list and then throwing away all but n of the items, which is very wasteful. There must be a more efficient way.

### Testing the simple version

First let's create some test data:

(defparameter *test-data* (let (result) (dotimes (n 100000) (let ((string (format nil "~r" n))) (push (cons string (length string)) result))) (nreverse result)))

This creates an association list of the names of the first 100000 integers, and their lengths. So the first five items are:

(("zero" . 4) ("one" . 3) ("two" . 3) ("three" . 5) ("four" . 4) ...)

Testing the simple-minded version:

CL-USER 1 > (time (best-n* 5 *test-data* #'> :key #'cdr)) Timing the evaluation of (BEST-N* 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR)) User time = 0.206 Allocation = 1300100 bytes

### Improved version

A better approach is to scan through all the items, maintaining a list of just the best n items.

To do this we first define a helper function, **insert**, that inserts a new item into a sorted list at the correct position:

(defun insert (item list predicate &key key) (cond ((null list) (list item)) ((not (funcall predicate (funcall key item) (funcall key (car list)))) (cons item list)) (t (cons (car list) (insert item (cdr list) predicate :key key)))))

Now here's the improved version of **best-n**. This works as follows:

It maintains a list of the top n items found so far, in **result**, with the lowest at the top. For each item in the list:

- If there are fewer than n items in
**result**, insert the new item at the correct sorted position in**result**. - Otherwise if the item beats the lowest item in result, drop the lowest item and insert it in the correct position.
- Otherwise ignore it.

Here's the routine:

(defun best-n (n list predicate &key (key #'identity)) (let ((r 0) result) (dolist (x list) (cond ((< r n) (setq result (insert x result predicate :key key)) (incf r)) ((funcall predicate (funcall key x) (funcall key (car result))) (setq result (cdr (insert x result predicate :key key)))))) (nreverse result)))

### Here's an example of how it works:

After 5 items have been scanned **result** contains the items in reverse order:

(("two" . 3) ("one" . 3) ("four" . 4) ("zero" . 4) ("three" . 5))

The next item is:

("five" . 4)

This is larger than the first item in the list, so the first item is removed, and the new item is inserted at the correct position:

(("one" . 3) ("five" . 4) ("four" . 4) ("zero" . 4) ("three" . 5))

This process is repeated for all the remaining items in the list.

### Testing the improved version

Let's run the same test data with the improved version:

CL-USER 2 > (time (best-n 5 *test-data* #'> :key #'cdr)) Timing the evaluation of (BEST-N 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR)) User time = 0.020 Allocation = 80040 bytes

The improved version is a factor of 10 faster and uses less than a tenth of the storage.

blog comments powered by Disqus