## Finding the n best items in a list

3rd March 2015

For several of my web sites I wanted to show a small number of the most popular items in a particular category; for example, the top ten search terms, or the top 5 products. This let me to thinking about how to do this as efficiently as possible.

In other words, suppose you've got a list of items, each of which is associated with a popularity score; for example:

`(("Greyhound" . 10) ("Wolfhound" . 12) ("Labrador" . 23) ("Poodle" . 7) ...)`

What's the most efficient way of getting the best n items?

### Simple solution

If we want the top n elements of the list, by score, the obvious simple-minded solution is to sort the list first, according to the scores, and then find the top five elements:

We first define a procedure to give the first n elements of a list as follows:

```(defun first-n (n list)
(if (> (length list) n)
(subseq list 0 n)
list))```

The solution is then achieved by the following procedure best-n*:

```(defun best-n* (n list predicate &key key)
(first-n n (sort (copy-seq list) predicate :key key)))```

However, we're sorting the whole list and then throwing away all but n of the items, which is very wasteful. There must be a more efficient way.

### Testing the simple version

First let's create some test data:

```(defparameter *test-data*
(let (result)
(dotimes (n 100000)
(let ((string (format nil "~r" n)))
(push (cons string (length string)) result)))
(nreverse result)))```

This creates an association list of the names of the first 100000 integers, and their lengths. So the first five items are:

`(("zero" . 4) ("one" . 3) ("two" . 3) ("three" . 5) ("four" . 4) ...)`

Testing the simple-minded version:

```CL-USER 1 > (time (best-n* 5 *test-data* #'> :key #'cdr))

Timing the evaluation of (BEST-N* 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR))

User time    =        0.206
Allocation   = 1300100 bytes  ```

### Improved version

A better approach is to scan through all the items, maintaining a list of just the best n items.

To do this we first define a helper function, insert, that inserts a new item into a sorted list at the correct position:

```(defun insert (item list predicate &key key)
(cond
((null list)
(list item))
((not (funcall predicate (funcall key item) (funcall key (car list))))
(cons item list))
(t
(cons (car list) (insert item (cdr list) predicate :key key)))))```

Now here's the improved version of best-n. This works as follows:

It maintains a list of the top n items found so far, in result, with the lowest at the top. For each item in the list:

• If there are fewer than n items in result, insert the new item at the correct sorted position in result.
• Otherwise if the item beats the lowest item in result, drop the lowest item and insert it in the correct position.
• Otherwise ignore it.

Here's the routine:

```(defun best-n (n list predicate &key (key #'identity))
(let ((r 0) result)
(dolist (x list)
(cond
((< r n) (setq result (insert x result predicate :key key)) (incf r))
((funcall predicate (funcall key x) (funcall key (car result)))
(setq result (cdr (insert x result predicate :key key))))))
(nreverse result)))```

### Here's an example of how it works:

After 5 items have been scanned result contains the items in reverse order:

`(("two" . 3) ("one" . 3) ("four" . 4) ("zero" . 4) ("three" . 5))`

The next item is:

`("five" . 4)`

This is larger than the first item in the list, so the first item is removed, and the new item is inserted at the correct position:

`(("one" . 3) ("five" . 4) ("four" . 4) ("zero" . 4) ("three" . 5))`

This process is repeated for all the remaining items in the list.

### Testing the improved version

Let's run the same test data with the improved version:

```CL-USER 2 > (time (best-n 5 *test-data* #'> :key #'cdr))

Timing the evaluation of (BEST-N 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR))

User time    =        0.020
Allocation   = 80040 bytes```

The improved version is a factor of 10 faster and uses less than a tenth of the storage.