Lispology

Finding the n best items in a list

3rd March 2015

For several of my web sites I wanted to show a small number of the most popular items in a particular category; for example, the top ten search terms, or the top 5 products. This let me to thinking about how to do this as efficiently as possible.

In other words, suppose you've got a list of items, each of which is associated with a popularity score; for example:

(("Greyhound" . 10) ("Wolfhound" . 12) ("Labrador" . 23) ("Poodle" . 7) ...)

What's the most efficient way of getting the best n items?

Simple solution

If we want the top n elements of the list, by score, the obvious simple-minded solution is to sort the list first, according to the scores, and then find the top five elements:

We first define a procedure to give the first n elements of a list as follows:

(defun first-n (n list)
  (if (> (length list) n)
    (subseq list 0 n)
    list))

The solution is then achieved by the following procedure best-n*:

(defun best-n* (n list predicate &key key)
  (first-n n (sort (copy-seq list) predicate :key key)))

However, we're sorting the whole list and then throwing away all but n of the items, which is very wasteful. There must be a more efficient way.

Testing the simple version

First let's create some test data:

(defparameter *test-data*
  (let (result)
    (dotimes (n 100000)
      (let ((string (format nil "~r" n)))
        (push (cons string (length string)) result)))
    (nreverse result)))

This creates an association list of the names of the first 100000 integers, and their lengths. So the first five items are:

(("zero" . 4) ("one" . 3) ("two" . 3) ("three" . 5) ("four" . 4) ...)

Testing the simple-minded version:

CL-USER 1 > (time (best-n* 5 *test-data* #'> :key #'cdr))

Timing the evaluation of (BEST-N* 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR))

User time    =        0.206
Allocation   = 1300100 bytes

Improved version

A better approach is to scan through all the items, maintaining a list of just the best n items.

To do this we first define a helper function, insert, that inserts a new item into a sorted list at the correct position:

(defun insert (item list predicate &key key)
  (cond
   ((null list) 
    (list item))
   ((not (funcall predicate (funcall key item) (funcall key (car list)))) 
    (cons item list))
   (t 
    (cons (car list) (insert item (cdr list) predicate :key key)))))

Now here's the improved version of best-n. This works as follows:

It maintains a list of the top n items found so far, in result, with the lowest at the top. For each item in the list:

If there are fewer than n items in result, insert the new item at the correct sorted position in result.
Otherwise if the item beats the lowest item in result, drop the lowest item and insert it in the correct position.
Otherwise ignore it.

Here's the routine:

(defun best-n (n list predicate &key (key #'identity))
  (let ((r 0) result)
    (dolist (x list)
      (cond
       ((< r n) (setq result (insert x result predicate :key key)) (incf r))
       ((funcall predicate (funcall key x) (funcall key (car result)))
        (setq result (cdr (insert x result predicate :key key))))))
    (nreverse result)))

Here's an example of how it works:

After 5 items have been scanned result contains the items in reverse order:

(("two" . 3) ("one" . 3) ("four" . 4) ("zero" . 4) ("three" . 5))

The next item is:

("five" . 4)

This is larger than the first item in the list, so the first item is removed, and the new item is inserted at the correct position:

(("one" . 3) ("five" . 4) ("four" . 4) ("zero" . 4) ("three" . 5))

This process is repeated for all the remaining items in the list.

Testing the improved version

Let's run the same test data with the improved version:

CL-USER 2 > (time (best-n 5 *test-data* #'> :key #'cdr))

Timing the evaluation of (BEST-N 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR))

User time    =        0.020
Allocation   = 80040 bytes

The improved version is a factor of 10 faster and uses less than a tenth of the storage.

Next: Testing for a prefix

Previous: Infix to prefix

Most popular posts

About me

Feeds