About me

About me

Feeds

RSS feed

Finding the n best items in a list

3rd March 2015

For several of my web sites I wanted to show a small number of the most popular items in a particular category; for example, the top ten search terms, or the top 5 products. This let me to thinking about how to do this as efficiently as possible.

In other words, suppose you've got a list of items, each of which is associated with a popularity score; for example:

(("Greyhound" . 10) ("Wolfhound" . 12) ("Labrador" . 23) ("Poodle" . 7) ...)

What's the most efficient way of getting the best n items?

Simple solution

If we want the top n elements of the list, by score, the obvious simple-minded solution is to sort the list first, according to the scores, and then find the top five elements:

We first define a procedure to give the first n elements of a list as follows:

(defun first-n (n list)
  (if (> (length list) n)
    (subseq list 0 n)
    list))

The solution is then achieved by the following procedure best-n*:

(defun best-n* (n list predicate &key key)
  (first-n n (sort (copy-seq list) predicate :key key)))

However, we're sorting the whole list and then throwing away all but n of the items, which is very wasteful. There must be a more efficient way.

Testing the simple version

First let's create some test data:

(defparameter *test-data*
  (let (result)
    (dotimes (n 100000)
      (let ((string (format nil "~r" n)))
        (push (cons string (length string)) result)))
    (nreverse result)))

This creates an association list of the names of the first 100000 integers, and their lengths. So the first five items are:

(("zero" . 4) ("one" . 3) ("two" . 3) ("three" . 5) ("four" . 4) ...)

Testing the simple-minded version:

CL-USER 1 > (time (best-n* 5 *test-data* #'> :key #'cdr))

Timing the evaluation of (BEST-N* 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR))

User time    =        0.206
Allocation   = 1300100 bytes  

Improved version

A better approach is to scan through all the items, maintaining a list of just the best n items.

To do this we first define a helper function, insert, that inserts a new item into a sorted list at the correct position:

(defun insert (item list predicate &key key)
  (cond
   ((null list) 
    (list item))
   ((not (funcall predicate (funcall key item) (funcall key (car list)))) 
    (cons item list))
   (t 
    (cons (car list) (insert item (cdr list) predicate :key key)))))

Now here's the improved version of best-n. This works as follows:

It maintains a list of the top n items found so far, in result, with the lowest at the top. For each item in the list:

  • If there are fewer than n items in result, insert the new item at the correct sorted position in result.
  • Otherwise if the item beats the lowest item in result, drop the lowest item and insert it in the correct position.
  • Otherwise ignore it.

Here's the routine:

(defun best-n (n list predicate &key (key #'identity))
  (let ((r 0) result)
    (dolist (x list)
      (cond
       ((< r n) (setq result (insert x result predicate :key key)) (incf r))
       ((funcall predicate (funcall key x) (funcall key (car result)))
        (setq result (cdr (insert x result predicate :key key))))))
    (nreverse result)))

Here's an example of how it works:

After 5 items have been scanned result contains the items in reverse order:

(("two" . 3) ("one" . 3) ("four" . 4) ("zero" . 4) ("three" . 5))

The next item is:

("five" . 4)

This is larger than the first item in the list, so the first item is removed, and the new item is inserted at the correct position:

(("one" . 3) ("five" . 4) ("four" . 4) ("zero" . 4) ("three" . 5))

This process is repeated for all the remaining items in the list.

Testing the improved version

Let's run the same test data with the improved version:

CL-USER 2 > (time (best-n 5 *test-data* #'> :key #'cdr))

Timing the evaluation of (BEST-N 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR))

User time    =        0.020
Allocation   = 80040 bytes

The improved version is a factor of 10 faster and uses less than a tenth of the storage.


blog comments powered by Disqus