Finding the n best items in a list
3rd March 2015
For several of my web sites I wanted to show a small number of the most popular items in a particular category; for example, the top ten search terms, or the top 5 products. This let me to thinking about how to do this as efficiently as possible.
In other words, suppose you've got a list of items, each of which is associated with a popularity score; for example:
(("Greyhound" . 10) ("Wolfhound" . 12) ("Labrador" . 23) ("Poodle" . 7) ...)
What's the most efficient way of getting the best n items?
Simple solution
If we want the top n elements of the list, by score, the obvious simple-minded solution is to sort the list first, according to the scores, and then find the top five elements:
We first define a procedure to give the first n elements of a list as follows:
(defun first-n (n list) (if (> (length list) n) (subseq list 0 n) list))
The solution is then achieved by the following procedure best-n*:
(defun best-n* (n list predicate &key key) (first-n n (sort (copy-seq list) predicate :key key)))
However, we're sorting the whole list and then throwing away all but n of the items, which is very wasteful. There must be a more efficient way.
Testing the simple version
First let's create some test data:
(defparameter *test-data* (let (result) (dotimes (n 100000) (let ((string (format nil "~r" n))) (push (cons string (length string)) result))) (nreverse result)))
This creates an association list of the names of the first 100000 integers, and their lengths. So the first five items are:
(("zero" . 4) ("one" . 3) ("two" . 3) ("three" . 5) ("four" . 4) ...)
Testing the simple-minded version:
CL-USER 1 > (time (best-n* 5 *test-data* #'> :key #'cdr)) Timing the evaluation of (BEST-N* 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR)) User time = 0.206 Allocation = 1300100 bytes
Improved version
A better approach is to scan through all the items, maintaining a list of just the best n items.
To do this we first define a helper function, insert, that inserts a new item into a sorted list at the correct position:
(defun insert (item list predicate &key key) (cond ((null list) (list item)) ((not (funcall predicate (funcall key item) (funcall key (car list)))) (cons item list)) (t (cons (car list) (insert item (cdr list) predicate :key key)))))
Now here's the improved version of best-n. This works as follows:
It maintains a list of the top n items found so far, in result, with the lowest at the top. For each item in the list:
- If there are fewer than n items in result, insert the new item at the correct sorted position in result.
- Otherwise if the item beats the lowest item in result, drop the lowest item and insert it in the correct position.
- Otherwise ignore it.
Here's the routine:
(defun best-n (n list predicate &key (key #'identity)) (let ((r 0) result) (dolist (x list) (cond ((< r n) (setq result (insert x result predicate :key key)) (incf r)) ((funcall predicate (funcall key x) (funcall key (car result))) (setq result (cdr (insert x result predicate :key key)))))) (nreverse result)))
Here's an example of how it works:
After 5 items have been scanned result contains the items in reverse order:
(("two" . 3) ("one" . 3) ("four" . 4) ("zero" . 4) ("three" . 5))
The next item is:
("five" . 4)
This is larger than the first item in the list, so the first item is removed, and the new item is inserted at the correct position:
(("one" . 3) ("five" . 4) ("four" . 4) ("zero" . 4) ("three" . 5))
This process is repeated for all the remaining items in the list.
Testing the improved version
Let's run the same test data with the improved version:
CL-USER 2 > (time (best-n 5 *test-data* #'> :key #'cdr)) Timing the evaluation of (BEST-N 5 *TEST-DATA* (FUNCTION >) :KEY (FUNCTION CDR)) User time = 0.020 Allocation = 80040 bytes
The improved version is a factor of 10 faster and uses less than a tenth of the storage.
blog comments powered by Disqus