About me

About me

Feeds

RSS feed

Number-aware string sort

17th November 2017

Sorting using the standard Lisp string comparison functions, such as string< and the case-insensitive version string-lessp, arranges strings containing sequences of digits in alphabetical order, ignoring the numbers represented by the digits. For example, if we have a list of strings *squares* equal to:

("Sq36" "Sq64" "Sq81" "Sq4" "Sq25" "Sq1" "Sq100" "Sq16" "Sq9" "Sq49")

then:

(sort *squares* #'string-lessp)

gives:

("Sq1" "Sq100" "Sq16" "Sq25" "Sq36" "Sq4" "Sq49" "Sq64" "Sq81" "Sq9")

which is the order that the strings would appear in a dictionary.

However, for many applications, such as sorting filenames, we might prefer a more logical sort order which takes account of the numeric ordering of the numbers represented by the digits. How much work is it to produce versions of the string comparison routines to accomodate this feature?

For example, we want a routine string-digits-lessp which treats sequences of digits within the strings as numbers, and compares these numerically, so:

(sort *squares* #'string-digits-lessp)

should give:

("Sq1" "Sq4" "Sq9" "Sq16" "Sq25" "Sq36" "Sq49" "Sq64" "Sq81" "Sq100")

At first sight it looks as if we might have to parse the sequence of digits, convert them into the corresponding integer, and then compare these. However, what size integer do we need to store the value? The strings could contain a sequence of 100 digits.

Before I give the answer I came up with you may like to try the problem yourself.

The solution

Fortunately the solution turns out to be much easier than you might expect, and requires a fairly small modification to the standard string comparison routine.

As a first step in approaching this problem I wrote my own simple implementation of string-lessp:

(defun string-lessp* (s1 s2 &key (start1 0) (start2 0)
                         (end1 (length s1)) (end2 (length s2)))
  (let ((lt t) (gt nil) (eq nil))
    (loop
     (when (and (= start1 end1) (= start2 end2)) (return eq))
     (when (= start1 end1) (return lt))
     (when (= start2 end2) (return gt))
     (when (char-lessp (char s1 start1) (char s2 start2)) (return lt))
     (when (char-greaterp (char s1 start1) (char s2 start2)) (return gt))
     (incf start1) (incf start2))))

My version string-lessp* isn't quite identical to string-lessp; for example, it returns t rather than the index of the first mismatch. But for practical purposes it's equivalent, and on LispWorks it appears to run as quickly as the system function.

To accommodate strings of digits the approach is as follows:

  • As we are scanning the strings, testing the alphabetic ordering of pairs of characters, check whether the characters we are comparing are both digits.
  • If so, scan past the sequence of digits in each string, until we either get to the end of the string, or a non-digit character.
  • If one sequence of digits is longer than the other one, this is then the greater string, and we can return immediately.
  • Otherwise resume testing the alphabetic ordering of pairs of characters from the start of each sequence of digits.

Here's the implementation; it uses two additional pointers d1 and d2 to span each sequence of digits:

(defun string-digits-lessp (s1 s2 &key (start1 0) (start2 0)
                               (end1 (length s1)) (end2 (length s2)))
  (let ((lt t) (gt nil) (eq nil) (d1 0) (d2 0))
    (loop
     (when (and (= start1 end1) (= start2 end2)) (return eq))
     (when (= start1 end1) (return lt))
     (when (= start2 end2) (return gt))
     ;;
     (when 
         (and
          (> start1 d1) 
          (> start2 d2) 
          (digit-char-p (char s1 start1)) 
          (digit-char-p (char s2 start2)))
       (setq d1 (1+ start1)) (setq d2 (1+ start2))
       (loop 
        (when (or (= d1 end1) (not (digit-char-p (char s1 d1)))) (return))
        (incf d1))
       (loop 
        (when (or (= d2 end2) (not (digit-char-p (char s2 d2)))) (return))
        (incf d2)) 
       (when (> (- d1 start1) (- d2 start2)) (return gt))
       (when (< (- d1 start1) (- d2 start2)) (return lt)))
     ;;
     (when (char-lessp (char s1 start1) (char s2 start2)) (return lt))
     (when (char-greaterp (char s1 start1) (char s2 start2)) (return gt))
     (incf start1) (incf start2))))

Testing it:

> (sort *squares* #'string-digits-lessp)
("Sq1" "Sq4" "Sq9" "Sq16" "Sq25" "Sq36" "Sq49" "Sq64" "Sq81" "Sq100")

It also works for strings containing multiple sequences of digits, such as:

> (string-digits-lessp "Volume100File9" "Volume100File10")
T

blog comments powered by Disqus