Friday, September 12, 2008

Sorting strings with embedded numbers

Suppose you have a list of file names:
picture1.txt
picture2.txt
picture3.txt
picture4.txt
picture5.txt
picture6.txt
picture7.txt
picture8.txt
picture9.txt
picture10.txt

Lexicographic order will sort picture10.txt before picture2.txt, which is not what you want. The correct solution is to recognize that the strings contain numbers, and special-case their comparison.

Doug Coleman wrote a Factor solution last December.

Today I was revisiting his code and noticed that it could be simplified significantly using Chris Double's amazing parser expression grammars library:
USING: peg.ebnf math.parser kernel assocs sorting ;
IN: sorting.human

: find-numbers ( string -- seq )
[EBNF Result = ([0-9]+ => [[ string>number ]] | (!([0-9]) .)+)* EBNF] ;

: human-sort ( seq -- seq' )
[ dup find-numbers ] { } map>assoc sort-values keys ;

I checked the code into the repository; you can USE: sorting.human to use it.

A bit of explanation is in order.

The PEG [0-9]+ => [[ string>number ]] matches one or more digits (in the range 0 to 9), and applies the action string>number (a Factor word which does what the name says).

The PEG (!([0-9]) .)+ matches one or more characters which are not digits.

The PEG idiom (A | B)* matches a sequence of zero or more A's and B's, which is exactly what we do here with the above two PEGs.

The PEG parser is wrapped in [EBNF ... EBNF], which is "inline PEG notation"; it expands into code which takes a string as input and returns a parse result as output. So calling find-numbers splits out any embedded numbers in the string.

The phrase [ dup find-numbers ] { } map>assoc builds an association list from a sequence where each sequence element is associated with a tokenized string. We use sort-values to sort the association list by comparing values, then use keys to extract the sequence elements, now in sorted order. Beautiful.

I solved the same problem in Java several years ago. The result was not pretty at all:
public static int compareStrings(String str1, String str2, boolean ignoreCase)
{
char[] char1 = str1.toCharArray();
char[] char2 = str2.toCharArray();

int len = Math.min(char1.length,char2.length);

for(int i = 0, j = 0; i < len && j < len; i++, j++)
{
char ch1 = char1[i];
char ch2 = char2[j];
if(Character.isDigit(ch1) && Character.isDigit(ch2)
&& ch1 != '0' && ch2 != '0')
{
int _i = i + 1;
int _j = j + 1;

for(; _i < char1.length; _i++)
{
if(!Character.isDigit(char1[_i]))
break;
}

for(; _j < char2.length; _j++)
{
if(!Character.isDigit(char2[_j]))
break;
}

int len1 = _i - i;
int len2 = _j - j;
if(len1 > len2)
return 1;
else if(len1 < len2)
return -1;
else
{
for(int k = 0; k < len1; k++)
{
ch1 = char1[i + k];
ch2 = char2[j + k];
if(ch1 != ch2)
return ch1 - ch2;
}
}

i = _i - 1;
j = _j - 1;
}
else
{
if(ignoreCase)
{
ch1 = Character.toLowerCase(ch1);
ch2 = Character.toLowerCase(ch2);
}

if(ch1 != ch2)
return ch1 - ch2;
}
}

return char1.length - char2.length;
}

This code really sucks. The solution is very imperative and low-level; to understand what it does, you have to execute the code in your head and keep track of the various loops, early exits, and mutable local variables. I no longer try to solve problems in these low-level terms. With Factor I can solve the problem without "playing CPU". A lot of people, when they're first starting out with Factor, try to write code much like the above using a stack. Predictably, the stack shuffling gets out of hand. While the above algorithm could be implemented quite easily in Factor using locals, it would still be 20 times longer and more complicated than the high-level solution using PEGs.

Understanding the various abstractions and libraries which are available is key to grasping Factor: collections, generic words, fry, locals, macros, memoization, PEGs, the prettyprinter, and so on. Making effective use of these tools can reduce the amount of work required to solve a problem by an order of magnitude. This is why I invest a lot of effort into writing documentation, as well as the help system itself. Making Factor easy to learn and explore is a high priority for me.

In the Java community, there is a certain derision against "API monkeys": programmers who always reach for a library without being able to solve a problem using just the language itself. Factor programmers on the other hand should be API monkeys and proud of it!

3 comments:

Wolf550e said...

You did not say anything about space and time complexities of the two pieces of code. Also, the EBNF thing looks comparable to built-in regular expressions support offered by other languages. At least throw in some timings on some sample data, against jvm 1.6 on the same machine.

I realize that if I'm curious I should be finding that out myself, but still - talking about code style without even saying what is the cost of the achieved terseness/readability is not as meaningful. The post will sounds so much better if you add something like "and the shorter Factor version outperforms the java code" or at least "is within x constant factor of the java code, with the same asymptotic behavior".

Anonymous said...

[ dup find-numbers ] { } map>assoc sort-values keys

This is known as sort_by in Ruby. Sorting a list of strings by length:

strings.sort_by(&:length)

human-sort with sort-by:

: human-sort ( seq -- seq' )
[ find-numbers ] sort-by ;

lee said...

Speaking of Ruby, a first guess at what you're trying to do would be (in that language)

%w[picture3 picture20 picture1004 picture1].sort_by {|item| item.scan(/[0-9]+|[^0-9]+/).map {|s| s[/[^0-9]/] ? s : s.to_i} }

so you split each string up into runs of numbers or runs of non-numbers, numerify the numbers, and sort array-wise (component by component).