Challenge: Find median string in a list

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
This challenge was posed last night in #perl IRC on freenode.

Given:

A large list (circa 15K) of unique strings
Composed of the character set [a-z0-9]
With lengths varying from 1 to 24 characters

Determine which string is the median string and the distance from each string to the median string.

For the purposes of this challenge, distance is the Levenshtein Distance. The median string is the string where the distance between itself and the furthest away string in the list is minimized. In other words, if you were to calculate the distance of the string furthest away for every string in the list, the median string would be the one with the shortest distance.

I recognize that my use of median is likely incorrect. In the event of a conflict, refer to the definition here as the intended one. There may be more than one string that fits the defintion of median string, in this case only 1 such string is required.

If you come up with a great solution that is not very memory efficient feel free to post it but if you can, limit memory consumption to 750MB. I have a feeling a trie may provide for a better solution, but the best I can come up with so far is below:

#!/usr/bin/perl
use strict;
use warnings;
use Text::LevenshteinXS;
use constant WORD => 0;
use constant LEN  => 1;
use constant MAX  => 10_000_000;

my @word;
while (<DATA>) {
    tr/\r\n//d;
    push @word, [$_, 0];
}
@word = sort {length($b->[WORD]) <=> length($a->[WORD]) || $a->[WORD] 
+cmp $b->[WORD]} @word;

my ($min_word, $min_dist) = ('', MAX);
for my $i (0 .. $#word) {
    next if $word[$i][LEN] >= $min_dist;
    for my $j (0 .. $#word) {
        next if $i == $j;
        my $dist = distance($word[$i][WORD], $word[$j][WORD]);        
        $word[$i][LEN] = $dist if $dist > $word[$i][LEN];
        $word[$j][LEN] = $dist if $dist > $word[$j][LEN];
        last if $dist >= $min_dist;
    }
    ($min_word, $min_dist) = ($word[$i][WORD], $word[$i][LEN]) if $wor
+d[$i][LEN] < $min_dist;
}
for (@word) {
    # Obviously this would need to be stored
    my $dist = distance($_, $min_word);
}
print "$min_word\t$min_dist\n";
__DATA__
[download]

Update: ysth suggested a number of optimizations that I had already tried. He had the sense to try them together which makes it blazingly fast. While the algorithm is mine, credit for the optimizations go to him.

Cheers - L~R

Comment on Challenge: Find median string in a list Select or Download Code

Replies are listed 'Best First'.
Re: Challenge: Find median string in a list by blokhead (Monsignor) on Jul 04, 2007 at 19:18 UTC
This problem in graph theory is known as the graph center problem. So here's a cop-out solution using Graph.pm: use strict; use warnings; use Text::LevenshteinXS; use Graph; my $data = lc qq[Lorem ipsum dolor sit amet, consectetuer adipiscing e +lit. Mauris vulputate nunc. Pellentesque pretium. Nam tortor. Vivamus sed + eros ut arcu consectetuer auctor. Nunc ultrices nisi. Phasellus congue se +m quis nulla. Sed consequat. Etiam consectetuer. Sed ultricies libero at ma +gna commodo nonummy. Sed quis arcu. Integer massa lectus, ultrices non, +faucibus eget, sagittis eget, lacus. ]; my @words = $data =~ m/\w+/g; my $g = Graph::Undirected->new; for my $i (0 .. $#words) { for my $j ($i+1 .. $#words) { $g->add_weighted_edge( @words[$i,$j], distance(@words[$i,$j]) ); } } my $apsp = $g->all_pairs_shortest_paths; my ($center, $radius) = (undef, 1_000_000); for (@words) { my $x = $apsp->vertex_eccentricity($_); ($center, $radius) = ($_, $x) if $x < $radius; } print "$center ($radius)\n"; [download] I'm sure there should be a more efficient algorithm for graph center than computing all pairs shortest paths (although it makes a difference if the bottleneck is computing Levenshtein distance or dealing with the big graph), but this is a starting point. Especially interesting would be an algorithm that explored the graph in an "on-line" fashion to avoid precomputing all the pairwise distances. BTW, Graph.pm does offer a `center_vertices` method for the APSP object, but it appears to be broken (at least in my version of the module). blokhead	[reply] [d/l] [select]
Re^2: Challenge: Find median string in a list by Limbic~Region (Chancellor) on Jul 05, 2007 at 01:05 UTC
blokhead, This solution took nearly 5 hours (compared to 0.28 seconds). I am not sure how well it would have done if center_vertices would have worked. Cheers - L~R	[reply]
Re: Challenge: Find median string in a list by ysth (Canon) on Jul 04, 2007 at 22:16 UTC
The algorithm you have is more than adequate. Putting back in the remaining optimization and fixing the >'s gives: #!/usr/bin/perl use strict; use warnings; use Text::LevenshteinXS; use constant WORD => 0; use constant LEN => 1; use constant MAX => 10_000_000; my @word; while (<DATA>) { tr/\r\n//d; push @word, [$_, 0]; } @word = sort {$a->[WORD] cmp $b->[WORD]} @word; my ($min_word, $min_dist) = ('', MAX); for my $i (0 .. $#word) { next if $word[$i][LEN] >= $min_dist; for my $j (0 .. $#word) { my $dist = distance($word[$i][WORD], $word[$j][WORD]); $word[$i][LEN] = $dist if $dist > $word[$i][LEN]; $word[$j][LEN] = $dist if $dist > $word[$j][LEN]; last if $dist >= $min_dist; } ($min_word, $min_dist) = ($word[$i][WORD], $word[$i][LEN]) if $wor +d[ +$i][LEN] < $min_dist; } print "$min_word\t$min_dist\n"; __DATA__ [download] Thinking about ways to speed it up, it seemed to me that the inner loop should first compare against words suspected to be far from the center, so as to break out as early as possible. I thought about picking random combinations of words and saving the pairs with the greatest distance. Or picking a word, finding the word with the greatest distance from it, finding the word with the greatest distance from that, repeat, then using the last pair of words to test against first in the inner loop. But it turns out that, with the optimizations you had both in, sorting the wordlist by decreasing word length gets the number of distance() calls down to just 86006 on your sample data with 14300 words, in a very short period of time: `@word = sort {length($b->[WORD]) <=> length($a->[WORD])} @word;` [download]	[reply] [d/l] [select]
Re^2: Challenge: Find median string in a list by Limbic~Region (Chancellor) on Jul 05, 2007 at 01:26 UTC
ysth, At one time or another, I had tried all the tweaks you had provided and wrote them off as micro-optimizations. Combined together they really do give impressive results. Thanks for suggesting I have a second look. Cheers - L~R	[reply]
Re: Challenge: Find median string in a list by ysth (Canon) on Jul 04, 2007 at 18:34 UTC
Looks pretty good to me. Your > could be >= in the "next if" and "last if"s. Update: meant >/>=, not </<=.	[reply]
Re^2: Challenge: Find median string in a list by Limbic~Region (Chancellor) on Jul 04, 2007 at 19:06 UTC
ysth, It turns out one of those was not safe for the reason listed below: The N^2 is reduce to N^2 / 2 by assuming that every string up to the current string has already been tested against the current string and all that need be tested is the current string against all later strings. If you know that the current string can't be the median string, it is not safe to skip it because it may also disqualify later strings. Without doing the work, you can't know for sure. Cheers - L~R	[reply]
Re^3: Challenge: Find median string in a list by ysth (Canon) on Jul 04, 2007 at 19:56 UTC
As we discussed in the chatterbox, and you determined experimentally, it's better to remove that assumption and make the inner loop go over all the strings, but keep the bailing out when the previously determined minimum maximum distance (I chortle as I write that) is met or exceeded. The current iteration of your code still has the "next if" commented out; I don't see a reason for this.	[reply]
Re^4: Challenge: Find median string in a list by Limbic~Region (Chancellor) on Jul 05, 2007 at 01:09 UTC


"be consistent"
	PerlMonks