Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

A Better Word Morph Builder

by Limbic~Region (Chancellor)
on Jun 29, 2006 at 15:30 UTC ( #558342=perlquestion: print w/ replies, xml ) Need Help??
Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
In this thread, Ieronim posted a cool script. It finds the shortest bridge between two words where all the words in the bridge can be found in a dictionary and each word differs from its adjacent partners by only one character.

I posted what I believe to be a very fast elegant solution:

#!/usr/bin/perl use strict; use warnings; use Storable; my ($src, $tgt) = @ARGV; die "Usage: $0 <src> <tgt>" if ! defined $src || ! defined $tgt; die "The <src> and <tgt> must be same length" if length($src) != lengt +h($tgt); my $db = retrieve('dictionary.db'); my $path = find_path($src, $tgt, $db->{length($tgt)}); print "$path\n"; sub find_path { my ($src, $tgt, $list, $seen, $work) = @_; @$work = map {key => $_ => path => "$src->$_"}, @{$list->{$src}} i +f ! defined $work; my $next = []; for (@$work) { my ($word, $path) = @{$_}{qw/key path/}; next if $seen->{$word}++; return $path if $word eq $tgt; push @$next, map {key => $_, path => "$path->$_"}, @{$list->{$ +word}}; } return find_path($src, $tgt, $list, $seen, $next) if @$next; return 'path not found'; }

The algorithm is simple. First, create a datastructure that relates all words of a given length to all other words of the same length with a Levenshtein Distance of 1. This is a one time up-front cost. Next, simply perform a breadth first search starting from one end and stopping when the other end is encountered.

With the hard work done up front, the task can be accomplished in about 20 lines of code. Is anyone aware of a more efficient (faster run-time) algorithm? As usual, this is just one of my "I want to learn" questions and there is no real-world speed barrier I am trying to break. It just seemed to me that there should be some way to eliminate some paths during the BFS. I think an evolutionary algorithm technique might work. Paths that increase the Levenshtein Distance to the target word would not be allowed to survive. Even if this does work, I am not sure that the reduction in search space is worth the added distance calculation.

I appreciate any ideas on this.

* The original did allow for the endpoints not to be dictionary words.

Cheers - L~R

Comment on A Better Word Morph Builder
Download Code
Re: A Better Word Morph Builder
by Ieronim (Friar) on Jun 29, 2006 at 16:46 UTC
    I rewrote my script using the pre-caching of word relationships; saying simply, i rewrote it using your 'dictionary.db' datastructure. I'll post the updated variant to my node Play and win the word morph game with the help of Perl :) Here is the benchmarking script comparing speed of your find_path and my modified transform functions: And here is the result of it's execution:
    love-lose-lost-loot-soot-shot-shit love->lore->sore->sort->soot->shot->shit Rate find_path transform find_path 6.76/s -- -94% transform 108/s 1505% --
    This result illustrates that bidirectional search is in general much faster than breadth-first search. But your function looks much prettier and takes only 12 lines of code :)
Re: A Better Word Morph Builder
by Limbic~Region (Chancellor) on Jun 29, 2006 at 16:55 UTC

    One idea I had this morning but did not try out was searching the tree in both directions. Ieronim apparently had the same idea and indicated this is called a bi-directional search. I didn't find the implementation too difficult:

    I have Benchmarked the results to determine if it is indeed faster. The results are as follows:
    BFS_1way 5.93/s -- -64% -93% BFS_2way 16.5/s 179% -- -79% Ieronim_2way 79.2/s 1235% 379% --
    Switching to bi-directional more than doubled the speed but didn't catch Ieronim once he started using the precompiled datastructure. I guess I need to figure out find out how transform() differs from my BFS.

    Update (2006-07-03): After re-writing Ieronim's code I finally discovered what the difference was. In a nutshell, I pull at item off the work queue and test to see if it connects a path. If it does not, I add every item it does connect with to the work queue. Switching where I test for a connection to before they are added to the queue instead of after they are taken off solves the mystery.
    This is not reflected in any of the benchmarks because I believe this dead horse sufficiently beaten.

    Cheers - L~R

      i modified transform() subroutine acording to your recommendation i moved from dereferecing $list hashref to using it as a reference. The speed near doubled :)

      New benchmark results:

      Rate find_path find_path2 transform find_path 6.52/s -- -72% -96% find_path2 23.7/s 263% -- -87% transform 183/s 2702% 673% --
        Ieronim,
        I give. While I understand your admittedly fast code - it is extremely hard to follow. The cleanest code I could come up with using a bi-directional BFS is as fast as your transform() before following my recommendation. That is to say, it is about half as fast as your current version:
        use constant SRC => 0; use constant TGT => 1; sub find_path2 { my ($src, $tgt, $list) = @_; my (@src_work, @tgt_work, %path); for my $dir (SRC, TGT) { my ($word, $work) = $dir == SRC ? ($src, \@src_work) : ($tgt, +\@tgt_work); $path{$word}[$dir] = -1; for (@{$list->{$word}}) { push @$work, $_; $path{$_}[$dir] = $word; } } while (1) { for my $dir (SRC, TGT) { my @next; my $work = $dir == SRC ? \@src_work : \@tgt_work; for my $word (@$work) { return build_path(\%path, $word) if $path{$word}[abs($ +dir - 1)]; for (@{$list->{$word}}) { next if $path{$_}[$dir]; push @next, $_; $path{$_}[$dir] = $word; } } @$work = @next; } return 'Path not found' if ! @src_work && ! @tgt_work; } } sub build_path { my ($tree, $node) = @_; my $path = "-$node"; for my $dir (SRC, TGT) { my $word = $tree->{$node}[$dir]; while ($word ne '-1') { $path = $dir == SRC ? "-$word$path" : "$path-$word"; $word = $tree->{$word}[$dir]; } } return substr($path, 1); }
        Update (2006-07-03): Simply moving return build_path(\%path, $word) if $path{$word}[abs($dir - 1)]; to the inner most for loop and s/word/_/ brings the performance much closer. See the update in this node for an explanation why.

        Cheers - L~R

        Ieronim,
        I have taken the best parts of all the versions and re-written them in what I believe to be relatively clean code. I have not yet benchmarked this but I expect it to be very close to your current best. Here is a list of features:
        • Command line argument parsing
        • Ability to handle non-dictionary words (configurable)
        • Ability to handle absense of Text::LevenshteinXS (automatic)
        • Ability to use different compiled databases (configurable)
        • Ability to create new compiled databases (configurable)
        • Speed and non-duplicated code

        Cheers - L~R

        Ieronim,
        I have finished benchmarking the routines. I didn't include any from this thread or this one but you are welcome to. I did not penalize any routine for processing the dictionary. Additionally, if there were a number of small variations for a routine, I only included the fastest.
        Rate limbic__2 solo____1 limbic__3 limbic__4 ieronim_1 lim +bic__1 limbic__2 6.48/s -- -45% -62% -90% -95% + -96% solo____1 11.8/s 82% -- -31% -82% -92% + -92% limbic__3 17.3/s 166% 46% -- -74% -88% + -89% limbic__4 67.5/s 941% 471% 291% -- -53% + -55% ieronim_1 143/s 2104% 1109% 728% 112% -- + -5% limbic__1 150/s 2218% 1171% 771% 123% 5% + --
        Please note that the winner is just my re-write of Ieronim's code with a few extra bells and whistles. Since the benchmark is extremely long, I have put it in spoiler tags as well as readmore tags:
        Update (2006-07-03): These benchmarks do not reflect the realization I made concerning when to test if a path connection has been made that I noted elsewhere in this thread.

        Cheers - L~R

Re: A Better Word Morph Builder
by Solo (Deacon) on Jun 29, 2006 at 16:58 UTC
    Do you mean Hamming distance?

    Crazy idea. Since one can compute the Hamming distance between the initial words, and this is a mininum length of any possible solution path, you can eliminate any paths of shorter length as solutions. Of course, you'd need to know the paths to know the lengths, so this may not be a realizable optimization, but at least you could avoid testing for the solution until you're past the minimum distance. Just trade space for speed and precompute every path.

    --Solo

    --
    You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
      Solo,
      My EA idea, which is likely flawed, would work as follows:

      Assuming you start with a distance of 7 and the current word is a distance of 3 - any paths that would increase this distance above 3 wouldn't be allowed to survive.

      Do you mean Hamming distance?

      Since we are only using valid words, no path could ever be shorter than the Hamming distance.

      ...but at least you could avoid testing for the solution until you're past the minimum distance.

      love - shit (Hamming Distance = 4) love-lore-sore-sort-soot-shot-shit (Actual Distance = 6)

      Not testing may seem like a win but I doubt it. This is because there is no significant difference between $cur_dist > $min_dist and $word eq $tgt.

      Just trade space for speed and precompute every path

      If you mean precompute every path from every word to any other word it can reach, this certainly would be faster runtime. Unfortunately, I doubt this would be practical. If you mean precompute all paths from the source and then only look at the paths that are beyond the minimum distance, this is a gamble because you won't know how many paths exist beyond the actual solution. The current method is to abort searching when a path is found.

      Cheers - L~R

Re: A Better Word Morph Builder
by Solo (Deacon) on Jun 29, 2006 at 19:44 UTC
    Well, it certainly takes a while to build the db the first time... (over 15 mins so far... maybe LevenshteinXS isn't quite right on Win?) Here's my offing that was written for clarity, rather than efficiency. Then again, maybe I'll say it's optimized for startup time, rather than solution time... who likes those long 'loading...' screens, anyway?

    Update:I should have realized that I'm using the yawl-0.3.2 word list, and the time to build the db is probably affected by that!

    use strict; use warnings; use List::Compare; use Benchmark qw( cmpthese ); my $lword = shift || 'love'; my $rword = shift || 'hate'; die unless length($rword) == length($lword); my $length = length($lword); my $wordlist; open(FH,'word.list') or die; while(<FH>) { chomp; undef($wordlist->{lc $_}) if length($_) == $length; } print solo($lword,$rword); #cmpthese( ... ); sub solo { my ( $rdepth,$ldepth,$rex,$lex ) = (0,0,1,1); my $left = { $_[0] => [$_[0]] }; my $right = { $_[1] => [$_[1]] }; while ( 1 ) { # compare the intersection of the leaf nodes my $lc = List::Compare->new( { lists => [ [keys %$left], [keys %$right] ], accelerated => 1, unsorted => 1, } ); my @int = $lc->get_intersection(); if ( @int ) { my $solve = shift @int; pop @{$right->{$solve}}; return join("\n", @{$left->{$solve}}, reverse @{$right->{$ +solve}}) . "\n"; } # pick a side to expand the search in if ( $lex && $rdepth > $ldepth ) { $lex = expandTree($left,++$ldepth); } elsif ( $rex ) { $rex = expandTree($right,++$rdepth); } else { return "No solution."; } } } sub expandTree { my ($tree, $depth) = @_; my ($word, $path, $expanded); while ( ($word, $path) = each %$tree ) { if ( @$path == $depth ) { for my $i (0..(length($word)-1) ) { for my $letter ( 'a'..'z' ) { my $try = $word; substr($try,$i,1) = $letter; next if exists $tree->{$try}; if ( exists $wordlist->{$try} ) { $tree->{$try} = [ @{$tree->{$word}}, $try ]; $expanded = 1; } } } } } return $expanded; }

    --Solo

    --
    You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
      I have benchmarked your method. If i am not mistaken, is's another example of bidirectional search, isn't it?

      Benchmark results:

      Rate find_path solo find_path2 transform find_path 5.91/s -- -62% -75% -97% solo 15.7/s 166% -- -35% -92% find_path2 24.1/s 308% 53% -- -87% transform 188/s 3084% 1097% 681% --
      Note: Your solution seems to be suitable only for basic English charset:
      for my $letter ( 'a'..'z' ) {
      And what about the Unicode? ;)
        Thanks for the benchmark! You're right, it is bi-directional. And you caught me, 'a'..'z' was a precomputed optimization based on the set of characters found in my wordlist. ;) Does 2of12 include additional characters?


        --Solo

        --
        You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
      Solo,
      Thanks! I also benchmarked your solution. You can see how you did here.

      Cheers - L~R

Re: A Better Word Morph Builder
by bobf (Monsignor) on Jun 29, 2006 at 20:23 UTC

    FWIW, similar solutions to word ladder puzzles can be found in the previous discussions Solving word ladder puzzles and Solve Word Ladders. While the implementations may be slightly different, the end goal is the same (well, nearly - the focus is solving the puzzle in a set number of steps rather than the smallest number of them).

      bobf,
      Thanks for the links! It appears that word ladder differs slightly from word morph in that, according to Wikipedia, letters can be removed or added in addition to just changing them. In any case, I would love to add some of the approaches in these links to the list of benchmarks except they don't lend themselves to just copy/pasting a function.

      Cheers - L~R

Re: A Better Word Morph Builder
by ambrus (Abbot) on Oct 11, 2011 at 18:32 UTC
      ambrus,
      The only notable difference between the two is this thread deals with the shortest bridge/ladder as requested by Ieronim while the QoTW allows any length bridge/ladder. If you see the benchmarks in the thread, we optimized the heck out of it.

      Cheers - L~R

        Like Daniel Martin notes in the analysis of the qotw solutions, many of the solutions (though not mine) found the shortest ladder even though the task didn't ask for that. This is no wonder: qotw is a friendly list where the intent is to learn rather than to compete.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://558342]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (3)
As of 2014-07-24 02:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (156 votes), past polls