Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Re: A Better Word Morph Builder

by Limbic~Region (Chancellor)
on Jun 29, 2006 at 16:55 UTC ( #558379=note: print w/ replies, xml ) Need Help??


in reply to A Better Word Morph Builder

One idea I had this morning but did not try out was searching the tree in both directions. Ieronim apparently had the same idea and indicated this is called a bi-directional search. I didn't find the implementation too difficult:

#!/usr/bin/perl use strict; use warnings; use Storable; my ($src, $tgt) = @ARGV; die "Usage: $0 <src> <tgt>" if ! defined $src || ! defined $tgt; die "The <src> and <tgt> must be same length" if length($src) != lengt +h($tgt); my $db = retrieve('dictionary.db'); my $path = find_path($src, $tgt, $db->{length($tgt)}); print "$path\n"; sub find_path { my ($src, $tgt, $list, $search) = @_; for my $pos (qw/src tgt/) { my $dir = $pos eq 'src' ? $src : $tgt; my $opp = $pos eq 'src' ? 'tgt' : 'src'; if (! defined $search->{$pos}{work}) { for (@{$list->{$dir}}) { push @{$search->{$pos}{work}}, {key => $_, path => "$d +ir-$_"}; $search->{$pos}{term}{$_} = $search->{$pos}{work}[-1]; } $search->{$pos}{term}{$dir} = {key => $dir, path => $dir}; } my ($work, $next) = ($search->{$pos}{work}, []); while (@$work) { my $node = shift @$work; my ($word, $path) = @{$node}{qw/key path/}; next if $search->{$pos}{seen}{$word}++; if ($search->{$opp}{term}{$word}) { my @cur_path = split /-/, $path; my @con_path = split /-/, $search->{$opp}{term}{$word} +{path}; return $pos eq 'tgt' ? join '-', @con_path, @cur_path[reverse 0 .. $#cu +r_path - 1] : join '-', @cur_path, @con_path[reverse 0 .. $#co +n_path - 1]; } for (@{$list->{$word}}) { push @$next, {key => $_, path => "$path-$_"}; $search->{$pos}{term}{$_} = $next->[-1]; } } $search->{$pos}{work} = $next; } return 'path not found' if ! @{$search->{src}{work}} || ! @{$searc +h->{tgt}{work}}; return find_path($src, $tgt, $list, $search); }
I have Benchmarked the results to determine if it is indeed faster. The results are as follows:
BFS_1way 5.93/s -- -64% -93% BFS_2way 16.5/s 179% -- -79% Ieronim_2way 79.2/s 1235% 379% --
Switching to bi-directional more than doubled the speed but didn't catch Ieronim once he started using the precompiled datastructure. I guess I need to figure out find out how transform() differs from my BFS.

Update (2006-07-03): After re-writing Ieronim's code I finally discovered what the difference was. In a nutshell, I pull at item off the work queue and test to see if it connects a path. If it does not, I add every item it does connect with to the work queue. Switching where I test for a connection to before they are added to the queue instead of after they are taken off solves the mystery.
This is not reflected in any of the benchmarks because I believe this dead horse sufficiently beaten.

Cheers - L~R


Comment on Re: A Better Word Morph Builder
Select or Download Code
Replies are listed 'Best First'.
Re^2: A Better Word Morph Builder
by Ieronim (Friar) on Jun 29, 2006 at 18:51 UTC
    i modified transform() subroutine acording to your recommendation i moved from dereferecing $list hashref to using it as a reference. The speed near doubled :)

    New benchmark results:

    Rate find_path find_path2 transform find_path 6.52/s -- -72% -96% find_path2 23.7/s 263% -- -87% transform 183/s 2702% 673% --
      Ieronim,
      I give. While I understand your admittedly fast code - it is extremely hard to follow. The cleanest code I could come up with using a bi-directional BFS is as fast as your transform() before following my recommendation. That is to say, it is about half as fast as your current version:
      use constant SRC => 0; use constant TGT => 1; sub find_path2 { my ($src, $tgt, $list) = @_; my (@src_work, @tgt_work, %path); for my $dir (SRC, TGT) { my ($word, $work) = $dir == SRC ? ($src, \@src_work) : ($tgt, +\@tgt_work); $path{$word}[$dir] = -1; for (@{$list->{$word}}) { push @$work, $_; $path{$_}[$dir] = $word; } } while (1) { for my $dir (SRC, TGT) { my @next; my $work = $dir == SRC ? \@src_work : \@tgt_work; for my $word (@$work) { return build_path(\%path, $word) if $path{$word}[abs($ +dir - 1)]; for (@{$list->{$word}}) { next if $path{$_}[$dir]; push @next, $_; $path{$_}[$dir] = $word; } } @$work = @next; } return 'Path not found' if ! @src_work && ! @tgt_work; } } sub build_path { my ($tree, $node) = @_; my $path = "-$node"; for my $dir (SRC, TGT) { my $word = $tree->{$node}[$dir]; while ($word ne '-1') { $path = $dir == SRC ? "-$word$path" : "$path-$word"; $word = $tree->{$word}[$dir]; } } return substr($path, 1); }
      Update (2006-07-03): Simply moving return build_path(\%path, $word) if $path{$word}[abs($dir - 1)]; to the inner most for loop and s/word/_/ brings the performance much closer. See the update in this node for an explanation why.

      Cheers - L~R

      Ieronim,
      I have finished benchmarking the routines. I didn't include any from this thread or this one but you are welcome to. I did not penalize any routine for processing the dictionary. Additionally, if there were a number of small variations for a routine, I only included the fastest.
      Rate limbic__2 solo____1 limbic__3 limbic__4 ieronim_1 lim +bic__1 limbic__2 6.48/s -- -45% -62% -90% -95% + -96% solo____1 11.8/s 82% -- -31% -82% -92% + -92% limbic__3 17.3/s 166% 46% -- -74% -88% + -89% limbic__4 67.5/s 941% 471% 291% -- -53% + -55% ieronim_1 143/s 2104% 1109% 728% 112% -- + -5% limbic__1 150/s 2218% 1171% 771% 123% 5% + --
      Please note that the winner is just my re-write of Ieronim's code with a few extra bells and whistles. Since the benchmark is extremely long, I have put it in spoiler tags as well as readmore tags: Update (2006-07-03): These benchmarks do not reflect the realization I made concerning when to test if a path connection has been made that I noted elsewhere in this thread.

      Cheers - L~R

        The results of the same benchmark (!!) on my machine are here:
        Rate limbic__2 solo____1 limbic__3 limbic__4 limbic__1 ier +onim_1 limbic__2 5.64/s -- -63% -77% -94% -97% + -97% solo____1 15.3/s 172% -- -37% -83% -92% + -92% limbic__3 24.2/s 329% 58% -- -74% -87% + -88% limbic__4 91.7/s 1528% 499% 279% -- -52% + -54% limbic__1 190/s 3267% 1140% 684% 107% -- + -5% ieronim_1 200/s 3449% 1207% 727% 118% 5% + --
        I can't explain the difference :))
      Ieronim,
      I have taken the best parts of all the versions and re-written them in what I believe to be relatively clean code. I have not yet benchmarked this but I expect it to be very close to your current best. Here is a list of features:
      • Command line argument parsing
      • Ability to handle non-dictionary words (configurable)
      • Ability to handle absense of Text::LevenshteinXS (automatic)
      • Ability to use different compiled databases (configurable)
      • Ability to create new compiled databases (configurable)
      • Speed and non-duplicated code

      Cheers - L~R

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://558379]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (16)
As of 2015-07-31 15:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (279 votes), past polls