Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re: A Better Word Morph Builder

by Limbic~Region (Chancellor)
on Jun 29, 2006 at 16:55 UTC ( [id://558379]=note: print w/replies, xml ) Need Help??


in reply to A Better Word Morph Builder

One idea I had this morning but did not try out was searching the tree in both directions. Ieronim apparently had the same idea and indicated this is called a bi-directional search. I didn't find the implementation too difficult:

#!/usr/bin/perl use strict; use warnings; use Storable; my ($src, $tgt) = @ARGV; die "Usage: $0 <src> <tgt>" if ! defined $src || ! defined $tgt; die "The <src> and <tgt> must be same length" if length($src) != lengt +h($tgt); my $db = retrieve('dictionary.db'); my $path = find_path($src, $tgt, $db->{length($tgt)}); print "$path\n"; sub find_path { my ($src, $tgt, $list, $search) = @_; for my $pos (qw/src tgt/) { my $dir = $pos eq 'src' ? $src : $tgt; my $opp = $pos eq 'src' ? 'tgt' : 'src'; if (! defined $search->{$pos}{work}) { for (@{$list->{$dir}}) { push @{$search->{$pos}{work}}, {key => $_, path => "$d +ir-$_"}; $search->{$pos}{term}{$_} = $search->{$pos}{work}[-1]; } $search->{$pos}{term}{$dir} = {key => $dir, path => $dir}; } my ($work, $next) = ($search->{$pos}{work}, []); while (@$work) { my $node = shift @$work; my ($word, $path) = @{$node}{qw/key path/}; next if $search->{$pos}{seen}{$word}++; if ($search->{$opp}{term}{$word}) { my @cur_path = split /-/, $path; my @con_path = split /-/, $search->{$opp}{term}{$word} +{path}; return $pos eq 'tgt' ? join '-', @con_path, @cur_path[reverse 0 .. $#cu +r_path - 1] : join '-', @cur_path, @con_path[reverse 0 .. $#co +n_path - 1]; } for (@{$list->{$word}}) { push @$next, {key => $_, path => "$path-$_"}; $search->{$pos}{term}{$_} = $next->[-1]; } } $search->{$pos}{work} = $next; } return 'path not found' if ! @{$search->{src}{work}} || ! @{$searc +h->{tgt}{work}}; return find_path($src, $tgt, $list, $search); }
I have Benchmarked the results to determine if it is indeed faster. The results are as follows:
BFS_1way 5.93/s -- -64% -93% BFS_2way 16.5/s 179% -- -79% Ieronim_2way 79.2/s 1235% 379% --
Switching to bi-directional more than doubled the speed but didn't catch Ieronim once he started using the precompiled datastructure. I guess I need to figure out find out how transform() differs from my BFS.

Update (2006-07-03): After re-writing Ieronim's code I finally discovered what the difference was. In a nutshell, I pull at item off the work queue and test to see if it connects a path. If it does not, I add every item it does connect with to the work queue. Switching where I test for a connection to before they are added to the queue instead of after they are taken off solves the mystery.
This is not reflected in any of the benchmarks because I believe this dead horse sufficiently beaten.

Cheers - L~R

Replies are listed 'Best First'.
Re^2: A Better Word Morph Builder
by Ieronim (Friar) on Jun 29, 2006 at 18:51 UTC
    i modified transform() subroutine acording to your recommendation — i moved from dereferecing $list hashref to using it as a reference. The speed near doubled :)

    New benchmark results:

    Rate find_path find_path2 transform find_path 6.52/s -- -72% -96% find_path2 23.7/s 263% -- -87% transform 183/s 2702% 673% --
      Ieronim,
      I give. While I understand your admittedly fast code - it is extremely hard to follow. The cleanest code I could come up with using a bi-directional BFS is as fast as your transform() before following my recommendation. That is to say, it is about half as fast as your current version:
      use constant SRC => 0; use constant TGT => 1; sub find_path2 { my ($src, $tgt, $list) = @_; my (@src_work, @tgt_work, %path); for my $dir (SRC, TGT) { my ($word, $work) = $dir == SRC ? ($src, \@src_work) : ($tgt, +\@tgt_work); $path{$word}[$dir] = -1; for (@{$list->{$word}}) { push @$work, $_; $path{$_}[$dir] = $word; } } while (1) { for my $dir (SRC, TGT) { my @next; my $work = $dir == SRC ? \@src_work : \@tgt_work; for my $word (@$work) { return build_path(\%path, $word) if $path{$word}[abs($ +dir - 1)]; for (@{$list->{$word}}) { next if $path{$_}[$dir]; push @next, $_; $path{$_}[$dir] = $word; } } @$work = @next; } return 'Path not found' if ! @src_work && ! @tgt_work; } } sub build_path { my ($tree, $node) = @_; my $path = "-$node"; for my $dir (SRC, TGT) { my $word = $tree->{$node}[$dir]; while ($word ne '-1') { $path = $dir == SRC ? "-$word$path" : "$path-$word"; $word = $tree->{$word}[$dir]; } } return substr($path, 1); }
      Update (2006-07-03): Simply moving return build_path(\%path, $word) if $path{$word}[abs($dir - 1)]; to the inner most for loop and s/word/_/ brings the performance much closer. See the update in this node for an explanation why.

      Cheers - L~R

      Ieronim,
      I have finished benchmarking the routines. I didn't include any from this thread or this one but you are welcome to. I did not penalize any routine for processing the dictionary. Additionally, if there were a number of small variations for a routine, I only included the fastest.
      Rate limbic__2 solo____1 limbic__3 limbic__4 ieronim_1 lim +bic__1 limbic__2 6.48/s -- -45% -62% -90% -95% + -96% solo____1 11.8/s 82% -- -31% -82% -92% + -92% limbic__3 17.3/s 166% 46% -- -74% -88% + -89% limbic__4 67.5/s 941% 471% 291% -- -53% + -55% ieronim_1 143/s 2104% 1109% 728% 112% -- + -5% limbic__1 150/s 2218% 1171% 771% 123% 5% + --
      Please note that the winner is just my re-write of Ieronim's code with a few extra bells and whistles. Since the benchmark is extremely long, I have put it in spoiler tags as well as readmore tags:
      Update (2006-07-03): These benchmarks do not reflect the realization I made concerning when to test if a path connection has been made that I noted elsewhere in this thread.

      Cheers - L~R

        The results of the same benchmark (!!) on my machine are here:
        Rate limbic__2 solo____1 limbic__3 limbic__4 limbic__1 ier +onim_1 limbic__2 5.64/s -- -63% -77% -94% -97% + -97% solo____1 15.3/s 172% -- -37% -83% -92% + -92% limbic__3 24.2/s 329% 58% -- -74% -87% + -88% limbic__4 91.7/s 1528% 499% 279% -- -52% + -54% limbic__1 190/s 3267% 1140% 684% 107% -- + -5% ieronim_1 200/s 3449% 1207% 727% 118% 5% + --
        I can't explain the difference :))
      Ieronim,
      I have taken the best parts of all the versions and re-written them in what I believe to be relatively clean code. I have not yet benchmarked this but I expect it to be very close to your current best. Here is a list of features:
      • Command line argument parsing
      • Ability to handle non-dictionary words (configurable)
      • Ability to handle absense of Text::LevenshteinXS (automatic)
      • Ability to use different compiled databases (configurable)
      • Ability to create new compiled databases (configurable)
      • Speed and non-duplicated code

      Cheers - L~R

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://558379]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (2)
As of 2024-04-24 23:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found