http://www.perlmonks.org?node_id=712402


in reply to Challenge: "Words" In A String

Very rough draft (pseudocode)
sub wordify($string) { my $results = []; $min_end = length($string) for $pos (0 .. $min_end) { for $len (1 .. length($string)) { $word = substr($string, $pos, $len) if exists($dictionary{$word}) { $min_end = $pos+$len if $pos+$len < $min_end $prefix = substr($string, 0, $pos) . " " . $word # nee +ds special treatment for $pos = 0 $endings = wordify(substr($string, $pos+1)) push(@$results, map("$prefix $_", @$endings)) } } } return [$string] unless @$results return $results }
Hope I don't get it all wrong, it's quite late here in Europe. I'd definitely use dynamic programming to save the partial results.

Replies are listed 'Best First'.
Re^2: Challenge: "Words" In A String
by pjotrik (Friar) on Sep 19, 2008 at 10:21 UTC
    All right, here goes the code. I've kept it simple, using trie instead of hash as ikegami does would surely improve performance.
    #!/usr/bin/perl use warnings; use strict; my %dictionary; open(my $dict_file, '<', '2of12inf.txt') or die ("Can't open dictionar +y: $!"); while (<$dict_file>) { $_ = lc($_); tr/a-z//dc; $dictionary{$_} = 1; } my @partials; while (<DATA>) { $_ = lc($_); tr/a-z//dc; @partials = (); $partials[0] = ['']; my $results = wordify($_); print "$_:\n"; print "$_\n" for @$results; } sub wordify { my ($string) = @_; my $results = []; my $min_end = length($string); for (my $pos = 0; $pos < $min_end; $pos++) { for my $len (1 .. length($string) - $pos) { my $word = substr($string, $pos, $len); if (exists $dictionary{$word}) { $min_end = $pos+$len if $pos+$len < $m +in_end; my $prefix = ''; $prefix = '[' . substr($string, 0, $po +s) . '] ' unless $pos == 0; $prefix .= $word; my $rest = substr($string, $pos + $len +); wordify($rest) unless (defined $partia +ls[length($rest)]); my $endings = $partials[length($rest)] +; push(@$results, map("$prefix $_", @$en +dings)); } } } $results = ["[$string]"] unless @$results; $partials[length($string)] = $results; return $results; } __DATA__ penisland zatxtaz xapenx
    Giving the result (brackets denote the non-word fragments)
    penisland: pen is la [nd] pen is land pen is [l] an [d] pen is [l] and pen island penis la [nd] penis land penis [l] an [d] penis [l] and [p] en is la [nd] [p] en is land [p] en is [l] an [d] [p] en is [l] and [p] en island zatxtaz: [z] at [xtaz] xapenx: [x] ape [nx] [xa] pen [x] [xap] en [x]
    and for the 2of4brif dictionary
    penisland: pen is land pen is [l] an [d] pen is [l] and pen island penis land penis [l] an [d] penis [l] and zatxtaz: [z] at [x] ta [z] xapenx: [x] ape [nx] [xa] pen [x]
      pjotrik,
      I like the way you distinguish between word substrings and non-word substrings. Unfortunately, the order isn't correct. Given the way you return the results and print them, adding an order wouldn't be difficult. I haven't tested this against any large sample data with potential cases but it looks good from eyeballing it :-)

      Cheers - L~R