http://www.perlmonks.org?node_id=712498


in reply to Re: Challenge: "Words" In A String
in thread Challenge: "Words" In A String

All right, here goes the code. I've kept it simple, using trie instead of hash as ikegami does would surely improve performance.
#!/usr/bin/perl use warnings; use strict; my %dictionary; open(my $dict_file, '<', '2of12inf.txt') or die ("Can't open dictionar +y: $!"); while (<$dict_file>) { $_ = lc($_); tr/a-z//dc; $dictionary{$_} = 1; } my @partials; while (<DATA>) { $_ = lc($_); tr/a-z//dc; @partials = (); $partials[0] = ['']; my $results = wordify($_); print "$_:\n"; print "$_\n" for @$results; } sub wordify { my ($string) = @_; my $results = []; my $min_end = length($string); for (my $pos = 0; $pos < $min_end; $pos++) { for my $len (1 .. length($string) - $pos) { my $word = substr($string, $pos, $len); if (exists $dictionary{$word}) { $min_end = $pos+$len if $pos+$len < $m +in_end; my $prefix = ''; $prefix = '[' . substr($string, 0, $po +s) . '] ' unless $pos == 0; $prefix .= $word; my $rest = substr($string, $pos + $len +); wordify($rest) unless (defined $partia +ls[length($rest)]); my $endings = $partials[length($rest)] +; push(@$results, map("$prefix $_", @$en +dings)); } } } $results = ["[$string]"] unless @$results; $partials[length($string)] = $results; return $results; } __DATA__ penisland zatxtaz xapenx
Giving the result (brackets denote the non-word fragments)
penisland: pen is la [nd] pen is land pen is [l] an [d] pen is [l] and pen island penis la [nd] penis land penis [l] an [d] penis [l] and [p] en is la [nd] [p] en is land [p] en is [l] an [d] [p] en is [l] and [p] en island zatxtaz: [z] at [xtaz] xapenx: [x] ape [nx] [xa] pen [x] [xap] en [x]
and for the 2of4brif dictionary
penisland: pen is land pen is [l] an [d] pen is [l] and pen island penis land penis [l] an [d] penis [l] and zatxtaz: [z] at [x] ta [z] xapenx: [x] ape [nx] [xa] pen [x]

Replies are listed 'Best First'.
Re^3: Challenge: "Words" In A String
by Limbic~Region (Chancellor) on Sep 19, 2008 at 13:46 UTC
    pjotrik,
    I like the way you distinguish between word substrings and non-word substrings. Unfortunately, the order isn't correct. Given the way you return the results and print them, adding an order wouldn't be difficult. I haven't tested this against any large sample data with potential cases but it looks good from eyeballing it :-)

    Cheers - L~R