pjotrik
All right, here goes the code. I've kept it simple; using a trie instead of a hash, as [ikegami] does, would surely improve performance.
<code>
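#!/usr/bin/perl
use warnings;
use strict;

# Load the dictionary, keeping only the letters a-z of each entry.
my %dictionary;
open(my $dict_file, '<', '2of12inf.txt') or die ("Can't open dictionary: $!");
while (<$dict_file>) {
    $_ = lc($_);
    tr/a-z//dc;    # also strips the newline and any other non-letter
    $dictionary{$_} = 1;
}
close($dict_file);

# $partials[$n] caches the results for the last $n characters of the
# current input; it is reset for every input line.
my @partials;
while (<DATA>) {
    $_ = lc($_);
    tr/a-z//dc;
    @partials = (); $partials[0] = [''];
    my $results = wordify($_);
    print "$_:\n";
    print "$_\n" for @$results;
}

# Return a reference to the list of all splits of $string into dictionary
# words, with non-word fragments shown in brackets.
sub wordify {
    my ($string) = @_;
    my $results = [];
    # The first word must start before the end of the earliest-ending word
    # found so far; otherwise the skipped fragment would contain a word.
    my $min_end = length($string);
    for (my $pos = 0; $pos < $min_end; $pos++) {
        for my $len (1 .. length($string) - $pos) {
            my $word = substr($string, $pos, $len);
            if (exists $dictionary{$word}) {
                $min_end = $pos+$len if $pos+$len < $min_end;
                my $prefix = '';
                $prefix = '[' . substr($string, 0, $pos) . '] ' unless $pos == 0;
                $prefix .= $word;
                my $rest = substr($string, $pos + $len);
                # Split the remainder only once; later calls reuse the cache.
                wordify($rest) unless (defined $partials[length($rest)]);
                my $endings = $partials[length($rest)];
                push(@$results, map("$prefix $_", @$endings));
            }
        }
    }
    # No word found anywhere: the whole string is one non-word fragment.
    $results = ["[$string]"] unless @$results;
    $partials[length($string)] = $results;
    return $results;
}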
__DATA__
penisland
zatxtaz
xapenx
</code>
This gives the following result (brackets denote the non-word fragments):
<code>
penisland:
pen is la [nd]
pen is land
pen is [l] an [d]
pen is [l] and
pen island
penis la [nd]
penis land
penis [l] an [d]
penis [l] and
[p] en is la [nd]
[p] en is land
[p] en is [l] an [d]
[p] en is [l] and
[p] en island
zatxtaz:
[z] at [xtaz]
xapenx:
[x] ape [nx]
[xa] pen [x]
[xap] en [x]
</code>
and for the <c>2of4brif</c> dictionary:
<code>
penisland:
pen is land
pen is [l] an [d]
pen is [l] and
pen island
penis land
penis [l] an [d]
penis [l] and
zatxtaz:
[z] at [x] ta [z]
xapenx:
[x] ape [nx]
[xa] pen [x]
</code>
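For reference, here is a minimal sketch of the trie idea mentioned at the top. It is not [ikegami]'s code: the names build_trie and words_at are made up for illustration, and the trie is assumed to be plain nested hashes with the empty-string key marking the end of a word. The point is that the lookup can stop as soon as no dictionary word continues with the next letter, instead of testing every possible substring length against the hash.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: build a trie as nested hashes, one level per letter.
# The key '' (empty string) marks a node where a complete word ends.
sub build_trie {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        for my $letter (split //, $word) {
            $node->{$letter} ||= {};
            $node = $node->{$letter};
        }
        $node->{''} = 1;
    }
    return \%trie;
}

# Hypothetical helper: return every dictionary word that starts at
# position $pos of $string, shortest first.
sub words_at {
    my ($trie, $string, $pos) = @_;
    my @found;
    my $node = $trie;
    for my $i ($pos .. length($string) - 1) {
        # Walk one letter deeper; give up once no word continues this way.
        $node = $node->{ substr($string, $i, 1) } or last;
        push @found, substr($string, $pos, $i - $pos + 1)
            if exists $node->{''};
    }
    return @found;
}

# Small demo with a hand-picked word list instead of a dictionary file.
my $trie = build_trie(qw(pen penis is island land an and));
print join(' ', words_at($trie, 'penisland', 0)), "\n";   # pen penis
print join(' ', words_at($trie, 'penisland', 3)), "\n";   # is island
```

In the program above, a call like words_at($trie, $string, $pos) could stand in for the inner loop over $len together with the exists test, under the assumption that the trie is built once from the dictionary file.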