Challenge: prefix($x, $y) . suffix($x, $z) eq $x

All,
The other day blokhead told me about an interesting pairing on the Women's Tennis World Tour. Consider the following 3 players. Specifically, there last names and the relationship between prefix and suffix matching.

Caroline Wozniacki
Aleksandra Wozniak
Sabine Lisicki

prefix(Wozniacki, Wozniak) = Woznia
suffix(Wozniacki, Lisicki) = cki

Woznia . cki = Wozniacki
[download]

He wondered if any other such pairings existed and if there were an elegant solution to the general problem. This interested me so I asked a few clarifying questions to establish some rules:

No empty matches (must have at least 1 letter in common)
Entire word can not be matched (prefix(hello, hell) is no good because 'hell' matches hell in its entirety)
The maximum match must be taken (prefix(blue, black) = bl - you can't just take the b for a better suffix match)

I came up with the following solution using a dictionary and not tennis players.

#!/usr/bin/perl
use strict;
use warnings;
use Storable;

my ($prefix_db, $suffix_db, $dict_db) = (qw/prefix.db suffix.db dict.d
+b/);

if (! -e $prefix_db || ! -e $suffix_db || ! -e $dict_db) {
    my (%prefix, %suffix, %dict);
    open(my $fh, '<', 'words.txt') or die "Unable to open 'words.txt' 
+for reading: $!";
    while (<$fh>) {
        tr/a-z//cd;
        next if length($_) < 2;
        build_tree(\%prefix, \%suffix, $_);
        $dict{$_} = undef;
    }
    store(\%prefix, $prefix_db);
    store(\%suffix, $suffix_db);
    store(\%dict,   $dict_db);
}
my ($prefix, $suffix, $dict) = (retrieve($prefix_db), retrieve($suffix
+_db), retrieve($dict_db));

my $req = $ARGV[0] or die "Usage: $0 <word>";
die "'$req' is not in dictionary" if ! exists $dict->{$req};

for my $idx (0 .. length($req) - 2) {
    my $pre  = substr($req, 0, $idx + 1);
    my $char = substr($req, $idx, 1);
    my $next = substr($req, $idx + 1, 1);
    my @head = map { $prefix->{$pre}{'*words*'}{$_} ne $next ? $_ : ()
+ } keys %{$prefix->{$pre}{'*words*'}};
    next if ! @head;  # Should this be last instead?
    my $suf  = substr($req, $idx + 1);
    my @tail = map { $suffix->{$suf}{'*words*'}{$_} ne $char ? $_ : ()
+ } keys %{$suffix->{$suf}{'*words*'}};
    next if ! @tail;  # Should this be last instead?
    # Cartesian Product Possible
    print "prefixmatch($req, $head[0]) = $pre, suffixmatch($req, $tail
+[0]) = $suf, $pre . $suf = $req\n";
}

sub build_tree {
    my ($prefix, $suffix, $word) = @_;

    # Prefix hello => h, he, hel, hell but not hello
    for my $idx (0 .. length($word) - 2) {
        my $pre = substr($word, 0, $idx + 1);
        $prefix->{$pre}{'*words*'}{$word} = substr($word, $idx + 1, 1)
+;
    }

    # Suffix hello => o, lo, llo, ello but not hello
    for my $idx (1 .. length($word) - 1) {
        my $suf = substr($word, $idx);
        $suffix->{$suf}{'*words*'}{$word} = substr($word, $idx - 1, 1)
+;
    }
}
__END_
$ ./match_me.pl hello

prefixmatch(hello, haughtiness) = h, suffixmatch(hello, bordello) = el
+lo, h . ello = hello
prefixmatch(hello, hepatic) = he, suffixmatch(hello, armadillo) = llo,
+ he . llo = hello
prefixmatch(hello, helical) = hel, suffixmatch(hello, gigolo) = lo, he
+l . lo = hello
prefixmatch(hello, hellhole) = hell, suffixmatch(hello, kangaroo) = o,
+ hell . o = hello

# Note:  Not all possible matches, just one of each prefix length
[download]

This has a heavy penalty the first time it is ran but is relatively fast after. Can you do better?

Cheers - L~R

Comment on Challenge: prefix($x, $y) . suffix($x, $z) eq $x Select or Download Code

Replies are listed 'Best First'.
Re: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by BrowserUk (Patriarch) on Apr 22, 2009 at 20:06 UTC
This seems too easy, which usually means I've misinterpreted the rules somewhere? Read more... (5 kB) Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re^2: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by Limbic~Region (Chancellor) on Apr 22, 2009 at 22:11 UTC
BrowserUk, I believe you have the requirements spot on. I agree it wasn't much of a "challenge" but I like to keep such posts similarly named for searching purposes later. When blokhead first described the problem he was mentioning possibly using suffix trees and other doodads which made me wonder how other people would approach it. Update: The number of matches came as a big suprise to blokhead so I like the way you consolidated results. The reason I limited the matching to a single input was because of the enormous amount of output when doing all possible matches as blokhead first asked for though I coded it to find all matches. Would your approach have changed if you were doing all matches not just one for the input provided? Cheers - L~R	[reply]
Re^3: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by BrowserUk (Patriarch) on Apr 23, 2009 at 00:44 UTC
Would your approach have changed if you were doing all matches not just one for the input provided? Of necessity it would have to change a bit if it were going to look up more than one word per run. I saved a litte time by pre-filtering the pre/suffix hashes to avoid storing prefixed that would contravene the longest prefix only rules. That means those hashes are specific to the word being dealt with. To match more than one word per run, requires deferring the filtering until later. Here's my first attempt at that. It deals with the tennis players file in under 1/3 seconds, so I've made no attempt to optimise it: #! perl -sw use 5.010; use strict; use Time::HiRes qw[ time ]; use Data::Dump qw[ pp ]; my $start = time; die "Need words filename" unless @ARGV and -e $ARGV[ 0 ]; open WORDS, '<', $ARGV[ 0 ] or die $!; chomp( my @words = <WORDS> ); close WORDS; my( %pre, %suf ); for ( @words ) { for my $p ( 1 .. length() -1 ) { push @{ $pre{ substr $_, 0, $p } }, $_; push @{ $suf{ substr $_, - $p } }, $_; } } for my $word ( @words ) { for my $p ( 1 .. length( $word ) - 1 ) { my $pre = substr $word, 0, $p; my $prePlus = substr $word, 0, $p+1; my $suf = substr $word, -( length( $word ) - $p ); my $sufPlus = substr $word, -( length( $word ) - ( $p-1 ) ); if( exists $pre{ $pre } and exists $suf{ $suf } ) { my @pre = grep{ !/^$prePlus/ } @{ $pre{ $pre } }; next unless @pre; my @suf = grep{ !/$sufPlus$/ } @{ $suf{ $suf } }; next unless @suf; say "$pre . $suf = $word"; printf "\t$pre ( %s )\n", @pre <= 10 ? join( ' ', @pre ) : join( ' ', @pre[ 0 .. 9 ], '... ' . @pre ); printf "\t$suf ( %s )\n", @suf <= 10 ? join( ' ', @suf ) : join( ' ', @suf[ 0 .. 9 ], '... ' . @suf ); } } } printf STDERR "Took: %.2f seconds\n", time() -$start; __END__ C:\test>759369 players.txt >nul Took: 0.29 seconds C:\test>759369 players.txt a . hn = ahn a ( abramovic adamczak afinogenova aguilar akiki akita alawi a +lbanese albuquerque aleksandrova ... 43 ) hn ( mohn ) aki . ki = akiki aki ( akita ) ki ( dabrowski filipovski jovanovski kitazaki lisicki miyazaki + solanki wozniacki ) aki . ta = akita aki ( akiki ) ta ( costa konta namigata pennetta tananta yokota zanchetta ) ... w . ong = wong w ( wang wannasuk warburton washington webleysmith weidemann w +einhold wejnert welford westbury ... 20 ) ong ( hong jeong keothavong tangphong zhong ) wo . ng = wong wo ( woerle wowchuk wozniacki wozniak ) ng ( chang cheng chuang frilling haring herring huang hwang ka +ng king ... 16 ) woznia . cki = wozniacki woznia ( wozniak ) cki ( lisicki ) woznia . k = wozniak woznia ( wozniacki ) k ( antoniychuk black blank buryachok czink ewijk fink fitzpat +rick gawlik grajdek ... 29 ) x . i = xi x ( xu ) i ( akiki alawi alnabhani andrei ani appineni arai bai balducc +i bartoli ... 93 ) x . ie = xie x ( xu ) ie ( binnie delefortrie elie ) x . u = xu x ( xi xie ) u ( anghelescu begu buzarnescu cadantu daniilidou dulgheru faf +aliou georgatou gerasimou hincu ... 28 ) ... z . hong = zhong z ( zabala zafirova zagorska zahlavova zahlavovastrycova zaja +zakopalova zanchetta zaniewska zecpeskiric ... 17 ) hong ( tangphong ) zh . ong = zhong zh ( zhang zhao zharkova zheng ) ong ( jeong keothavong wong ) zho . ng = zhong zho ( zhou ) ng ( chang cheng chuang frilling haring herring huang hwang ka +ng king ... 16 ) zh . ou = zhou zh ( zhang zhao zharkova zheng ) ou ( daniilidou fafaliou georgatou gerasimou ) zho . u = zhou zho ( zhong ) u ( anghelescu begu buzarnescu cadantu dulgheru hincu hisamats +u hsu liu lu ... 24 ) zo . ric = zoric zo ( zotter zovko ) ric ( majeric njiric zecpeskiric ) Took: 3.79 seconds [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by shmem (Chancellor) on Apr 22, 2009 at 21:30 UTC
Here's my straight forward solution, ungolfed... <Reveal this spoiler or all in this thread> timed as `real 0m0.387s user 0m0.381s sys 0m0.003s` [download] update: using Limbic~Region's name list I get `real 0m12.441s user 0m12.219s sys 0m0.015s` [download] and replacing m// with index in `sub suffix` above `real 0m6.405s user 0m6.273s sys 0m0.007s` [download] on a Pentium M 2.10GHz, perl 5.8.8 - What are your timings? What is "better"? Code readability, brevity, maintainability, performance? ;-)	[reply] [d/l] [select]
Re^2: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by Limbic~Region (Chancellor) on Apr 22, 2009 at 22:18 UTC
shmem, What is "better"? Bah, I was just baiting folks into contributing. I do see a syntax error in what you posted (last if block is missing closing curly)? After correcting that typo, it takes about 11 wallclock seconds to run the tennis name list. Thanks for your contribution :-) Cheers - L~R	[reply]
Re^3: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by shmem (Chancellor) on Apr 22, 2009 at 22:27 UTC
I do see a syntax error Thanks, corrected. - I seem to just have learned something about index: use it! Did cut the wallclock time to the half it did... ;-) ... see my update.	[reply]
Re^4: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by Limbic~Region (Chancellor) on Apr 22, 2009 at 22:35 UTC
Re^5: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by shmem (Chancellor) on Apr 22, 2009 at 23:59 UTC
Some notes below your chosen depth have not been shown here
Challenge: prefix($x, $y) . suffix($x, $z) eq $x (tennis players) by Limbic~Region (Chancellor) on Apr 22, 2009 at 18:58 UTC
All, The answer to blokhead's original question is yes - many pairings have the same property as he had found. If you would like to use that as a sample set instead of a dictionary file, I am providing the player last names below: Read more... (9 kB)	[reply] [d/l]
Re: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by MidLifeXis (Monsignor) on Apr 22, 2009 at 21:30 UTC
Considering how some names were constructed, I would think (although this is from not being educated in how names evolved) that this could be fairly common... Relationships - Johnson (John's Son), Jackson, etc Occupations - Cooper (the barrel maker), Farmer, Hunter, Smith Variants on one of the above - Johnson, Johnston, Jonson, Jensen; Smith, Smithe, Smythe I would surmise that you would see many common prefixes and suffixes being present among a large enough population of names. This is, however, an interesting problem. I can see bringing this one out later on if any of my kids take an interest in programming. --MidLifeXis The tomes, scrolls etc are dusty because they reside in a dusty old house, not because they're unused. --hangon in this post	[reply]
Re: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by FunkyMonk (Chancellor) on Apr 22, 2009 at 21:43 UTC
/me joins in the fun Read more... (1472 Bytes) Sample output: `woznia.cki PREFIX woznia: wozniak SUFFIX cki: lisicki zh.ou PREFIX zh: zhang zhao zharkova zheng zhong SUFFIX ou: daniilidou fafaliou georgatou gerasimou zho.u PREFIX zho: zhong SUFFIX u: anghelescu begu buzarnescu cadantu daniilidou dulgheru faf +aliou georgatou gerasimou hincu hisamatsu hsu liu lu mitu nedelcu nic +ulescu olaru perianu qiu radu senoglu shimizu stancu tigu vaideanu xu + yu` [download] ~~and the full output~~ eek. There's an awfull lot of it (>200K) for the full list of players Timings (full run): `real 0m0.266s user 0m0.256s sys 0m0.012s` [download]	[reply] [d/l] [select]
Re^2: Challenge: prefix($x, $y) . suffix($x, $z) eq $x by Limbic~Region (Chancellor) on Apr 22, 2009 at 22:28 UTC
FunkyMonk, I believe this code is flawed in that it violates two of the rules. The maximum match must be taken. Entire word can not be matched Changing the players for a dictionary file, I tried 'hello'. It produced the following: `h.ello PREFIX h: ha haberdasher ..... he head` [download] Neither 'he' nor 'head' should be matched here because they share more than just 'h' in common with 'hello'. 'he' should be excluded because it consumes the entire word. Cheers - L~R	[reply] [d/l]


Perl Monk, Perl Meditation
	PerlMonks