Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Longest String Matching

by Dr Manhattan (Beadle)
on Apr 08, 2013 at 11:40 UTC ( #1027470=perlquestion: print w/replies, xml ) Need Help??

Dr Manhattan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

I am trying to write a script to match the longest string at the end of a word. I am trying to do this by splitting each word into an array, so that each letter becomes an array-element. I then have a hash(%lexicon) with a couple of words stored. If I have the words 'own' and 'known' in my hash, and I try to find the longest string at the end of the word 'unknown', the result should be 'known'. However the best I can do is get back 'own'. unknown = unkn + own, which is not what I want.

my $end; my $mover; my $right; foreach my $q (keys %wordlist) { push (@array, split(//, $q)); $end = $#array; $mover = $end-2; &rightside(); print "$right\n"; @array = (); } sub rightside { if (exists $lexicon{join q(), @array[$mover..$end]}) { $right = join q(), @array[$mover..$end]; } else { $mover--; &rightside(); } }

I tried to replace the subroutine with this

sub rightside { for (my $x = $#array-2; $x == 0; $x--) { if (exists $lexicon{join q(), @array[$x..$#array]}) { $right = join q(), @array[$x..$#array]; } } }

But it doesn't work either. Any help would be appreciated

Replies are listed 'Best First'.
Re: Longest String Matching
by space_monk (Chaplain) on Apr 08, 2013 at 11:56 UTC
    I'm open to better ideas.... :-)
    while (length($word) && !exists $lookup{$word}) { substr($word, 0, 1) = ""; # remove 1st char } # $word will be empty or the matched string

    Update: This probably works well if the lexicon contains a lot of words as the execution time of this is proportional only to the length of the word. hdb has contributed an improvement to reduced the complexity a little, using the common trick of guaranteeing a match for zero length. There is a regex solution below this which is interesting too...

    A Monk aims to give answers to those who have none, and to learn from those who know more.

      I do not think it can get much better. My proposal is not very useful for large dictionaries...

        Thanks; Maybe my solution is better than I first thought! :-)
        A Monk aims to give answers to those who have none, and to learn from those who know more.
Re: Longest String Matching
by kcott (Bishop) on Apr 08, 2013 at 12:14 UTC

    G'day Dr Manhattan,

    I think you decided that splitting the word into individual characters was a good approach and then got rather bogged down attempting to wring a solution from this technique. A regex using alternation would probably be a better way to do this.

    $ perl -Mstrict -Mwarnings -E ' my %lexicon = map { $_ => 1 } qw{own known}; my $string = q{unknown}; my $alt = join q{|} => sort { length $b <=> length $a } keys %lexi +con; my $re = qr{ ( (?> $alt ) ) $ }x; $string =~ $re; say $1; ' known

    Also, unless the hash %lexicon already exists for some other purpose, you'd probably be better off with an array:

    $ perl -Mstrict -Mwarnings -E ' my @lexicon = qw{own known}; my $string = q{unknown}; my $alt = join q{|} => sort { length $b <=> length $a } @lexicon; my $re = qr{ ( (?> $alt ) ) $ }x; $string =~ $re; say $1; ' known

    Note how sort puts the longest string first, i.e. to test for "known" before testing for "own". See perlreftut for an introduction to alternation; and perlre for more information as well as a description of the (?>pattern) construct (in the Extended Patterns section).

    -- Ken

Re: Longest String Matching
by choroba (Archbishop) on Apr 08, 2013 at 11:57 UTC
    You can use rindex to find the last occurrence of a substring in a string:
    #!/usr/bin/perl use warnings; use strict; my @lexikon = sort { length $b <=> length $a } qw/ own known /; my $word = "unknown"; print rightside($word), "\n"; sub rightside { my $word = shift; my $length = length $word; for my $l (@lexikon) { return $l if $length - length $l == rindex $word, $l; } return q(); }
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Longest String Matching
by hdb (Monsignor) on Apr 08, 2013 at 11:49 UTC

    Splitting into letters seems more cumbersome than using regular expressions:

    use strict; use warnings; my @lexikon = qw/ own known /; my $word = "unknown"; my $longestmatch = ""; foreach my $w (@lexikon) { if( $word =~ /$w$/ ) { $longestmatch=$w if length($w)>length($longestmatch); } } # $longestmatch now has the longest match $word =~ /(.*)($longestmatch$)/; print "$1+$2\n";
Re: Longest String Matching
by Rahul6990 (Beadle) on Apr 08, 2013 at 12:03 UTC
    try this:
    %wordlist= ("unknown", -1); %lexicon = ("own", 1, "known", 1); foreach my $q (keys %wordlist) { push (@array, split(//, $q)); $end = $#array; $mover = 1; print "\nend : $end \nmove : $mover\n"; &rightside(); print "$right\n"; @array = (); } sub rightside { last if($mover == 0 or $move == $end); if (exists $lexicon{join q(), @array[$mover..$end]}) { $right = join q(), @array[$mover..$end]; print "\nright:$right\n"; } else { $mover++; &rightside(); } }
    the problem in your code was, you were starting with the smallest string ($mover = 2 and $end = 6)which always return 'own' thats why the output was showing 'own'. Now we are starting with the largest possible string and keeps on reducing the string size till we find the matched string in lexicon or the string length becomes 0.
Re: Longest String Matching
by grizzley (Chaplain) on Apr 08, 2013 at 14:19 UTC
    As you know exactly where you want to match (at the end of string), you can first get length n of a word you are searching for, then using substr function get substring of length n of the word you are searching in and just do eq comparison of two strings:
    #!perl -l @array = ('own', 'known', 'owl'); $q = 'unknown'; $longestmatching = ''; $longestn = 0; for $testword(@array) { $n = length($testword); next if $n <= $longestn; $substr = substr($q, length($q)-$n, $n); if($testword eq $substr) { $longestmatching = $testword; $longestn = $n; } } print $longestmatching;
    Additionally you can do sort of @array from longest to shortest string as others mentioned.
Re: Longest String Matching
by AnomalousMonk (Bishop) on Apr 09, 2013 at 01:40 UTC

    I don't know if this will be as fast as pemungkah's approach, but the Perl regex engine builds a trie for alternations, so even quite large alternations (354K words in Moby file) are very fast. The big time sink is reading all the words and building the alternation string in the first place, but of course this could be done in a separate step and the resulting joined string saved to a file for later quick access. (Also note that the dictionary file I'm using has oddities such as 'words' like "d's" and "e's", and apparently every letter of the alphabet appears as a word.)

    >perl -wMstrict -le "my $wordlist = shift // die 'no wordlist filename given'; open my $fh_wordlist, '<', $wordlist or die qq{opening '$wordlist': $!}; my $words = join '|', sort { length($b) <=> length($a) } map { chomp; $_; } <$fh_wordlist> ; close $fh_wordlist or die qq{closing '$wordlist': $!}; ;; my @lexicon = qw( no now know known unknown antidisestablishment husband wife husband's wife's xyzzy a ); for my $word (@lexicon) { my ($longest_at_end) = $word =~ m{ \B ($words) \z }xms; print qq{'$word' -> '$longest_at_end'}; } " ..\..\moby\mwords\354984si.ngl 'no' -> 'o' 'now' -> 'ow' 'know' -> 'now' 'known' -> 'own' 'unknown' -> 'known' 'antidisestablishment' -> 'disestablishment' 'husband' -> 'band' 'wife' -> 'ife' 'husband's' -> 'd's' 'wife's' -> 'e's' 'xyzzy' -> 'y' Use of uninitialized value $longest_at_end in concatenation (.) or str +ing at -e line 1. 'a' -> ''
      Someone ought to run Benchmark over the various proposals above and see which work out quickest (and under what circumstances)
      A Monk aims to give answers to those who have none, and to learn from those who know more.
Re: Longest String Matching
by pemungkah (Priest) on Apr 08, 2013 at 20:35 UTC
    Reverse the string you're searching, and add the items you're searching for (reversed too) to a Tree::Trie, which is optimized for fast prefix searches. You're not using a has but it's still O(1).

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1027470]
Approved by Corion
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (1)
As of 2020-10-24 21:25 GMT
Find Nodes?
    Voting Booth?
    My favourite web site is:

    Results (247 votes). Check out past polls.