Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Re^12: partial match between 2 files

by lakssreedhar (Acolyte)
on Dec 28, 2012 at 06:24 UTC ( #1010647=note: print w/ replies, xml ) Need Help??


in reply to Re^11: partial match between 2 files
in thread partial match between 2 files

The code produces an error "Use of uninitialized value in print" when a word awiSayanA is compared with 2 words
awiSayaM
awiSayanZ
present in dictionary.Output is not produced.


Comment on Re^12: partial match between 2 files
Re^13: partial match between 2 files
by Athanasius (Prior) on Dec 28, 2012 at 09:53 UTC

    OK, so it appears that by “maximum partial match” you mean longest common substring. A search on that phrase found the thread finding longest common substring, from which I derived the following:

    Update 1 (1st January, 2013):

    Algorithm::Diff is actually the wrong module for this, I should have used String::LCSS. The former finds non-contiguous sub-sequences; the latter finds substrings. Revised code:

    #! perl use strict; use warnings; use String::LCSS; use constant { CASE_SENSITIVE => 0, DICTIONARY_FILE => 'words.txt', }; print 'Enter the target word: '; chomp(my $orig_target = <STDIN>); my $target = CASE_SENSITIVE ? $orig_target : lc $orig_target; open(my $in, '<', DICTIONARY_FILE) or die "Cannot open file '" . DICTIONARY_FILE . "' for reading: $! +"; my %substrings; while (my $orig_word = <$in>) { chomp $orig_word; my $word = CASE_SENSITIVE ? $orig_word : lc $orig_word; my @lcss = lcss($word, $target); $substrings{ $lcss[0] } = [ $orig_word, $lcss[1], $lcss[2] ] if $l +css[0]; } close $in or die "Cannot close file '" . DICTIONARY_FILE . "': $!"; print 'Target: ', $orig_target, "\n"; if (%substrings) { my $key = (sort { length $a <=> length $b } keys %substrings)[- +1]; my $match = $substrings{ $key }->[0]; my $index2 = $substrings{ $key }->[2]; my $substr = substr($orig_target, $index2, length $key); print 'Closest match: ', $match, "\n"; print 'Longest common substring: ', $substr, "\n"; } else { print "No matches found\n"; } sub lcss { my ($first, $second) = @_; $first .= '$'; # force strings to be different: $second .= '@'; # kludge required by String::LCSS::lcss my @results = String::LCSS::lcss($first, $second); return wantarray ? @results : $results[0]; }

    Update 2 (1st January, 2013):

    It appears that String::LCSS is more badly broken than I realised. Even simple matches can fail to find the longest common substring:

    18:27 >perl -MString::LCSS=lcss -wE "say scalar lcss('abxabcy', 'abc') +;" ab 18:28 >

    (And see http://cpanratings.perl.org/dist/String-LCSS.)

    Better to replace sub lcss in the above script with the following by BrowserUk in Re: finding longest common substring:

    sub lcss { my $strings = join "\0", @_; my $lcs; for my $n (1 .. length $strings) { my $re = "(.{$n})" . '.*\0.*\1' x (@_ - 1); last unless $strings =~ $re; $lcs = $1; } return $lcs; }

    Update 3 (2nd January, 2013):

    Discovered the thread Does String::LCSS work?. String::LCSS is indeed broken, but String::LCSS_XS seems to work correctly.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      A problem exists like this gets only longest common sub string.suppose my dictionary has words crick, cricketer and testing input is cricking,the match should be to crick but the program gives cricketer as closest matching string

        Hello lakssreedhar,

        As I understand it, you’re writing a script which accepts as input a word (which I will call the target) and the name of a dictionary file. The purpose of the script is to identify the best match for the target (if any) among the entries in the dictionary. Well, there would seem to be four possible ways in which a dictionary entry can match the target word:

        1. They are identical
        2. The target is a substring of the dictionary entry
        3. The dictionary entry is a substring of the target
        4. They share a common substring.

        I suppose it’s obvious that if (1) is satisfied, this is the best match. And from earlier posts in this thread, it appears that if none of (1), (2), or (3) is true, then (4) looks for the the longest common substring (i.e., the “best” match is the one with the longest substring in common). But now things get murky. What if both (2) and (3) are satisfied? For example, the target is reread and the dictionary contains both read and rereading? And this is only one example of the sort of edge cases that need to be considered.

        suppose my dictionary has words crick, cricketer and testing input is cricking,the match should be to crick

        OK, but why? What are the rules for a “best match”? These rules need to be specified very precisely (and of course only you can do this, since only you know what your script is intended to do).

        Here is how I would approach the problem:

        • Think through the above considerations, and come up with a precise set of rules describing exactly what is the “best match” in all possible cases.
        • Illustrate each of these rules with sample input (target word and dictionary contents) together with the desired output.
        • Use this sample input to create a simple test harness that will test your script. (For example, see Test::Tutorial and What are some good testing references?.)
        • Design an algorithm to apply the rules for matching, drawing on the advice and techniques given in other posts in this thread.
        • Implement the algorithm and test it by running the test harness.

        Hope that gives you some help,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1010647]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2014-07-14 00:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (253 votes), past polls