PerlMonks  

help with comparing two arrays of phrases

by sdtej (Initiate)
on Oct 10, 2012 at 06:53 UTC
sdtej has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks, I have a tab-delimited text file with three columns. The first column is an ID, and the next two columns are phrases (descriptions). I am trying to write code that compares the two phrases to see if they are "similar". My approach is to split the first phrase into individual words, skip any word shorter than 3 characters, and then check whether each remaining word occurs in the second phrase. As output, I print every match. Here is the code I have written so far:

    my @data;
    while(<>) {
        push @data, $_;
    }

    foreach my $line (@data) {
        my @temp_array = split "\t", $line; # Split columns into an array
        $temp_array[1] =~ tr/\"\-\/,/ /; #Change all potential word endings to a single space
        $temp_array[1] =~ tr/\(\)//d; # Remove parentheses to avoid mishaps during pattern matching
        $temp_array[2] =~ tr/\"\-\/,/ /; #Same as above
        $temp_array[2] =~ tr/\(\)//d; #Same as above
        my @words = split " ", $temp_array[1]; # Split first phrase into individual words
        for(my $i = 0; $i < @words; $i++) {
            my $match_count = 1;
            if(length ($words[$i]) < 3) {
                next;
            }
            elsif(length ($words[$i]) < 5) {
                if($words[$i] =~ /$temp_array[2]/i) {
                    print "Match $match_count (probable): $words[$i]\n";
                    $match_count++;
                }
                else {
                    next;
                }
            }
            else {
                if($words[$i] =~ /$temp_array[2]/i) {
                    print "Match $match_count: $words[$i] \n";
                    $match_count++;
                }
                else {
                    next;
                }
            }
        }
    }

Running this code produces no output and the warning "Unmatched parenthesis in regex", even though I'm removing all parentheses from the text. All my debugging and testing points to the pattern matching as the source of the error. Is there any other way to achieve what I want (that is, case-insensitive substring matching)? Or, even better, has someone else already written such code? Here are the first five lines of my input for your reference:

    MIP_00001	Chromosomal replication initiator protein dnaA	chromosomal replication initiationprotein
    MIP_00002	DNA polymerase III subunit beta	DNA polymerase III subunit beta
    MIP_00003	DNA replication and repair protein recF	recombination protein F
    MIP_00004	Hypothetical protein	hypothetical protein Rv0004
    MIP_00006	DNA gyrase subunit B	DNA gyrase subunit B

Kindly help me out. Thanks!

TEJ

Edit: Changed the code as suggested by BrowserUk

Edit No.2: Got it to work guys! I just had to swap the two variables on either side of the pattern match! Stupid mistake, LOL :P Thanks for all the support :)

Re: help with comparing two arrays of phrases
by BrowserUk (Pope) on Oct 10, 2012 at 07:28 UTC

    The problem is that this:

    $temp_array[1] =~ s/\(\)//g; # Remove parentheses to avoid mishaps during pattern matching

    will only remove parentheses when they appear together as an adjacent "()" pair.

    I.e. it would remove these: "fred () bill () john", but not these: "(fred)(bill john)".

    A better way to remove all parens is to use tr///. Eg.  $temp_array[1] =~ tr/()//d;

    You could also use:  $temp_array[1] =~ s/[()]//g;
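To see the difference between the three forms side by side, here is a minimal sketch. The sample phrase is made up for illustration and is not taken from the input file.

```perl
use strict;
use warnings;

# Hypothetical sample phrase containing both an adjacent "()" pair
# and parentheses wrapped around a word.
my $phrase = 'fred () bill (john)';

(my $pairs_only = $phrase) =~ s/\(\)//g;   # deletes only the literal "()" pair
(my $with_tr    = $phrase) =~ tr/()//d;    # deletes every "(" and every ")"
(my $with_class = $phrase) =~ s/[()]//g;   # same result via a character class

print "s/\\(\\)//g : $pairs_only\n";
print "tr/()//d  : $with_tr\n";
print "s/[()]//g : $with_class\n";
```

Only the first form leaves "(john)" untouched, which is why a stray parenthesis could still reach the regex.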


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      Thanks, that sure solved the "Unmatched parenthesis in regex" problem. But I'm still not getting any output. As you can see from the sample input, the two phrases share more than one word.

        You need to look closely at your code and consider carefully what you are doing. Your code in no way matches up with your description.

        1. You say you skip words less than 3 chars, but your code contains tests for <3 and <5, and the code path for >4 seems identical to that for 3 & 4.
        2. You split the first phrase into words, but not the second.

          And you then test if each word in the first phrase contains the whole of the second phrase.

        3. You have a whole bunch of else { next; } clauses.

          Do you realise that if you omitted those clauses, it would just loop to the next iteration anyway?

          So what is their purpose?

        The bottom line is, with the disparity between your written description and your code, and a bunch of redundant stuff in your code, I cannot work out what you are actually trying to achieve, so I cannot really suggest anything.

        First decide and write down what you actually want to do; then try to write code that matches that description; and then, if it doesn't do what you want, come back and ask again.
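Point 2 above can be checked in isolation. The following is only a sketch: the variable names are invented for illustration, and the values are modelled on the second line of the sample input.

```perl
use strict;
use warnings;

# Hypothetical values modelled on the second sample line.
my $word   = 'polymerase';
my $phrase = 'DNA polymerase III subunit beta';

# As posted: the whole phrase is the pattern and the single word is the target,
# so this asks "does the word contain the phrase?"
my $word_contains_phrase = ($word =~ /$phrase/i) ? 1 : 0;

# Swapped: the word is the pattern and the phrase is the target,
# so this asks "does the phrase contain the word?"
my $phrase_contains_word = ($phrase =~ /$word/i) ? 1 : 0;

print "word =~ /phrase/ : $word_contains_phrase\n";
print "phrase =~ /word/ : $phrase_contains_word\n";
```

A single word can only contain the whole phrase when the phrase is itself that one word or a fragment of it, which would explain why the posted loop prints nothing for this input.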



Re: help with comparing two arrays of phrases
by Neighbour (Friar) on Oct 10, 2012 at 08:47 UTC
    The reason you're getting no results is because your comparison is the wrong way around :). You are checking to see if the word matches the regex with the entire 2nd phrase instead of the other way around.
    I've taken the liberty of changing a few other things (sorry, couldn't resist). Amongst other things, the changes are in
    • how you load the file contents
    • how you remove special characters (and when)
    • how you loop through the words
    • how you keep track of the match_count
    • and how you perform the match

    use strict;
    use warnings;
    use v5.10;

    my @data = <>;
    my $match_count = 1;

    foreach my $line (@data) {
        chomp ($line);
        print "Processing line [$line]\n";
        $line =~ s/[",\/-]/ /g; # Change all potential word endings to a single space
        $line =~ s/[()]//g;     # Remove parentheses to avoid mishaps during pattern matching
        my ($id, $source, $comparison) = split "\t", $line; # Split columns into an array
        foreach my $word (split ' ', $source) {
            given (length $word) {
                when ($_ < 3) { next; }
                when ($_ < 5) {
                    if ($comparison =~ /$word/i) {
                        print "Match [$match_count] (probable): [$word]\n";
                        $match_count++;
                    }
                }
                default {
                    if ($comparison =~ /$word/i) {
                        print "Match [$match_count]: [$word]\n";
                        $match_count++;
                    }
                }
            }
        }
    }
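On the original question of whether there is another way to do case-insensitive substring matching: a word interpolated into a regex is treated as regex syntax, so characters such as "(" still have to be stripped or escaped first. A possible alternative, sketched below, is to escape the word with \Q...\E or to avoid the regex engine altogether with index() on lower-cased copies. The helper names and the sample phrase are assumptions made for illustration; neither appears in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: regex match with the word escaped by \Q...\E,
# so metacharacters in $word are matched literally.
sub contains_word_regex {
    my ($phrase, $word) = @_;
    return $phrase =~ /\Q$word\E/i ? 1 : 0;
}

# Hypothetical helper: plain substring search on lower-cased copies,
# no regex involved at all.
sub contains_word_index {
    my ($phrase, $word) = @_;
    return index(lc($phrase), lc($word)) >= 0 ? 1 : 0;
}

# Made-up phrase that keeps its parentheses on purpose.
my $phrase = 'recombination protein (F)';
for my $word ('protein', '(F', 'gyrase') {
    printf "%-8s regex=%d index=%d\n", $word,
        contains_word_regex($phrase, $word),
        contains_word_index($phrase, $word);
}
```

With either helper the parentheses could stay in the data, at the cost of matching them literally rather than ignoring them.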

      Sorry for the late reply. Could not get online for a few days!!

      @Neighbour: Thanks for cleaning up the code! It does look way better than mine, and I will use your code from now on :)

      @BrowserUk: Yes, my code does look confusing being the novice I am. However, I will keep those points you mentioned in mind from now on. Thanks a lot for all your suggestions and corrections!! :)
