http://www.perlmonks.org?node_id=998153

sdtej has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks, I have a tab-delimited text file with three columns. The first column is an ID, and the next two columns are phrases (descriptions). I am trying to write code that compares the two phrases to see whether they are "similar". My approach is to split the first phrase into individual words, skip any word shorter than 3 characters, and then check whether each remaining word occurs in the second phrase. As output, I print every match. Here is the code I have written so far:

my @data;
while(<>) {
    push @data, $_;
}
foreach my $line (@data) {
    my @temp_array = split "\t", $line; # Split columns into an array
    $temp_array[1] =~ tr/\"\-\/,/ /; # Change all potential word endings to a single space
    $temp_array[1] =~ tr/\(\)//d; # Remove parentheses to avoid mishaps during pattern matching
    $temp_array[2] =~ tr/\"\-\/,/ /; # Same as above
    $temp_array[2] =~ tr/\(\)//d; # Same as above
    my @words = split " ", $temp_array[1]; # Split first phrase into individual words
    for(my $i = 0; $i < @words; $i++) {
        my $match_count = 1;
        if(length ($words[$i]) < 3) {
            next;
        }
        elsif(length ($words[$i]) < 5) {
            if($words[$i] =~ /$temp_array[2]/i) {
                print "Match $match_count (probable): $words[$i]\n";
                $match_count++;
            }
            else {
                next;
            }
        }
        else {
            if($words[$i] =~ /$temp_array[2]/i) {
                print "Match $match_count: $words[$i] \n";
                $match_count++;
            }
            else {
                next;
            }
        }
    }
}

Running this code produces no output and gives the warning "Unmatched parenthesis in regex", even though I'm removing all parentheses from the text. All my debugging and testing points to the pattern matching as the source of the error. Is there any other way to achieve what I want (that is, case-insensitive substring matching)? Or, even better, has someone else already written such code? Here are the first five lines of my input for your reference:

MIP_00001	Chromosomal replication initiator protein dnaA	chromosomal replication initiationprotein
MIP_00002	DNA polymerase III subunit beta	DNA polymerase III subunit beta
MIP_00003	DNA replication and repair protein recF	recombination protein F
MIP_00004	Hypothetical protein	hypothetical protein Rv0004
MIP_00006	DNA gyrase subunit B	DNA gyrase subunit B

Kindly help me out. Thanks!

TEJ

Edit: Changed the code as suggested by BrowserUk

Edit No.2: Got it to work, guys! I just had to swap the variables on either side of my pattern match! Stupid mistake, LOL :P Thanks for all the support :)
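For anyone landing here later, below is a minimal sketch of the swapped match, not the exact code I ended up with. The helper name matching_words is made up for illustration, and the \Q...\E quoting is an extra precaution I'm assuming is wanted so that any regex metacharacters left in a word are treated literally. The phrase being searched goes on the left of =~ and the word becomes the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch): return the words of
# $first that are at least 3 characters long and occur, case-insensitively,
# as substrings of $second.
sub matching_words {
    my ($first, $second) = @_;

    # Same cleanup as in the script above: separators become spaces,
    # parentheses are removed. $_ aliases each local copy in turn.
    for ($first, $second) {
        tr/\"\-\/,/ /;
        tr/\(\)//d;
    }

    my @hits;
    for my $word (split ' ', $first) {
        next if length($word) < 3;
        # Phrase on the left, word as the pattern. \Q...\E quotes regex
        # metacharacters (an assumption beyond the original fix).
        push @hits, $word if $second =~ /\Q$word\E/i;
    }
    return @hits;
}

my @hits = matching_words(
    'DNA replication and repair protein recF',
    'recombination protein F',
);
print "Match: $_\n" for @hits;
```

This leaves out the "probable" label for words shorter than 5 characters; that check can be added back inside the loop. If no regex features are needed at all, comparing with index(lc $second, lc $word) >= 0 should give the same case-insensitive substring test.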