PerlMonks  

Regular expression help

by ananassa (Initiate)
on Nov 21, 2012 at 11:03 UTC ( #1004901=perlquestion )
ananassa has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I need to find matches between two tab-delimited files like this:

File 1:
ID1  1   65383896  65383896  G  C  PCNXL3
ID1  2   56788990  55678900  T  A  ACT1
ID1  1   56788990  55678900  T  A  PRO55

File 2:
ID2  34  65383896  65383896  G  C  MET5
ID2  2   56788990  55678900  T  A  ACT1
ID2  2   56788990  55678900  T  A  HLA

What I would like to do is retrieve the matching lines between the two files. What I would like to match is everything after the gene ID. So far I have written this code, but unfortunately Perl keeps giving me the error "Use of uninitialized value in pattern match (m//)". Could you please help me figure out where I am going wrong? Thank you in advance!

#!/usr/bin/perl -w
use strict;

open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";

my @sample1 = <INA>;
my @sample2 = <INB>;

foreach my $line (@sample1) {
    my @tab = split (/\t/, $line);
    my $chr = $tab[1];
    my $start = $tab[2];
    #my $end = $tab[3];
    my $ref = $tab[4];
    my $alt = $tab[5];
    my $name = $tab[6];

    foreach my $item (@sample2){
        my @fields = split (/\t/,$item);
        if ($fields[1]=~ m/$chr(.*)/ && $fields[2]=~ m/$start(.*)/ && $fields[4]=~ m/$ref(.*)/ && $fields[5]=~ m/$alt(.*)/&& $fields[6]=~ m/$name(.*)/){
            print $line,"\n",$item;
        }
    }
}

Re: Regular expression help
by Anonymous Monk on Nov 21, 2012 at 11:22 UTC
Re: Regular expression help
by roboticus (Canon) on Nov 21, 2012 at 13:21 UTC

    ananassa:

    Your code assumes that the split on line 22 always gives you seven columns, and then uses several of them in regex matches. However, the last line of your file, for example, is likely empty (because many files end in a '\n'), which leaves fewer than two elements in your @fields array. The elements you then index are undefined, hence the warning.

    You might try something like:

    ...
    my @fields = split /\t/, $item;
    next if @fields < 7;
    ...
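    To illustrate the guard in isolation, here is a minimal sketch. The helper name usable_fields is hypothetical (it is not from the thread), and the sample lines are copied from File 2 in the question with a blank line appended to stand in for the suspected empty trailing line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns an array reference of
# fields, or undef when the line has fewer than the expected seven columns.
sub usable_fields {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line;
    return @fields < 7 ? undef : \@fields;
}

# Two records from File 2 in the question, plus a blank line (an assumption
# about what the real input might contain).
my @lines = (
    "ID2\t34\t65383896\t65383896\tG\tC\tMET5\n",
    "ID2\t2\t56788990\t55678900\tT\tA\tACT1\n",
    "\n",
);

for my $line (@lines) {
    my $fields = usable_fields($line);
    if (!defined $fields) {
        # Short or empty line: indexing it would yield undef values.
        print "skipping short line\n";
        next;
    }
    print "chr=$fields->[1] start=$fields->[2] gene=$fields->[6]\n";
}
```

    Skipping short lines before indexing into the array means no undefined value ever reaches a pattern match, which should silence the warning.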

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Regular expression help
by Don Coyote (Monk) on Nov 21, 2012 at 14:33 UTC

    ananassa, I think you are expending too much energy on splitting up the lines to match the separate parts.

    Though you could use split to clear the ID, I have suggested a substitution to strip the ID out of the line. You then just compare the whole of the remaining line against the lines of the second file.

    This is untested, but it should do the trick or, if not, give you another way to approach the problem.

    # strip the ID from every line of both @sample arrays
    foreach (@sample1, @sample2) {
        s/^ID\d+\s+(.*)/$1/;
    }

    # set up an array to catch the matching lines
    my @matches;

    # now compare each array element in @sample1 to each array element in @sample2
    my ($line1, $line2);
    foreach $line1 (@sample1) {
        foreach $line2 (@sample2) {
            push(@matches, $line2) if ($line2 eq $line1);
        }
    }

    print join("\n", @matches);

    This type of thing can also be done using grep, map and hash key indexing.
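    As a rough sketch of that grep/map/hash idea: the arrays below are in-memory stand-ins for the two files from the question, and key_for is a hypothetical helper name, so treat this as one possible shape rather than a tested solution for the real data.

```perl
use strict;
use warnings;

# In-memory stand-ins for the two files shown in the question.
my @sample1 = (
    "ID1\t1\t65383896\t65383896\tG\tC\tPCNXL3\n",
    "ID1\t2\t56788990\t55678900\tT\tA\tACT1\n",
    "ID1\t1\t56788990\t55678900\tT\tA\tPRO55\n",
);
my @sample2 = (
    "ID2\t34\t65383896\t65383896\tG\tC\tMET5\n",
    "ID2\t2\t56788990\t55678900\tT\tA\tACT1\n",
    "ID2\t2\t56788990\t55678900\tT\tA\tHLA\n",
);

# Hypothetical helper: everything after the first (ID) column, without the
# trailing newline. Returns undef for lines that have no second column.
sub key_for {
    my ($line) = @_;
    chomp $line;
    my (undef, $rest) = split /\t/, $line, 2;
    return $rest;
}

# map: index file 1 by the part of each line that follows the ID.
my %seen = map { ($_ => 1) } grep { defined } map { key_for($_) } @sample1;

# grep: keep the lines of file 2 whose key is present in the index.
my @matches = grep {
    my $key = key_for($_);
    defined $key && exists $seen{$key};
} @sample2;

print @matches;
```

    Each file is scanned once and the comparison is an exact string match on the remainder of the line, so no regex interpolation is involved.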

Re: Regular expression help
by space_monk (Chaplain) on Nov 21, 2012 at 18:15 UTC

    From what I can tell, you are checking that everything after the first field matches. The best way to do this is to put the rows of the first file into a hash and then use the rows of the second file to do a lookup to see if the hash entry exists.

    The following code is not guaranteed to run (I had a long night last night!) but should show the general idea....

    #!/usr/bin/perl -w
    use strict;

    open (INA, $ARGV[0]) || die "cannot to open gene file";
    open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";

    my @sample1 = <INA>;
    my @sample2 = <INB>;

    # maps "everything after the ID" to the ID it came from
    my %hash1;

    # use map for this maybe?
    foreach my $line (@sample1) {
        my ($id, $rest) = split( /\t/, $line, 2);
        next unless defined $rest;    # skip blank or single-column lines
        chomp ($rest);
        $hash1{$rest} = $id;
    }

    foreach my $line (@sample2) {
        my ($id, $rest) = split( /\t/, $line, 2);
        next unless defined $rest;
        chomp ($rest);
        if (exists($hash1{$rest})) {
            print "Match: $line\n";
        }
    }

    A Monk aims to give answers to those who have none, and to learn from those who know more.
