Regular expression help

by ananassa (Initiate)
on Nov 21, 2012 at 11:03 UTC (#1004901=perlquestion)
ananassa has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I need to find matches between two tab-delimited files like this:

File 1:
ID1  1   65383896  65383896  G  C  PCNXL3
ID1  2   56788990  55678900  T  A  ACT1
ID1  1   56788990  55678900  T  A  PRO55

File 2:
ID2  34  65383896  65383896  G  C  MET5
ID2  2   56788990  55678900  T  A  ACT1
ID2  2   56788990  55678900  T  A  HLA

What I would like to do is retrieve the matching lines between the two files. What I would like to match is everything after the gene ID. So far I have written this code, but unfortunately Perl keeps giving me the error "Use of uninitialized value in pattern match (m//)". Could you please help me figure out where I am going wrong? Thank you in advance!

#!/usr/bin/perl -w
use strict;

open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";

my @sample1 = <INA>;
my @sample2 = <INB>;

foreach my $line (@sample1) {
    my @tab = split (/\t/, $line);
    my $chr = $tab[1];
    my $start = $tab[2];
    #my $end = $tab[3];
    my $ref = $tab[4];
    my $alt = $tab[5];
    my $name = $tab[6];

    foreach my $item (@sample2){
        my @fields = split (/\t/,$item);
        if ($fields[1]=~ m/$chr(.*)/
            && $fields[2]=~ m/$start(.*)/
            && $fields[4]=~ m/$ref(.*)/
            && $fields[5]=~ m/$alt(.*)/
            && $fields[6]=~ m/$name(.*)/){
            print $line,"\n",$item;
        }
    }
}

Replies are listed 'Best First'.
Re: Regular expression help
by roboticus (Chancellor) on Nov 21, 2012 at 13:21 UTC


    Your code assumes that you get seven columns back from the split of $item in your inner loop, and then uses several of those columns in regex matches. However, the last line of your file, for example, is likely empty (because many files end in a '\n'), leaving you with fewer than 2 columns in your @fields array.

    You might try something like:

    ...
    my @fields = split /\t/, $item;
    next if @fields < 7;
    ...
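
A minimal sketch of that guard in isolation, assuming the input ends with a blank line. The helper name `fields_or_skip` and the two sample lines are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a line on tabs and return the fields,
# or an empty list when the line has fewer than the seven expected columns.
sub fields_or_skip {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line;
    return @fields < 7 ? () : @fields;
}

# Made-up sample input: one complete row followed by a blank line.
my @lines = (
    "ID2\t2\t56788990\t55678900\tT\tA\tACT1\n",
    "\n",
);

for my $line (@lines) {
    my @fields = fields_or_skip($line);
    if (@fields) {
        print "kept row for gene $fields[6]\n";
    }
    else {
        print "skipped short line\n";
    }
}
```

Without the column check, the blank line would leave $fields[1] and the later elements undefined, which is the kind of value that triggers the "uninitialized value in pattern match" warning.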


    When your only tool is a hammer, all problems look like your thumb.

Re: Regular expression help
by Anonymous Monk on Nov 21, 2012 at 11:22 UTC
Re: Regular expression help
by Don Coyote (Pilgrim) on Nov 21, 2012 at 14:33 UTC

    ananassa, I think you are expending too much energy on splitting up the lines to match the separate parts.

    You could use split to remove the ID, but I have suggested a substitution that strips the ID off the front of each line. You then just compare the whole of the remaining line against the lines of the second file.

    This is untested, but it should do the trick or, if not, give you another way to approach the problem.

    # strip the ID (and the trailing newline) from the lines of both @sample arrays
    foreach (@sample1, @sample2) {
        chomp;
        s/^ID\d+\s+//;
    }

    # set an array to catch the matching lines
    my @matches;

    # now compare each array element in @sample1 to each array element in @sample2
    foreach my $line1 (@sample1) {
        foreach my $line2 (@sample2) {
            push(@matches, $line2) if ($line2 eq $line1);
        }
    }

    print join("\n", @matches), "\n";

    This type of thing can also be done using grep, map and hash key indexing.
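
A rough sketch of the grep/map/hash variant, under the assumption that the ID is always the first tab-separated column. The helper names `key_without_id` and `matching_lines` are hypothetical; the sample rows are the ones from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop the first tab-separated column (the ID)
# and the trailing newline, returning the rest of the line as a key.
sub key_without_id {
    my ($line) = @_;
    chomp $line;
    my (undef, $rest) = split /\t/, $line, 2;
    return $rest;    # undef when the line contains no tab at all
}

# Hypothetical helper: return the lines of @$second whose key also
# occurs among the keys of @$first.
sub matching_lines {
    my ($first, $second) = @_;

    # map builds the lookup hash; the inner grep drops lines without a key
    my %seen = map { $_ => 1 }
               grep { defined $_ }
               map { key_without_id($_) } @$first;

    return grep {
        my $key = key_without_id($_);
        defined $key && $seen{$key};
    } @$second;
}

# Sample rows taken from the question
my @file1 = (
    "ID1\t1\t65383896\t65383896\tG\tC\tPCNXL3\n",
    "ID1\t2\t56788990\t55678900\tT\tA\tACT1\n",
    "ID1\t1\t56788990\t55678900\tT\tA\tPRO55\n",
);
my @file2 = (
    "ID2\t34\t65383896\t65383896\tG\tC\tMET5\n",
    "ID2\t2\t56788990\t55678900\tT\tA\tACT1\n",
    "ID2\t2\t56788990\t55678900\tT\tA\tHLA\n",
);

print matching_lines(\@file1, \@file2);
```

With these rows only the ACT1 line of the second file shares everything after the ID with a line of the first file. Each file is scanned once instead of every line being compared against every other line.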

Re: Regular expression help
by space_monk (Chaplain) on Nov 21, 2012 at 18:15 UTC

    From what I can tell, you are checking that everything after the first field matches. The best way to do this is to put the rows of the first file into a hash, keyed on everything after the ID, and then use the rows of the second file to look up whether a matching hash entry exists.

    The following code is not guaranteed to run (I had a long night last night!) but should show the general idea....

    #!/usr/bin/perl -w
    use strict;

    open (INA, $ARGV[0]) || die "cannot open gene file";
    open (INB, $ARGV[1]) || die "cannot open coding_annotated.var files";

    my @sample1 = <INA>;
    my @sample2 = <INB>;

    my %hash1;    # rest of line => ID, built from the first file

    # use map for this maybe?
    foreach my $line (@sample1) {
        my ($id, $rest) = split(/\t/, $line, 2);
        next unless defined $rest;    # skip blank or single-column lines
        chomp($rest);
        $hash1{$rest} = $id;
    }

    foreach my $line (@sample2) {
        my ($id, $rest) = split(/\t/, $line, 2);
        next unless defined $rest;
        chomp($rest);
        if (exists($hash1{$rest})) {
            print "Match: $line";    # $line still ends in its newline
        }
    }

    A Monk aims to give answers to those who have none, and to learn from those who know more.
