PerlMonks  

Matching complementary base pairs from 2 different files

by Meetali16 (Novice)
on Jan 01, 2018 at 06:39 UTC ( #1206505=perlquestion )
Meetali16 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I am very new to Perl, and I have two tab-separated files. File1:
    rs492602 **CC** Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
    rs492602 CT Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
    rs492602 TT Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
File2:
    rs492602 **GG** exm-rs492602
    rs492602 **CC** exm-rs492602
Result:
    rs492602 **CC** **GG** Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
    rs492602 **CC** **CC** Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
So far I have tried this, which matches only the RS_ids and not the base pairs:
    use warnings;
    use strict;

    open(F1,$ARGV[0]) or die("could not open $ARGV[0] due to $!\n");
    open(F2,$ARGV[1]) or die("could not open $ARGV[1] due to $!\n");
    my @arr1=<F1>;
    my @arr2=<F2>;
    chomp(@arr1);
    chomp(@arr2);
    my $x=shift(@arr1);
    my $i=0;
    print "$x\n";
    foreach my $line1(@arr2){
        chomp($line1);
        foreach my $line2(@arr1){
            chomp($line2);
            $line2=~/(\w+)\t.*/;
            my $rsid=$1;
            #while($rsid){
            #    $i++;
            #}
            if($line1 eq $rsid){
                print "$line2\n";
                $i++;
            }
        }
    }
Thank you!

Replies are listed 'Best First'.
Re: Matching complementary base pairs from 2 different files
by Athanasius (Chancellor) on Jan 01, 2018 at 09:16 UTC

    Hello Meetali16, and welcome to the Monastery!

    I assume that the “Result” you give is the output you want to receive? If so, it’s difficult to see why some matches are included and others excluded. Looking at the rsid fields doesn’t help, since all lines in both input files have the same rsid. Matching base pairs between the two files would then result in 6 matches:

    rs492602 CC GG Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
    rs492602 CC CC Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
    rs492602 CT GG Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
    rs492602 CT CC Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
    rs492602 TT GG Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
    rs492602 TT CC Vitamin B12 deficiency FUT2 Normal levels of vitamin B12

    — unless the asterisks are significant? Please clarify.

    In the meantime, I will propose a general strategy: Decide which of the two input files is likely to be shorter, and read the contents of that file into a hash. (The format for the hash will depend on the type of matching you require.) Then read the larger file, line by line, extracting its rsid and base pair fields and matching against the appropriate fields in the hash. Hash lookup is one of the areas where Perl really shines.
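
    As a rough illustration of that strategy, here is a minimal sketch. It assumes File2 is the smaller file, that File1 is tab-separated with the rsid in the first column and the base pair in the second, and that every base pair sharing an rsid should be paired. The subroutine names build_index and join_records are made up for this example, and the demo reads from in-memory filehandles rather than real files.

```perl
use strict;
use warnings;

# Build a lookup table from the smaller file: rsid => [ base pair, ... ].
# Assumption: whitespace-separated columns, rsid first, base pair second.
sub build_index {
    my ($fh) = @_;
    my %index;
    while (my $line = <$fh>) {
        my ($rsid, $pair) = split ' ', $line;
        next unless defined $pair;          # skip blank or short lines
        push @{ $index{$rsid} }, $pair;
    }
    return \%index;
}

# Read the larger file line by line and build one joined line for every
# base pair stored under the same rsid in the index.
# Assumption: tab-separated columns, rsid first, base pair second.
sub join_records {
    my ($fh, $index) = @_;
    my @joined;
    while (my $line = <$fh>) {
        chomp $line;
        my ($rsid, $pair, @rest) = split /\t/, $line;
        next unless defined $pair && exists $index->{$rsid};
        for my $other (@{ $index->{$rsid} }) {
            push @joined, join("\t", $rsid, $pair, $other, @rest);
        }
    }
    return @joined;
}

# Demo data copied from the question, held in strings so the sketch
# runs without any input files.
my $small = "rs492602\tGG\texm-rs492602\n"
          . "rs492602\tCC\texm-rs492602\n";
my $large = "rs492602\tCC\tVitamin B12 deficiency\tFUT2\tHigher levels of vitamin B12\n"
          . "rs492602\tCT\tVitamin B12 deficiency\tFUT2\tNormal levels of vitamin B12\n"
          . "rs492602\tTT\tVitamin B12 deficiency\tFUT2\tNormal levels of vitamin B12\n";

open my $small_fh, '<', \$small or die "Cannot open in-memory data: $!";
open my $large_fh, '<', \$large or die "Cannot open in-memory data: $!";

my $index  = build_index($small_fh);
my @joined = join_records($large_fh, $index);

print "$_\n" for @joined;
print "Found ", scalar(@joined), " matches\n";
```

    Only the smaller file is held in memory; the larger one is read one line at a time, so memory use depends on the smaller input. If the matching rule turns out to be stricter than "same rsid", only the test inside join_records should need to change.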

    I notice you increment $i on each match, but never use it. I’m guessing you want to:

    print "Found $i matches\n";

    at the end of the script?

    Please clarify what you are trying to achieve, to make it easier for the Monks to help you — and please remember that most of us are not biologists!

    Cheers,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Yes, you are absolutely right!

      No, the asterisks are not significant; they were just there to highlight the area of focus.

      And yes, the 6 results are all I want! I shall work on the idea and keep non-biologists in mind.

        In that case, the following should be close to what you’re looking for:

        use strict;
        use warnings;

        open(my $f1, '<', $ARGV[0])
            or die "Could not open '$ARGV[0]' for reading, stopped";
        open(my $f2, '<', $ARGV[1])
            or die "Could not open '$ARGV[1]' for reading, stopped";

        my @arr1 = <$f1>;
        my @arr2 = <$f2>;

        close $f1 or die "Could not close '$ARGV[0]', stopped";
        close $f2 or die "Could not close '$ARGV[1]', stopped";

        my %hash;

        for my $element (@arr2)
        {
            push @{ $hash{$1} }, $2
                if $element =~ m{ ^ \s* (rs\d+) \s+ \** (\w{2}) }x;
        }

        my $i = 0;

        for my $element (@arr1)
        {
            if (my ($rsid, $base1, $description) =
                    $element =~ m{ ^ \s* (rs\d+) \s+ \** (\w{2}) \** \s+ (.*) $ }x)
            {
                if (exists $hash{$1})
                {
                    for my $base2 (@{ $hash{$1} })
                    {
                        print "$rsid $base1 $base2 $description\n";
                        ++$i;
                    }
                }
            }
        }

        print "Found $i matches\n";

        Output:

        23:14 >perl 1852_SoPW.pl file1 file2
        rs492602 CC GG Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
        rs492602 CC CC Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
        rs492602 CT GG Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
        rs492602 CT CC Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
        rs492602 TT GG Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
        rs492602 TT CC Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
        Found 6 matches

        23:14 >

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Matching complementary base pairs from 2 different files
by poj (Monsignor) on Jan 01, 2018 at 09:01 UTC

    Is the result shown what you want? It isn't the result I get from running your code.

    poj
      It is the result I want.

Node Type: perlquestion [id://1206505]
Approved by haukex
Front-paged by haukex