http://www.perlmonks.org?node_id=1206507


in reply to Matching complementary base pairs from 2 different files

Hello Meetali16, and welcome to the Monastery!

I assume that the “Result” you give is the output you want to receive? If so, it’s difficult to see why some matches are included and others excluded. Looking at the rsid fields doesn’t help, since all lines in both input files have the same rsid. Matching base pairs between the two files would then result in 6 matches:

rs492602 CC GG Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
rs492602 CC CC Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
rs492602 CT GG Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
rs492602 CT CC Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
rs492602 TT GG Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
rs492602 TT CC Vitamin B12 deficiency FUT2 Normal levels of vitamin B12

— unless the asterisks are significant? Please clarify.

In the meantime, I will propose a general strategy: Decide which of the two input files is likely to be shorter, and read the contents of that file into a hash. (The format for the hash will depend on the type of matching you require.) Then read the larger file, line by line, extracting its rsid and base pair fields and matching against the appropriate fields in the hash. Hash lookup is one of the areas where Perl really shines.
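
To make the strategy concrete, here is a minimal sketch. It assumes whitespace-separated fields with the rsid first and the base pair second, and it treats two lines as matching whenever they share an rsid. The sample data and the sub names build_lookup and find_matches are made up for illustration; adjust them to your real files and matching rule.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two input files
my $small_file = "rs492602 GG\nrs492602 CC\n";
my $large_file = "rs492602 CC Higher levels\nrs492602 TT Normal levels\n";

# Build a hash of arrays: rsid => [ base pairs seen in the smaller file ]
sub build_lookup
{
    my ($fh) = @_;
    my %lookup;
    while (my $line = <$fh>)
    {
        my ($rsid, $bases) = split ' ', $line;
        next unless defined $bases;          # skip blank or short lines
        push @{ $lookup{$rsid} }, $bases;
    }
    return \%lookup;
}

# Read the larger file line by line, returning one output line per match
sub find_matches
{
    my ($fh, $lookup) = @_;
    my @matches;
    while (my $line = <$fh>)
    {
        chomp $line;
        my ($rsid, $bases, $rest) = split ' ', $line, 3;
        next unless defined $bases && exists $lookup->{$rsid};
        $rest //= '';
        push @matches, "$rsid $bases $_ $rest" for @{ $lookup->{$rsid} };
    }
    return @matches;
}

# In-memory filehandles keep the sketch self-contained
open(my $small_fh, '<', \$small_file) or die "Cannot open in-memory file: $!";
open(my $large_fh, '<', \$large_file) or die "Cannot open in-memory file: $!";

my $lookup  = build_lookup($small_fh);
my @matches = find_matches($large_fh, $lookup);

print "$_\n" for @matches;
print 'Found ', scalar @matches, " matches\n";
```

Only the smaller file is held in memory; the larger one is streamed, so memory use stays proportional to the smaller input.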

I notice you increment $i on each match, but never use it. I’m guessing you want to:

print "Found $i matches\n";

at the end of the script?

Please clarify what you are trying to achieve, to make it easier for the Monks to help you — and please remember that most of us are not biologists!

Cheers,

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re^2: Matching complementary base pairs from 2 different files
by Meetali16 (Novice) on Jan 01, 2018 at 10:42 UTC
    Yes, you are absolutely right!

    No, the asterisks are not significant; they were just there to highlight the area of focus.

    And yes, the 6 results are all I want! I shall work on the idea and keep non-biologists in mind.

      In that case, the following should be close to what you’re looking for:

      use strict;
      use warnings;

      open(my $f1, '<', $ARGV[0])
          or die "Could not open '$ARGV[0]' for reading, stopped";
      open(my $f2, '<', $ARGV[1])
          or die "Could not open '$ARGV[1]' for reading, stopped";

      my @arr1 = <$f1>;
      my @arr2 = <$f2>;

      close $f1 or die "Could not close '$ARGV[0]', stopped";
      close $f2 or die "Could not close '$ARGV[1]', stopped";

      # Map each rsid in the second file to the base pairs recorded for it
      my %hash;
      for my $element (@arr2)
      {
          push @{ $hash{$1} }, $2
              if $element =~ m{ ^ \s* (rs\d+) \s+ \** (\w{2}) }x;
      }

      # For each line of the first file, print one result per stored base pair
      my $i = 0;
      for my $element (@arr1)
      {
          if (my ($rsid, $base1, $description) =
                  $element =~ m{ ^ \s* (rs\d+) \s+ \** (\w{2}) \** \s+ (.*) $ }x)
          {
              if (exists $hash{$rsid})
              {
                  for my $base2 (@{ $hash{$rsid} })
                  {
                      print "$rsid $base1 $base2 $description\n";
                      ++$i;
                  }
              }
          }
      }

      print "Found $i matches\n";

      Output:

      23:14 >perl 1852_SoPW.pl file1 file2
      rs492602 CC GG Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
      rs492602 CC CC Vitamin B12 deficiency FUT2 Higher levels of vitamin B12
      rs492602 CT GG Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
      rs492602 CT CC Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
      rs492602 TT GG Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
      rs492602 TT CC Vitamin B12 deficiency FUT2 Normal levels of vitamin B12
      Found 6 matches

      23:14 >

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,