Re: Matching complementary base pairs from 2 different files

in reply to Matching complementary base pairs from 2 different files

Hello Meetali16, and welcome to the Monastery!

I assume that the “Result” you give is the output you want to receive? If so, it’s difficult to see why some matches are included and others excluded. Looking at the rsid fields doesn’t help, since all lines in both input files have the same rsid. Matching base pairs between the two files would then result in 6 matches:

rs492602 CC GG Vitamin B12 deficiency    FUT2    Higher levels of vita
+min B12
rs492602 CC CC Vitamin B12 deficiency    FUT2    Higher levels of vita
+min B12
rs492602 CT GG Vitamin B12 deficiency    FUT2    Normal levels of vita
+min B12
rs492602 CT CC Vitamin B12 deficiency    FUT2    Normal levels of vita
+min B12
rs492602 TT GG Vitamin B12 deficiency    FUT2    Normal levels of vita
+min B12
rs492602 TT CC Vitamin B12 deficiency    FUT2    Normal levels of vita
+min B12
[download]

— unless the asterisks are significant? Please clarify.

In the meantime, I will propose a general strategy: Decide which of the two input files is likely to be shorter, and read the contents of that file into a hash. (The format for the hash will depend on the type of matching you require.) Then read the larger file, line by line, extracting its rsid and base pair fields and matching against the appropriate fields in the hash. Hash lookup is one of the areas where Perl really shines.

I notice you increment $i on each match, but never use it. I’m guessing you want to:

print "Found $i matches\n";
[download]

at the end of the script?

Please clarify what you are trying to achieve, to make it easier for the Monks to help you — and please remember that most of us are not biologists!

Cheers,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Comment on Re: Matching complementary base pairs from 2 different files Select or Download Code

Replies are listed 'Best First'.

Re^2: Matching complementary base pairs from 2 different files
by Meetali16 (Novice) on Jan 01, 2018 at 10:42 UTC

no the asterisks are not significant, they were just to highlight the area of focus

[reply]

Re^3: Matching complementary base pairs from 2 different files

by Athanasius (Archbishop) on Jan 01, 2018 at 13:16 UTC

In that case, the following should be close to what you’re looking for:

use strict;
use warnings;

open(my $f1, '<', $ARGV[0])
    or die "Could not open '$ARGV[0]' for reading, stopped";

open(my $f2, '<', $ARGV[1])
    or die "Could not open '$ARGV[1]' for reading, stopped";

my @arr1 = <$f1>;
my @arr2 = <$f2>;

close $f1
    or die "Could not close '$ARGV[0]', stopped";

close $f2
    or die "Could not close '$ARGV[1]', stopped";

my %hash;

for my $element (@arr2)
{
    push @{ $hash{$1} }, $2
        if $element =~ m{ ^ \s* (rs\d+) \s+ \** (\w{2}) }x;
}

my $i = 0;

for my $element (@arr1)
{
    if (my ($rsid, $base1, $description) =
        $element =~ m{ ^ \s* (rs\d+) \s+ \** (\w{2}) \** \s+ (.*) $ }x
+)
    {
        if (exists $hash{$1})
        {
            for my $base2 (@{ $hash{$1} })
            {
                print "$rsid $base1 $base2 $description\n";
                ++$i;
            }
        }
    }
}

print "Found $i matches\n";
[download]

Output:

23:14 >perl 1852_SoPW.pl file1 file2
rs492602 CC GG Vitamin B12 deficiency    FUT2    Higher levels of vita
+min B12
rs492602 CC CC Vitamin B12 deficiency    FUT2    Higher levels of vita
+min B12
rs492602 CT GG Vitamin B12 deficiency    FUT2    Normal levels of vita
+min B12
rs492602 CT CC Vitamin B12 deficiency    FUT2    Normal levels of vita
+min B12
rs492602 TT GG Vitamin B12 deficiency    FUT2    Normal levels of vita
+min B12
rs492602 TT CC Vitamin B12 deficiency    FUT2    Normal levels of vita
+min B12
Found 6 matches

23:14 >
[download]

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

In Section Seekers of Perl Wisdom