Can't access data stored in Hash

corcra has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I am a novice and would be grateful for your help. I tried to write a piece of code to compare two files which look like

File 1

['CHROM', 'POS', 'REF', 'ALT', 'LIST', 'SAMPLE_1A', 'SAMPLE_2A', 'SAMPLE_3A']
['M', '16', 'T', 'C', 'C', 'REF', 'C', 'REF']
['M', '381', 'T', 'A', 'A', 'T', 'REF', 'REF']
['M', '529', 'A', 'G', 'G', 'REF', 'G', 'REF']

File 2

['CHROM', 'POS', 'REF', 'ALT', 'LIST', 'SAMPLE_1B', 'SAMPLE_2B', 'SAMPLE_3B']
['M', '16', 'T', 'C', 'C', 'C', 'REF', 'REF']
['M', '381', 'T', 'A', 'A', 'A', 'REF', 'REF']
['M', '528', 'A', 'G', 'G', 'REF', 'REF', 'REF']

I am trying to write code that prints out file 1 again, except that wherever a sample value in file 1 is not 'REF' it looks up the same position in file 2. If the corresponding file 2 value is 'REF', it should print the original value from file 1. If the corresponding file 2 value is not 'REF', it should print the value found in file 2. I also wish to ignore positions in file 2 that are not present in file 1. Using the examples above, my output for the first two rows should look like

['CHROM', 'POS', 'REF', 'ALT', 'LIST', 'SAMPLE_1A', 'SAMPLE_2A', 'SAMPLE_3A']
['M', '16', 'T', 'C', 'C', 'REF', 'C', 'REF']
['M', '381', 'T', 'A', 'A', 'A', 'REF', 'REF']
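The rule described above can be written as one small function, which makes it easy to check against the example rows. This is only a sketch: merge_value is a made-up name, the values are assumed to be already stripped of quotes, and keeping the file 1 value when file 2 has no row for that position is an assumption, since the question does not say what should happen then.

```perl
use strict;
use warnings;

# Decide what to print for one sample column.
# $v1 is the value from file 1, $v2 the value at the same position in file 2
# (undef when file 2 has no row for that position; that case is an assumption).
sub merge_value {
    my ($v1, $v2) = @_;
    return $v1 if $v1 eq 'REF';                    # file 1 is REF: keep it
    return $v1 if !defined $v2 || $v2 eq 'REF';    # file 2 is REF or missing: keep file 1
    return $v2;                                    # both are non-REF: take file 2
}

# Position 381 from the example: file 1 has T/REF/REF, file 2 has A/REF/REF.
my @file1 = ('T', 'REF', 'REF');
my @file2 = ('A', 'REF', 'REF');
my @merged = map { merge_value($file1[$_], $file2[$_]) } 0 .. $#file1;
print join(', ', @merged), "\n";    # A, REF, REF
```

For position 16 (file 1 REF/C/REF, file 2 C/REF/REF) the same function returns REF, C, REF, which matches the expected output shown above.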

The code I have written so far is:

#!/usr/bin/local/perl

use strict;
use warnings;
use feature 'say';

my %HashRef = ();
my $File1 = 'blah.txt';
my $File2 = 'moreblah.txt';
my $outfile = 'blank.txt';

open FILE2, "< $File2" or die "could not open tumour file...\n";
open FILE1, "< $File1" or die "could not open host file...\n";
open( my $out_fh, '>', $outfile ) or die "$!";

while (my $cols = <FILE2>)
{
    chomp $cols;
    my @values = split ',', $cols;
    for my $i (5..$#values)
    {
        push( @{$HashRef{ $values[2] }}, $values[$i]);
    }
}

while (my $cols = <FILE1>)
{
    chomp $cols;
    my @values = split ',', $cols;
    my @newarray = ($values[0], $values[1], $values[2], $values[3], $values[4]);
    for my $j (5..$#values)
    {
        if ($values[$j] =~ m/'REF'/)
        {
            push(@newarray, $values[$j]);
        }
        elsif ($HashRef{ $values[$j] } =~ m/'REF'/)
        {
            push(@newarray, $values[$j]);
        }
        else
        {
            push(@newarray, " REF ");
        }
    }
    say $out_fh @newarray;
}
close $out_fh;
close FILE1;
close FILE2;

I keep getting the error "Use of uninitialized value within %HashRef in pattern match (m//) at test.pl line 45, <FILE1> line 27". I tried storing the columns I wanted from file 2 in a hash in order to look them up but it just doesn't seem to be working. I have tried everything I can think of but at this stage I am just running into a brick wall. Please help! Any and all suggestions/corrections/criticisms are welcome!


Re: Can't access data stored in Hash - help!
by roboticus (Chancellor) on Aug 05, 2014 at 00:59 UTC

corcra:

A couple notes:

The "Use of uninitialized value" error may just be due to blank lines in (or at the end of) your file(s). Often, when I'm debugging file readers, I'll print each line just after reading it and before processing it, and print the values at various locations in the code.
Arrays in Perl are 0-based. I mention that because it looks like you're using the third column of File2 as your line identifier, but your data indicates that the second column would be a better choice. (Again, printing the values in your loops would help you see that stuff.)
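To make both notes concrete, here is a minimal sketch of the file 2 reader with blank lines skipped and the hash keyed on the second column (index 1, POS). The names parse_line and %samples_at are made up for the example, the lines are held in an array rather than read from a file, and the parsing assumes the bracketed, single-quoted format shown in the question.

```perl
use strict;
use warnings;

# Split one line such as "['M', '16', 'T', 'C']" into bare fields.
sub parse_line {
    my ($line) = @_;
    $line =~ s/^\s*\[//;    # drop the leading bracket
    $line =~ s/\]\s*$//;    # drop the trailing bracket
    my @fields = split /,/, $line;
    for (@fields) {
        s/^\s*'?//;         # drop leading space and quote
        s/'?\s*$//;         # drop trailing quote and space
    }
    return @fields;
}

# Stand-in for the lines of file 2, including a blank line.
my @file2_lines = (
    "['CHROM', 'POS', 'REF', 'ALT', 'LIST', 'SAMPLE_1B', 'SAMPLE_2B', 'SAMPLE_3B']",
    "['M', '16', 'T', 'C', 'C', 'C', 'REF', 'REF']",
    "",
    "['M', '381', 'T', 'A', 'A', 'A', 'REF', 'REF']",
);

my %samples_at;    # POS => array ref of sample values
for my $line (@file2_lines) {
    next unless $line =~ /\S/;        # skip blank lines
    my @values = parse_line($line);
    next if $values[1] eq 'POS';      # skip the header row
    $samples_at{ $values[1] } = [ @values[5 .. $#values] ];    # index 1 is POS
}

print "381: @{ $samples_at{381} }\n";    # 381: A REF REF
```

A lookup then takes the position from the file 1 row (its index 1 as well) and the sample number within the stored array, instead of using a sample value as the hash key.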

While I'm advocating that you add a few print statements during development, you don't need to leave them in when your code works. So don't be afraid to put in a few prints here and there. I tend to use a prefix on the print strings to help me figure out what's printing and why. Something like:

while (my $line = <DATA>) {
   chomp $line;
   print "$.: $line\n";   ###
   my ($x, $y, $key, $bar) = split /,/, $line;
   print "A: x=$x, y=$y, k=$key, b=$bar.\n"; ###
   if ($key =~ /[^\d]/) {
      print "BOOM: non-numeric key value <$key>\n"; ###
      # ... stuff ...
   }
   else {
      print "B: plover=...\n"; ###
      # ... other stuff ...
   }
}

Putting the prefixes on your lines lets you use grep (or equivalent) to dig through your output to find lines of interest. For example, I may run my program and check for non-numeric keys like this:

perl foobar.pl >tmp
grep -E '^BOOM:' tmp | wc -l

When I'm happy with a section of code, I tend to comment out the print statements, and when it all works as I like it, I go back and delete them (the lines marked ###).

Finally, notice that after every variable in a print statement, I have a graphic character (., >, etc.) so I can see if the string has trailing blanks, carriage returns, etc.
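As a small illustration of why the delimiters matter, the sketch below uses a made-up input line with a trailing blank and a carriage return (as a file saved on Windows might have). The variable names are invented for the example.

```perl
use strict;
use warnings;

# A line from a Windows-style file keeps its "\r" after chomp on Unix.
my $line = "M,381,REF \r\n";
chomp $line;
my ($chrom, $pos, $value) = split /,/, $line;

# Make the carriage return visible, then bracket the value when printing.
(my $shown = $value) =~ s/\r/\\r/g;
print "A: value=<$shown>\n";    ### prints: A: value=<REF \r>

# The stray characters are why an exact comparison fails.
print "equals REF? ", ($value eq 'REF' ? "yes" : "no"), "\n";    # equals REF? no
```

Without the angle brackets, the printed value would look like a plain REF and the failed comparison would be much harder to explain.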

...roboticus

When your only tool is a hammer, all problems look like your thumb.


Re^2: Can't access data stored in Hash - help!

by corcra (Initiate) on Aug 05, 2014 at 08:51 UTC

roboticus, thank you for your suggestions! I will keep these in mind from now on.


Re: Can't access data stored in Hash - help!
by Athanasius (Archbishop) on Aug 05, 2014 at 03:52 UTC

Hello corcra, and welcome to the Monastery!

You begin by reading the whole of file 2, and saving all the data you will need later when processing file 1. This is inefficient, and may be problematic if file 2 is large. A better strategy is to read the two input files together, one line at a time:

Read more... (2 kB)
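The code from this reply is collapsed in this copy of the thread, so what follows is not Athanasius's script. It is a minimal sketch of the same idea (reading both files together, one row at a time) under some loud assumptions: both files are sorted by POS in ascending order, they cover a single chromosome, they have the same number of sample columns, and the output is written as plain comma-separated fields rather than in the bracketed format. The names parse_line and read_row are made up, and in-memory filehandles stand in for the real files.

```perl
use strict;
use warnings;

# Split one line such as "['M', '16', 'T', 'C']" into bare fields.
sub parse_line {
    my ($line) = @_;
    $line =~ s/^\s*\[//;
    $line =~ s/\]\s*$//;
    my @fields = split /,/, $line;
    for (@fields) { s/^\s*'?//; s/'?\s*$//; }
    return @fields;
}

# Return the next non-blank row as an array ref, or undef at end of file.
sub read_row {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;
        chomp $line;
        return [ parse_line($line) ];
    }
    return;
}

my $data1 = <<'END';
['CHROM', 'POS', 'REF', 'ALT', 'LIST', 'SAMPLE_1A', 'SAMPLE_2A', 'SAMPLE_3A']
['M', '16', 'T', 'C', 'C', 'REF', 'C', 'REF']
['M', '381', 'T', 'A', 'A', 'T', 'REF', 'REF']
['M', '529', 'A', 'G', 'G', 'REF', 'G', 'REF']
END
my $data2 = <<'END';
['CHROM', 'POS', 'REF', 'ALT', 'LIST', 'SAMPLE_1B', 'SAMPLE_2B', 'SAMPLE_3B']
['M', '16', 'T', 'C', 'C', 'C', 'REF', 'REF']
['M', '381', 'T', 'A', 'A', 'A', 'REF', 'REF']
['M', '528', 'A', 'G', 'G', 'REF', 'REF', 'REF']
END

open my $fh1, '<', \$data1 or die "Cannot open file 1 data: $!";
open my $fh2, '<', \$data2 or die "Cannot open file 2 data: $!";

my @out;
my $header = read_row($fh1);    # file 1 header is kept as is
read_row($fh2);                 # file 2 header is discarded
push @out, join(',', @$header);

my $row2 = read_row($fh2);
while (my $row1 = read_row($fh1)) {
    # Skip file 2 positions that file 1 does not have (assumes sorted input).
    $row2 = read_row($fh2) while $row2 && $row2->[1] < $row1->[1];

    my @merged = @$row1;
    if ($row2 && $row2->[1] == $row1->[1]) {
        for my $i (5 .. $#merged) {
            next if $merged[$i] eq 'REF';                         # file 1 is REF: keep
            $merged[$i] = $row2->[$i] if $row2->[$i] ne 'REF';    # file 2 non-REF wins
        }
    }
    push @out, join(',', @merged);
}
print "$_\n" for @out;
```

Only one row from each file is held in memory at a time, which is the point of the approach when file 2 is large. If the files are not sorted by position, this sketch does not apply as written.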

Note: In the above code I’ve assumed that the data files are formatted as you’ve shown. But if (as I half suspect) they are actually formatted as proper CSV files, then you will be better served reading them with one of the modules designed for this purpose, such as Text::CSV_XS.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Can't access data stored in Hash - help!

by corcra (Initiate) on Aug 05, 2014 at 10:16 UTC

Athanasius, this was a great help! Thank you. In fact, I expect that file 2 will be quite large so this method should work better.

I have an additional question related to this code, but I don't know whether it should be posted here or as a new post. I will chance asking here and remove it if there is a problem. I want to ensure that Sample_2A, Sample_2B and Sample_2C in file 1 are each compared to Sample_2 in file 2, i.e. that columns with matching numbers are compared. I am not sure of the best way to do this, and in the code you suggested it might be difficult, since the fields are broken up line by line.


Re^3: Can't access data stored in Hash - help!

by Athanasius (Archbishop) on Aug 05, 2014 at 12:56 UTC

Hello again, corcra,

I’m glad to have been of help.

If I understand you correctly, you now want to read the data headings, say:

Field:    0            5            6            7            8
File 1: ['CHROM', ... 'SAMPLE_1A', 'SAMPLE_1B', 'SAMPLE_2A', 'SAMPLE_2B']
File 2: ['CHROM', ... 'SAMPLE_1',  'SAMPLE_2',  'SAMPLE_3']

and have the script deduce that File 1 data in fields 5 and 6 should each be compared to File 2 field 5, File 1 data in fields 7 and 8 should each be compared to File 2 field 6, and so on.

That makes the logic more complex, but I don’t know why you think this will be difficult to do line-by-line? Most of the added logic comes before the big while loop:
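The code that followed is not included in this copy of the thread. As a rough sketch of what such header logic could look like (not Athanasius's code), the example below builds a map from each file 1 sample column to its file 2 partner before any data rows are read. It assumes the headings have already been parsed into bare names, and that a file 1 heading is the file 2 heading plus a single trailing capital letter; %col2_for and %partner are made-up names.

```perl
use strict;
use warnings;

my @head1 = qw(CHROM POS REF ALT LIST SAMPLE_1A SAMPLE_1B SAMPLE_2A SAMPLE_2B);
my @head2 = qw(CHROM POS REF ALT LIST SAMPLE_1 SAMPLE_2 SAMPLE_3);

# Index of each sample heading in file 2, e.g. SAMPLE_2 => 6.
my %col2_for = map { $head2[$_] => $_ } 5 .. $#head2;

# For each sample column in file 1, find the file 2 column to compare against.
my %partner;    # file 1 field index => file 2 field index
for my $i (5 .. $#head1) {
    (my $base = $head1[$i]) =~ s/[A-Z]$//;    # SAMPLE_2A -> SAMPLE_2
    die "No column '$base' in file 2 for '$head1[$i]'\n"
        unless exists $col2_for{$base};
    $partner{$i} = $col2_for{$base};
}

print "file 1 field $_ -> file 2 field $partner{$_}\n"
    for sort { $a <=> $b } keys %partner;
```

Inside the row loop, a comparison would then use the file 1 value at index $i and the file 2 value at index $partner{$i}, so the per-line processing itself stays the same.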

