Re^4: write to Disk instead of RAM without using modules

by Anonymous Monk
on Oct 24, 2016 at 07:17 UTC


in reply to Re^3: write to Disk instead of RAM without using modules
in thread write to Disk instead of RAM without using modules

I have multiple FASTQ files in the following format. I want to print the total count if the second line, i.e. the sequence, matches in all files.
R1.txt

@NS500278:42:HC7M3AFXX:3:21604:26458:18476 2:N:0:AGTGGTCA
AAAAAAAAACAGATATTTGCACTAGGCATTATAAATAACATCAATTAAGTAAAAAAATTA
+
AAAAAEEEEAEEEEEEEEEE/AEEEEEEEEEEEE 1:R1.txt

R2.txt

@NS500278:42:HC7M3AFXX:3:21604:26458:18476 2:N:0:AGTGGTCA
AAAAAAAAACAGATATTTGCACTAGGCATTATAAATAACATCAATTAAGTAAAAAAATTA
+
AAAAAEEEEAEEEEEEEEEE 1:R2.txt

The output I want is:

output
@NS500278:42:HC7M3AFXX:3:21604:26458:18476 2:N:0:AGTGGTCA
AAAAAAAAACAGATATTTGCACTAGGCATTATAAATAACATCAATTAAGTAAAAAAATTA
+
AAAAAEEEEAEEEEEEEEEE/AEEEEEEEEEEEE 1:R1.txt 1:R2.txt count:2
My code is:

#!/usr/bin/env perl
use strict;
use warnings;
no warnings qw( numeric );

my %seen;
$/ = "";

while (<>) {
    chomp;
    my ($key, $value) = split( '\t', $_ );
    my @lines = split /\n/, $key;
    my $key1  = $lines[1];
    $seen{$key1} //= [ $key ];
    push( @{ $seen{$key1} }, $value );
}

foreach my $key1 ( sort keys %seen ) {
    my $tot        = 0;
    my $file_count = @ARGV;
    for my $val ( @{ $seen{$key1} } ) {
        $tot += ( split /:/, $val )[0];
    }
    if ( @{ $seen{$key1} } >= $file_count ) {
        print join( "\t", @{ $seen{$key1} } );
        print "\tcount:" . $tot . "\n\n";
    }
}
This is working well with some files, but when I compare more files it hangs. I think it is because of a memory issue. Without using any modules, I want to modify this script so that it can work with any number of files.

Replies are listed 'Best First'.
Re^5: write to Disk instead of RAM without using modules
by hippo (Bishop) on Oct 24, 2016 at 08:32 UTC
    This is working well with some files, but when I compare more files it hangs. I think it is because of a memory issue.

    Hangs or just runs more slowly? How many more files? What total size of files? How many records? How many bytes? How much RAM is available?

    I want to modify this script so that it can work with any number of files.

    So, after the second and each subsequent file is read, just rip through the structure and remove any entries where the number of matches is less than the number of files read so far. That will reduce the footprint. Of course, if your original guess is wrong and this is nothing to do with memory constraints, that will not help.

    Update (26th of October): Beware that the stipulation in the parent that "I want to print the total count if the second line, i.e. the sequence, matches in all files." now turns out not to be the case. Depending on the matches, this now sounds like an O(N**2) problem.
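    A minimal sketch of that pruning step (the %seen hash mapping each sequence to the number of files it has been found in so far, and the $files_read counter, are illustrative names, not taken from the posted code):

    # After finishing each input file, drop any sequence that has already
    # missed a file: it can no longer be present in all files.
    for my $seq ( keys %seen ) {
        delete $seen{$seq} if $seen{$seq} < $files_read;
    }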

Re^5: write to Disk instead of RAM without using modules
by Laurent_R (Canon) on Oct 24, 2016 at 17:10 UTC
    If I understand correctly, you need to load only the first file into memory (into your hash). Then you can read every other file sequentially, one after the other, and, for each record in those files, check whether the record was in the first file and add 1 to the value if it is there (and print anything to a new file if needed). After having read one file, remove the hash records which have not been updated. Then proceed the same way with the next file, and so on with all the other files.

    At the end, your hash will contain only the records which have been found in every single file.

    In brief, you only need to load the first file in memory, all the others can be read sequentially.

      I am unable to understand how to load the first file into memory and read the other files sequentially. Can you please help me with my script or give an example with a short script? Any help will be appreciated.

        The following code reads a file line by line:

        my $filename = 'some/filename.txt';
        open my $fh, '<', $filename
            or die "Couldn't read '$filename': $!";
        while (<$fh>) {
            ...
        }

        If you want to store information about a file, do so while reading it line by line.
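        Putting this together with Laurent_R's suggestion above, here is a rough sketch: load only the first file into a hash, then stream each remaining file and prune the sequences that are missing from it. It assumes plain 4-line FASTQ records and keeps only a count per sequence (not whole records); the filenames simply come from @ARGV:

        #!/usr/bin/env perl
        use strict;
        use warnings;

        my @files = @ARGV;    # e.g. R1.txt R2.txt R3.txt ...
        die "usage: $0 file1 file2 ...\n" unless @files >= 2;

        my %count;            # sequence => number of files seen in so far

        # Load only the first file into memory.
        open my $first, '<', $files[0] or die "Couldn't read '$files[0]': $!";
        while ( my $header = <$first> ) {
            my $seq = <$first>;    # the sequence is line 2 of each record
            <$first>;              # '+' line
            <$first>;              # quality line
            chomp $seq;
            $count{$seq} = 1;
        }
        close $first;

        # Stream every other file; never load a whole file into memory.
        for my $i ( 1 .. $#files ) {
            my %updated;
            open my $fh, '<', $files[$i] or die "Couldn't read '$files[$i]': $!";
            while ( my $header = <$fh> ) {
                my $seq = <$fh>;
                <$fh>;
                <$fh>;
                chomp $seq;
                $count{$seq}++ if exists $count{$seq} and not $updated{$seq}++;
            }
            close $fh;

            # Drop sequences not found in this file: they cannot be in all files.
            delete $count{$_} for grep { !$updated{$_} } keys %count;
        }

        # Whatever is left was found in every file.
        print "$_\tcount:$count{$_}\n" for sort keys %count;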

Re^5: write to Disk instead of RAM without using modules
by BrowserUk (Patriarch) on Oct 26, 2016 at 13:08 UTC

    The code you've posted is rubbish. Ie.

    1. If you need to slurp files, then you should be setting $/ = undef not $/ = "";.

      This only "works" for your files by blind luck.

    2. Having slurped the entire file into a string, you then chomp it.

      Except chomp removes the current value of $/ from the end of the string. As you have $/ = "";, this does nothing.

    3. You then do my ($key, $value) = split ('\t', $_);

      But the files do not contain any tabs, so the result is that you've copied the entire file into $key and set $value to undef.

    4. You then split the file into an array of lines in order to pick out the sequence that you use as your $key1...

      Laborious, but okay.

    5. Then you do $seen{$key1} //= [ $key ];.

      Ie. You store a string, containing the entire file contents, in an anonymous array, and store that as the value indexed by the sequence.

      Why? Why store the entire contents of all the files, when you could read them back from disk at any time?

    6. Then you do push (@{$seen{$key1}}, $value);.

      But as explained above, $value will always be undef.

      What you are doing is storing the contents of all your files, and using arrays of undefs as a mechanism to count how many of those files each sequence appears in.

      And you wonder why you are running out of space!
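    For reference, a minimal sketch of the two settings from point 1 ($file here is just a placeholder filename):

    {
        local $/;    # undef: true slurp mode, <$fh> returns the whole file
        open my $fh, '<', $file or die "Couldn't read '$file': $!";
        my $whole_file = <$fh>;
    }

    # By contrast, with $/ = "" (paragraph mode) each read returns one
    # blank-line-separated "paragraph", not the whole file.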

    And now you want to write that hash to disk, to avoid running out of memory! That's a really silly idea when most of the contents of that hash are already stored on disk in the files you are reading!

    Why not just store the name of the file and read it again when you need it? And increment an integer value for each file containing the sequence?

    That would reduce the memory requirements of your application to ~300 bytes per unique sequence, regardless of the number of files they are in. Which means that a typical 8GB system would be able to handle at least 20 million unique sequences, and any number of duplicates, without running out of memory.
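    A minimal sketch of that idea, keeping only a count and the small "1:filename" tags per unique sequence while the full records stay on disk (the 4-line record layout is an assumption about your files):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my @files = @ARGV;    # e.g. R1.txt R2.txt ...
    my %count;            # sequence => number of files it appears in
    my %tags;             # sequence => "1:R1.txt 1:R2.txt ..." (a few bytes per file)

    for my $file (@files) {
        my %seen_here;    # count each file at most once per sequence
        open my $fh, '<', $file or die "Couldn't read '$file': $!";
        while ( my $header = <$fh> ) {
            my $seq = <$fh>;    # the sequence is line 2 of each record
            <$fh>;              # '+' line
            <$fh>;              # quality line
            chomp $seq;
            next if $seen_here{$seq}++;
            $count{$seq}++;
            $tags{$seq} .= "\t1:$file";
        }
        close $fh;
    }

    for my $seq ( sort keys %count ) {
        next unless $count{$seq} == @files;    # present in every file
        print "$seq$tags{$seq}\tcount:$count{$seq}\n";
    }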

    All in all, the standard of the code you posted, and your desire to work around the self-inflicted problems it contains by writing your hash to disk -- *without using modules* -- is a pretty clear indication that you need to take a programming course or two, or find someone local to you to help you over the learning curve. Asking strangers on the internet to do your job for you isn't going to fly.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
