Re^4: write to Disk instead of RAM without using modules

by Anonymous Monk
on Oct 24, 2016 at 07:17 UTC


in reply to Re^3: write to Disk instead of RAM without using modules
in thread write to Disk instead of RAM without using modules

I have multiple FASTQ files in the following format. I want to print the total count if the second line, i.e. the sequence, matches in all files.
R1.txt

@NS500278:42:HC7M3AFXX:3:21604:26458:18476 2:N:0:AGTGGTCA
AAAAAAAAACAGATATTTGCACTAGGCATTATAAATAACATCAATTAAGTAAAAAAATTA
+
AAAAAEEEEAEEEEEEEEEE/AEEEEEEEEEEEE 1:R1.txt

R2.txt

@NS500278:42:HC7M3AFXX:3:21604:26458:18476 2:N:0:AGTGGTCA
AAAAAAAAACAGATATTTGCACTAGGCATTATAAATAACATCAATTAAGTAAAAAAATTA
+
AAAAAEEEEAEEEEEEEEEE 1:R2.txt

The output I want is:

output
@NS500278:42:HC7M3AFXX:3:21604:26458:18476 2:N:0:AGTGGTCA
AAAAAAAAACAGATATTTGCACTAGGCATTATAAATAACATCAATTAAGTAAAAAAATTA
+
AAAAAEEEEAEEEEEEEEEE/AEEEEEEEEEEEE 1:R1.txt 1:R2.txt count:2
My code is:

#!/usr/bin/env perl
use strict;
use warnings;
no warnings qw( numeric );

my %seen;
$/ = "";

while (<>) {
    chomp;
    my ($key, $value) = split( '\t', $_ );
    my @lines = split /\n/, $key;
    my $key1  = $lines[1];
    $seen{$key1} //= [ $key ];
    push( @{ $seen{$key1} }, $value );
}

foreach my $key1 ( sort keys %seen ) {
    my $tot        = 0;
    my $file_count = @ARGV;
    for my $val ( @{ $seen{$key1} } ) {
        $tot += ( split /:/, $val )[0];
    }
    if ( @{ $seen{$key1} } >= $file_count ) {
        print join( "\t", @{ $seen{$key1} } );
        print "\tcount:" . $tot . "\n\n";
    }
}
This is working well with some files, but when I compare more files it hangs. I think it is because of a memory issue. Without using any modules, I want to modify this script so that it can work with any number of files.

Replies are listed 'Best First'.
Re^5: write to Disk instead of RAM without using modules
by hippo (Bishop) on Oct 24, 2016 at 08:32 UTC
    This is working well with some files, but when I compare more files it hangs. I think it is because of a memory issue.

    Hangs or just runs more slowly? How many more files? What total size of files? How many records? How many bytes? How much RAM is available?

    I want to modify this script so that it can work with any number of files.

    So, after the second and each subsequent file is read, just rip through the structure and remove any entries where the number of matches is less than the number of files read so far. That will reduce the footprint. Of course, if your original guess is wrong and this is nothing to do with memory constraints, that will not help.

    Update (26th of October): Beware that the stipulation in the parent that "I want to print the total count if the second line, i.e. the sequence, matches in all files." now turns out not to be the case. Depending on the matches, this now sounds like an O(N**2) problem.
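    A minimal sketch of that pruning step (the %seen hash mapping each sequence to the number of files it has been found in so far, and the $files_read counter, are illustrative names, not taken from the posted code):

    # After finishing each input file, drop any sequence that has already
    # missed a file: it can no longer be present in all files.
    for my $seq ( keys %seen ) {
        delete $seen{$seq} if $seen{$seq} < $files_read;
    }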

Re^5: write to Disk instead of RAM without using modules
by Laurent_R (Canon) on Oct 24, 2016 at 17:10 UTC
    If I understand correctly, you need to load only the first file into memory (into your hash). Then you can read every other file sequentially, one after the other, and, for each record in those files, check whether the record was in the first file and add 1 to the value if it is there (and print anything to a new file if needed). After having read one file, remove the hash records which have not been updated. Then proceed the same way with the next file, and so on with all the other files.

    At the end, your hash will contain only the records which have been found in every single file.

    In brief, you only need to load the first file in memory, all the others can be read sequentially.

      I am unable to understand how to load the first file into memory and read the other files sequentially. Can you please help me with my script or give an example with a short script? Any help will be appreciated.

        The following code reads a file line by line:

        my $filename = 'some/filename.txt';
        open my $fh, '<', $filename
            or die "Couldn't read '$filename': $!";
        while (<$fh>) {
            ...
        }

        If you want to store information about a file, do so while reading it line by line.
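        Putting this together with Laurent_R's suggestion above, here is a rough sketch: load only the first file into a hash, then stream each remaining file and prune the sequences that are missing from it. It assumes plain 4-line FASTQ records and keeps only a count per sequence (not whole records); the filenames simply come from @ARGV:

        #!/usr/bin/env perl
        use strict;
        use warnings;

        my @files = @ARGV;    # e.g. R1.txt R2.txt R3.txt ...
        die "usage: $0 file1 file2 ...\n" unless @files >= 2;

        my %count;            # sequence => number of files seen in so far

        # Load only the first file into memory.
        open my $first, '<', $files[0] or die "Couldn't read '$files[0]': $!";
        while ( my $header = <$first> ) {
            my $seq = <$first>;    # the sequence is line 2 of each record
            <$first>;              # '+' line
            <$first>;              # quality line
            chomp $seq;
            $count{$seq} = 1;
        }
        close $first;

        # Stream every other file; never load a whole file into memory.
        for my $i ( 1 .. $#files ) {
            my %updated;
            open my $fh, '<', $files[$i] or die "Couldn't read '$files[$i]': $!";
            while ( my $header = <$fh> ) {
                my $seq = <$fh>;
                <$fh>;
                <$fh>;
                chomp $seq;
                $count{$seq}++ if exists $count{$seq} and not $updated{$seq}++;
            }
            close $fh;

            # Drop sequences not found in this file: they cannot be in all files.
            delete $count{$_} for grep { !$updated{$_} } keys %count;
        }

        # Whatever is left was found in every file.
        print "$_\tcount:$count{$_}\n" for sort keys %count;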

Re^5: write to Disk instead of RAM without using modules
by BrowserUk (Patriarch) on Oct 26, 2016 at 13:08 UTC

    The code you've posted is rubbish. Ie.

    1. If you need to slurp files, then you should be setting $/ = undef not $/ = "";.

      This only "works" for your files by blind luck.

    2. Having slurped the entire file into a string, you then chomp it.

      Except chomp removes the current value of $/ from the end of the string. As you have $/ = "";, this does nothing.

    3. You then do my ($key, $value) = split ('\t', $_);

      But the files do not contain any tabs, so the result is that you've copied the entire file into $key and set $value to undef.

    4. You then split the file into an array of lines in order to pick out the sequence that you use as your $key1...

      Laborious, but okay.

    5. Then you do $seen{$key1} //= [ $key ];.

      Ie. You store a string, containing the entire file contents, in an anonymous array, and store that as the value indexed by the sequence.

      Why? Why store the entire contents of all the files, when you could read them back from disk at any time?

    6. Then you do push (@{$seen{$key1}}, $value);.

      But as explained above, $value will always be undef.

      What you are doing is storing the contents of all your files, and using arrays of undefs as a mechanism to count how many of those files each sequence appears in.

      And you wonder why you are running out of space!
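    For reference, a minimal sketch of the two settings from point 1 ($file here is just a placeholder filename):

    {
        local $/;    # undef: true slurp mode, <$fh> returns the whole file
        open my $fh, '<', $file or die "Couldn't read '$file': $!";
        my $whole_file = <$fh>;
    }

    # By contrast, with $/ = "" (paragraph mode) each read returns one
    # blank-line-separated "paragraph", not the whole file.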

    And now you want to write that hash to disk, to avoid running out of memory! That's a really silly idea when most of the contents of that hash are already stored on disk in the files you are reading!

    Why not just store the name of the file and read it again when you need it? And increment an integer value for each file containing the sequence?

    That would reduce the memory requirements of your application to ~300 bytes per unique sequence, regardless of the number of files they are in. Which means that a typical 8GB system would be able to handle at least 20 million unique sequences, and any number of duplicates, without running out of memory.
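    A minimal sketch of that idea, keeping only a count and the small "1:filename" tags per unique sequence while the full records stay on disk (the 4-line record layout is an assumption about your files):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my @files = @ARGV;    # e.g. R1.txt R2.txt ...
    my %count;            # sequence => number of files it appears in
    my %tags;             # sequence => "1:R1.txt 1:R2.txt ..." (a few bytes per file)

    for my $file (@files) {
        my %seen_here;    # count each file at most once per sequence
        open my $fh, '<', $file or die "Couldn't read '$file': $!";
        while ( my $header = <$fh> ) {
            my $seq = <$fh>;    # the sequence is line 2 of each record
            <$fh>;              # '+' line
            <$fh>;              # quality line
            chomp $seq;
            next if $seen_here{$seq}++;
            $count{$seq}++;
            $tags{$seq} .= "\t1:$file";
        }
        close $fh;
    }

    for my $seq ( sort keys %count ) {
        next unless $count{$seq} == @files;    # present in every file
        print "$seq$tags{$seq}\tcount:$count{$seq}\n";
    }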

    All in all, the standard of the code you posted, and your desire to work around the self-inflicted problems it contains by writing your hash to disk -- *without using modules* -- is a pretty clear indication that you need to take a programming course or two, or find someone local to you to help you over the learning curve. Asking strangers on the internet to do your job for you isn't going to fly.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
