
Re^2: storing hash in temporary files to save memory usage

by Anonymous Monk
on Sep 21, 2016 at 05:46 UTC


in reply to Re: storing hash in temporary files to save memory usage
in thread storing hash in temporary files to save memory usage

I have the following code to compare the sequences in multiple large files. The code below works fine with some files, but I want it to run for any number of files, however large they may be. When I tried running it with more than 10GB of data, the program was killed. An example of a file is shown below. The count is summed only if the second line of each set matches in both files; I want the output to give the sum for each set whose second line matches in all the files.

data1.txt

@NS500278
AGATCNGAA
+
=CCGGGCGG 1

@NS500278
TACAGNGAG
+
CCCGGGGGG 2

@NS500278
CATTGNACC
+
CCCGGGGGG 3

data2.txt

@NS500278
AGATCNGAA
+
=CCGGGCGG 1

@NS500278
CATTGNACC
+
CCCG#GGG# 2

@NS500278
TACAGNGAG
+
CC=GGG#GG 2

output:

@NS500278
AGATCNGAA
+
=GGGGGCCG 1:data1.txt.out 1:data2.txt.out count:2

@NS500278
CATTGNACC
+
CCCGGGGGG 3:data1.txt.out 2:data2.txt.out count:5

@NS500278
TACAGNGAG
+
CCCGGGGGG 2:data1.txt.out 2:data2.txt.out count:4
My code is:
#!/usr/bin/env perl
use strict;
use warnings;

my %seen;
my $file_count = @ARGV;    # capture now: reading <> empties @ARGV as the files are consumed

$/ = "";                   # paragraph mode: records are separated by blank lines
while (<>) {
    chomp;
    my ( $key, $value ) = split /\t/, $_;   # record text, then the trailing "count:file" value
    my @lines = split /\n/, $key;
    my $key1  = $lines[1];                  # the second line (the sequence) is the match key
    $seen{$key1} //= [$key];
    push @{ $seen{$key1} }, $value;
}

foreach my $key1 ( sort keys %seen ) {
    my $tot = 0;
    for my $val ( @{ $seen{$key1} } ) {
        $tot += ( split /:/, $val )[0];
    }
    # the first element is the record text, so a key seen in every file has $file_count+1 entries
    if ( @{ $seen{$key1} } > $file_count ) {
        print join( "\t", @{ $seen{$key1} } );
        print "\tcount:" . $tot . "\n\n";
    }
}

Re^3: storing hash in temporary files to save memory usage
by BrowserUk (Patriarch) on Sep 21, 2016 at 06:20 UTC

    I believe I have a solution for your problem that should work with any size files and will finish in less time than it will take to load your data into an RDBMS.

    I need confirmation that all your keys (dna sequences) 1) are 9 characters long? 2) consist of combinations of the 5 characters: ACGTN?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.

      All DNA sequences are of equal length, but they are 150 characters long. Sometimes we may have data with more characters, but all of it will be the same length.

      Yes, they consist of combinations of AGTCN.

        The indicated lines below are your keys for matching:

        data1.txt
        @NS500278
        AGATCNGAA   ********
        +
        =CCGGGCGG 1
        @NS500278
        TACAGNGAG   ********
        +
        CCCGGGGGG 2

        but they are 150 chars long? (Even though you've shown them as 9!?)



      I am really very sorry if I said something wrong. I only mentioned that the length should not be fixed because you asked me about the length of the DNA sequences. My problem is to find a way to reduce memory consumption so that my code works with multiple large files and does not get killed. Any help will be appreciated.

        You didn't so much "say something wrong" as display a pattern of responses that, when allied with your anonymonk status, led me to believe that any time expended attempting to help you would probably be wasted.

        1. First you ask a very generic question that can be answered by typing simple queries into cpan: tie+hash, DB hash, file hash.

          And no description of the application.

        2. When prompted you provide some details.

          But these turn out to be completely unrepresentative of the actual data you are working with.

        3. Even the details you provided are reluctant & sketchy: "compare the sequence in multiple large files.... execute for any number of files however large ... more than 10GB data ...".

          Is that 10GB per file, or across the multiple files? If the latter, how many files is "many"?

        4. And finally, you seem to have unrealistic expectations, and to be more concerned with being provided a totally generic solution than with solving your specific problem.

        Your "sample data" showed 9 character keys drawn from a 5 character alphabet. Each of those 5 characters can be represented by a 3-bit pattern and 9x3=27 bits, thus the keys can be perfectly hashed -- uniquely identified -- by a single 32-bit number. With 5**9=1953125 unique keys, you could collate your entire datasets, regardless of the number and size of files, using a single, in-memory array of just under 2 million counts. A simple, very fast, single pass, and the task is complete in an hour or two.

        But you eventually identify that your keys are actually 150 characters: that would make for a 450-bit, or 57-byte, perfect hash -- i.e. 5**150, roughly 7.0e+104 possible keys. Let me try and make sense of that number for you: 700649232162408535461864791644958065640130970938257885878534141944895541342930300743319094181060791015625
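        (A quick way to see where that figure comes from -- a sketch, not part of the original thread -- is to compute 5**150 exactly with the core Math::BigInt module:)

        #!/usr/bin/env perl
        # Print the exact number of possible 150-character ACGTN keys: 5**150.
        use strict;
        use warnings;
        use Math::BigInt;

        my $keys = Math::BigInt->new(5)->bpow(150);
        print "possible keys: $keys\n";        # a 105-digit number, ~7.0e+104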

        If there are the estimated 1 billion trillion stars in the universe, and each of them had 10 earth-like planets around it, and each of those had the 7.5 quintillion grains of sand that it is estimated there are on earth, and each grain were converted into a 1GB memory chip, there would still be only something like 1 part in 10**55 of the amount of memory required to apply the solution I was offering -- based on the information provided to that point -- to your actual dataset.

        So you see, there is little point in pursuing the solution I was offering, given the nature of your actual data.

        What else can you do?

        If this was a one-off process, you could try the methods identified by others in this thread.

        • Sorting 10GB of data using a disk-based merge sort won't be fast, but it will eventually finish, and a single pass over the sorted, interleaved data will then yield the counts you are after without requiring file-based hashes or huge amounts of memory (see the sketch after this list).

          Of course, if each of your "multiple files" is 10GB, and you need to sort them all ....

          And if, as implied, you will need to be doing this process over and over again on many equally huge datasets, that sorting will likely become a bottleneck in your research.

        • You could certainly try putting your dataset into SQLite.

          But my experiences of using that -- remarkable, on small datasets -- tool on similar datasets of just 1GB have proved to my satisfaction that it isn't really up to the task.

          And remember, it doesn't play well with concurrent access; so it takes the simple expedient of using multi-tasking -- via threads or processes -- to quarter or eighth or 16th your runtimes, right off the table.

        • Using a real RDBMS (say PostgreSQL) would bring the multi-tasking possibility back.

          But still the fundamental underlying problem of providing efficient access to such a potentially vast range of keys remains. Indexing does nothing if the index field is 150 characters. Postgres (and other serious RDBMSs) have some specialist proprietary methods for dealing with such data, but you will have to research which ones to use and how to make best use of them.

          And in the end, the process of loading up and indexing/hashing large datasets, in order to perform one-off collating tasks like the one you've described, often takes longer on its own than performing that one-off task using non-RDBMS methods.

          Even the process of describing this essentially simple counting task in SQL terms -- joins across multiple tables -- and then tweaking and re-tweaking the schema and query to encourage the RDBMS to perform that process efficiently, becomes quite complicated and requires intimate knowledge of the particular RDBMS and its proprietary extensions; for a one-off process, that becomes prohibitive.

          And repeating it for different datasets -- with different key sizes and alphabets, different numbers of tables, perhaps variations in original data formats -- only to discard everything once the task is done ....

        • I'm not sure what applicability DBD::CSV has to a dataset that consists of multi-line records with no commas or other obvious field delimiter.
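        To make the sort-then-scan option from the first bullet above concrete, here is a minimal sketch (illustrative only; it assumes the records have already been flattened to tab-separated "sequence<TAB>count<TAB>filename" lines and sorted by sequence, e.g. with the system sort utility):

        #!/usr/bin/env perl
        # One pass over sorted lines: sum the counts for each run of identical sequences,
        # and report only sequences that appeared in at least $nfiles files.
        use strict;
        use warnings;

        my $nfiles = 2;                        # hypothetical: how many input files a sequence must appear in
        my ( $prev, $sum, %files ) = ( '', 0 );

        while (<>) {
            chomp;
            my ( $seq, $count, $file ) = split /\t/;
            if ( $seq ne $prev ) {
                print "$prev\tcount:$sum\n" if $prev ne '' and keys %files >= $nfiles;
                ( $prev, $sum, %files ) = ( $seq, 0 );
            }
            $sum += $count;
            $files{$file}++;
        }
        print "$prev\tcount:$sum\n" if $prev ne '' and keys %files >= $nfiles;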

        How would I attempt the task as now described?

        Finally, on the basis of the information you have provided -- so far, assuming there are no further twists and turns -- the method I would probably use for your task is an extension of the one I was offering above.

        You do one pass over your files, generate a perfect hash number from the first 10 characters of each DNA sequence using the 3-bits-per-char method (10x3=30, so it still fits into a 32-bit integer), and against these perfect hash values you store the file ID and byte offset of the record. This effectively insertion-sorts your dataset into 5**10=9,765,625 subsets that share the same first 10 chars.

        Assuming 10GB represents the combined size of your files, and guesstimating that your individual multi-line records are 300 chars, that gives an estimate of 36 million total records. Splitting that across just under 10 million subsets gives an average of just 4 records per subset.

        Having performed one fast pass over the dataset to find the subsets, a second pass over the subsets -- fetching each of the keys within a subset by direct random access for full comparison -- will again be relatively fast, as it requires only reads.
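        A rough sketch of that first pass (illustrative only, not the 55-line script mentioned below; it assumes blank-line-separated records with the sequence on the second line, as in the sample data):

        #!/usr/bin/env perl
        # Pass one: bucket every record by a hash of the first 10 sequence characters,
        # remembering only "fileID:byte-offset" so pass two can seek back to the record.
        use strict;
        use warnings;

        my %code = ( A => 0, C => 1, G => 2, T => 3, N => 4 );
        my %bucket;                                   # prefix hash => [ "fid:offset", ... ]

        for my $fid ( 0 .. $#ARGV ) {
            open my $fh, '<', $ARGV[$fid] or die "$ARGV[$fid]: $!";
            local $/ = "";                            # paragraph mode: blank-line-separated records
            my $offset = 0;
            while ( my $rec = <$fh> ) {
                my $seq = ( split /\n/, $rec )[1];    # second line is the sequence
                my $idx = 0;
                $idx = $idx * 5 + $code{$_} for split //, substr( $seq, 0, 10 );
                push @{ $bucket{$idx} }, "$fid:$offset";
                $offset = tell $fh;                   # start of the next record
            }
        }

        # Pass two (not shown): for each bucket, seek() to the stored offsets, compare
        # the full 150-char sequences, and sum the trailing counts of exact matches.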

        The whole process should take less than a couple of hours, which is probably less time than it would take to load the dataset into a DB; and certainly less time than sorting it completely using your system's sort utility.



        For my own satisfaction, I generated just over 10GB of data spread across 4 files:

        23/09/2016  20:18     2,737,000,000 data1.txt
        23/09/2016  20:18     2,737,000,000 data2.txt
        23/09/2016  20:18     2,737,000,000 data3.txt
        23/09/2016  20:18     2,737,000,000 data4.txt
                       4 File(s) 10,948,000,000 bytes

        That vaguely approximated your described data:

        C:\test\10gb>head data1.txt
        @NS500278
        GCGCCGGCGGCCATAGGGGCTGGTCAGTAAGAAGTAATGTCCCCAGCTGATTCGGATGGTGCGAATAAGGTCTTATCACTTACCCAAAGTATCTGGCTCTATATTGAGATACCGGAAGCCCCTTGTTGGTACTATGTGATCGATATTTCT
        +
        =G=C=TATAAA=TGGTTG==C=TAT=GT=C==TAA==CCTA=AAACTTAAG==T=TTTGTCCAGGTAGCGTC=CCA=TG=TA=CT=TT=CC=AA==TCA=ACCGTCGTGGCC=A==GAA=AT=TCAGGCC=CC=GCTAGTG=CCA===TA 1

        @NS500278
        GTATACCGGTTGAAACAACACGAGAACGAAAGGGTCGCGACTCCATCATAAGGCTAGAAAACCAATTGTATGGATCTGAACATGTTGTGGCGTTACGCGGAACTCCCTGGCTAAAGTGAGACGATTATCAATAGAAAGCAAACTCTACGT
        +
        AAAC=A=CTC=ATCAA==TGATACCGACCT=CCG=CCAG=CC=G=TC=TGGTG=GAGGTAGGT=AAC=GCACA=CC==GTGGACCCTA=C=C=CG==TA=G=CCA=G==CTGGCGTT=CCGAACAGGC===ATCGATC==TTTCTTTTTT 2

        I implemented the code I described under "How would I attempt the task as now described?" in 55 lines, and ran it:

        C:\test\10gb>..\1172352 *.txt > combined.out
        Pass one found 1048269 groups of records in 4 files in 361.736613 seconds
        Total time:4826.534832 seconds

        So 1 hour 20 minutes. My "couple of hours" guesstimate wasn't too far wrong.


