in reply to Re: storing hash in temporary files to save memory usage
in thread storing hash in temporary files to save memory usage
I have the following code to compare sequences across multiple large files. The code works fine with some files, but I want it to handle any number of files, however large they may be. When I tried it with more than 10GB of data, the program got killed. An example file is shown below. The counts are summed only when the second line of a set matches in both files; I want the sum for every set whose second line matches in all the files.
data1.txt
@NS500278
AGATCNGAA
+
=CCGGGCGG 1
@NS500278
TACAGNGAG
+
CCCGGGGGG 2
@NS500278
CATTGNACC
+
CCCGGGGGG 3
data2.txt
@NS500278
AGATCNGAA
+
=CCGGGCGG 1
@NS500278
CATTGNACC
+
CCCG#GGG# 2
@NS500278
TACAGNGAG
+
CC=GGG#GG 2
output:
@NS500278
AGATCNGAA
+
=CCGGGCGG 1:data1.txt.out 1:data2.txt.out count:2
@NS500278
CATTGNACC
+
CCCGGGGGG 3:data1.txt.out 2:data2.txt.out count:5
@NS500278
TACAGNGAG
+
CCCGGGGGG 2:data1.txt.out 2:data2.txt.out count:4
My code is:
#!/usr/bin/env perl
use strict;
use warnings;

my %seen;
my $file_count = @ARGV;    # capture before <> consumes @ARGV
$/ = "";                   # paragraph mode: one blank-line-separated record per read

while (<>) {
    chomp;
    my ( $key, $value ) = split /\t/, $_;
    my @lines = split /\n/, $key;
    my $key1  = $lines[1];             # the sequence line identifies the record
    $seen{$key1} //= [$key];
    push @{ $seen{$key1} }, $value;
}

for my $key1 ( sort keys %seen ) {
    my @vals = @{ $seen{$key1} };
    my $tot  = 0;
    $tot += ( split /:/, $_ )[0] for @vals[ 1 .. $#vals ];    # skip the record text
    if ( @vals > $file_count ) {       # record text plus one value per file
        print join( "\t", @vals );
        print "\tcount:$tot\n\n";
    }
}
Re^3: storing hash in temporary files to save memory usage
by BrowserUk (Patriarch) on Sep 21, 2016 at 06:20 UTC
I believe I have a solution for your problem that should work with files of any size, and will finish in less time than it would take to load your data into an RDBMS.
I need confirmation that all your keys (DNA sequences): 1) are 9 characters long? 2) consist of combinations of the 5 characters ACGTN?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
data1.txt
@NS500278
AGATCNGAA ********
+
=CCGGGCGG 1
@NS500278
TACAGNGAG ********
+
CCCGGGGGG 2
but they are 150 chars long? (Even though you've shown them as 9!?)
You didn't so much "say something wrong" as display a pattern of responses that, allied with your anonymonk status, led me to believe that any time spent attempting to help you would probably be wasted.
- First you ask a very generic question that can be answered by typing simple queries into cpan:
tie+hash, DB hash, file hash.
And no description of the application.
- When prompted you provide some details.
But these turn out to be completely unrepresentative of the actual data you are working with.
- Even the details you provided were reluctant and sketchy: "compare the sequence in multiple large files.... execute for any number of files however large ... more than 10GB data ...".
Is that 10GB per file, or across the multiple files? If the latter, how many files is "many"?
- And finally, you seem to have unrealistic expectations and be more concerned with being provided with a totally generic solution rather than looking to solve a specific problem.
Your "sample data" showed 9 character keys drawn from a 5 character alphabet. Each of those 5 characters can be represented by a 3-bit pattern and 9x3=27 bits, thus the keys can be perfectly hashed -- uniquely identified -- by a single 32-bit number. With 5**9=1953125 unique keys, you could collate your entire datasets, regardless of the number and size of files, using a single, in-memory array of just under 2 million counts. A simple, very fast, single pass, and the task is complete in an hour or two.
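To make the perfect-hash idea concrete, here is a minimal sketch under the 9-character/ACGTN assumption (the function names are mine, not from the thread). A base-5 encoding gives a dense index of 0 .. 5**9-1, so a flat array of just under 2 million counters covers every possible key; the 3-bit packing described above works just as well within 32 bits, only with gaps in the index range.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Map the 5-letter alphabet to digits 0..4.
my %digit = ( A => 0, C => 1, G => 2, T => 3, N => 4 );

# Base-5 encode a 9-char key into a dense index 0 .. 5**9-1.
sub seq2idx {
    my ($seq) = @_;
    my $n = 0;
    $n = $n * 5 + $digit{$_} for split //, uc $seq;
    return $n;
}

my @count = (0) x 5**9;    # one counter slot per possible key (1,953,125)

# Collating then becomes a single pass per file:
#   $count[ seq2idx($sequence_line) ] += $value;
printf "AGATCNGAA -> %d\n", seq2idx('AGATCNGAA');    # prints "AGATCNGAA -> 166800"
```

With this scheme the whole collation fits in one in-memory array regardless of how many files, or how large, the inputs are.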
But, you eventually identify that your keys are actually 150 characters: that would make for a 450-bit, or 57-byte, perfect hash. I.e. 5**150 ≈ 7.0064923216240853546186479164496e+104 possible keys. Let me try and make sense of that number for you: 700649232162408535461864791644958065640130970938257885878534141944895541342930300743319094181060791015625
If there are the estimated 1 billion trillion stars in the universe, and each of them had 10 earth-like planets around them, and each of those had the 7.5 quintillion grains of sand that it is estimated there are on earth, and they were each converted into a memory chip of 1GB, there would still be only 1 / 934198976216544713949153055 the amount of memory required to apply the solution I was offering -- based on the information provided to that point -- to your actual dataset.
So you see, there is little point in pursuing the solution I was offering, given the nature of your actual data.
What else can you do?
If this was a one-off process, you could try the methods identified by others in this thread.
- Sorting 10GB of data using a disk-based merge sort won't be fast, but it will eventually finish, and a single interleaved pass over your files will yield the counts you are after without requiring file-based hashes or huge memory.
Of course, if each of your "multiple files" is 10GB, and you need to sort them all ....
And if, as implied, you will need to be doing this process over and over again on many equally huge datasets, that sorting will likely become a bottleneck in your research.
- You could certainly try putting your dataset into sqllite.
But my experience of using that tool -- remarkable on small datasets -- on similar datasets of just 1GB has proved to my satisfaction that it isn't really up to the task.
And remember, it doesn't play well with concurrent access; so it takes the simple expedient of using multi-tasking -- via threads or processes -- to cut your runtimes to a quarter, an eighth, or a sixteenth, right off the table.
- Using a real RDBMS (say postgresql), would bring the multi-tasking possibility back.
But still the fundamental underlying problem of providing efficient access to such a potentially vast range of keys remains. Indexing does nothing if the index field is 150 characters. Postgres (and other serious RDBMSs) have some specialist proprietary methods for dealing with such data, but you will have to research which ones to use and how to make best use of them.
And in the end, the process of loading up and indexing/hashing large datasets in order to perform one-off collating tasks like the one you've described often takes longer, on its own, than performing that one-off task using non-RDBMS methods.
Even describing the -- essentially simple -- counting task in SQL terms of joins across multiple tables, then tweaking and re-tweaking the schema and query to encourage the RDBMS to perform it efficiently, becomes quite complicated and requires intimate knowledge of the particular RDBMS and its proprietary extensions; for a one-off process that cost is prohibitive.
And repeating it for different datasets -- with different key sizes and alphabets, different numbers of tables, perhaps variations in original data formats -- only to discard everything once the task is done ....
- I'm not sure what applicability DBD::CSV has to a dataset that consists of multi-line records with no commas or other obvious field delimiter.
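The sort-based route in the first option can be sketched concretely. This is a hypothetical illustration (the script and pipeline names are my assumptions, not from the thread): flatten each four-line record to one tab-separated line keyed by its sequence, let the system sort(1) do the disk-based merge sort, then a single pass over the sorted stream sums the counts for each run of identical sequences.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Turn one blank-line-separated record into "sequence \t count \t file".
sub flatten_record {
    my ( $record, $file ) = @_;
    my ( $hdr, $seq, $plus, $qual ) = split /\n/, $record;
    my ($count) = $qual =~ /(\d+)\s*$/;    # trailing count on the quality line
    return join "\t", $seq, $count // 0, $file;
}

# Hypothetical pipeline:
#   perl flatten.pl data*.txt | sort | perl sum_runs.pl
if (@ARGV) {
    local $/ = "";                         # paragraph mode
    while (<>) {
        chomp;
        print flatten_record( $_, $ARGV ), "\n";
    }
}
```

Because sort(1) spills to temporary files, this never needs the whole dataset in memory; the cost is the repeated full sorts if the job recurs.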
How would I attempt the task as now described?
Finally, on the basis of the information you have provided -- so far, assuming there are no further twists and turns -- the method I would probably use for your task is an extension of the one I was offering above.
You make one pass over your files, generate a perfect-hash number from the first 10 characters of each DNA sequence using the 3-bits-per-char method (10x3=30 bits, which still fits into a 32-bit integer), and against these perfect-hash values you store the fileID and byte offset of the record. This effectively bucket-sorts your dataset into 5**10=9,765,625 subsets that share the same first 10 chars.
Assuming 10GB represents the combined size of your files, and guesstimating that your individual multi-line records are around 300 chars, that gives an estimate of 36 million records in total. Splitting those across just under 10 million subsets gives an average of just 4 records per subset.
Having performed one fast pass over the dataset to find the subsets, a second pass over the subsets, fetching -- by direct random access -- each of the keys within a subset for full comparison will again be a relatively fast process as it requires only reads.
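A minimal sketch of that two-pass scheme, assuming blank-line-separated records as in the samples above (the names and layout are my illustration, not the actual 55-line program):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %digit = ( A => 0, C => 1, G => 2, T => 3, N => 4 );

# Base-5 encode the first 10 characters: a dense index 0 .. 5**10-1.
sub prefix_idx {
    my $n = 0;
    $n = $n * 5 + ( $digit{$_} // 0 ) for split //, substr( $_[0], 0, 10 );
    return $n;
}

if (@ARGV) {
    # Pass 1: bucket (fileID, byte offset) pairs by 10-char prefix.
    my @bucket;    # index -> [ [ fileID, offset ], ... ]
    for my $fid ( 0 .. $#ARGV ) {
        open my $fh, '<', $ARGV[$fid] or die "$ARGV[$fid]: $!";
        local $/ = "";                  # one record per read
        my $pos = tell $fh;
        while ( my $rec = <$fh> ) {
            my ( undef, $seq ) = split /\n/, $rec;
            push @{ $bucket[ prefix_idx($seq) ] }, [ $fid, $pos ];
            $pos = tell $fh;
        }
    }
    # Pass 2 (not shown): for each non-empty bucket, seek() back to the
    # stored offsets and compare the full 150-char keys; only records that
    # already share a 10-char prefix are ever read together.
}
```

The index holds only a fileID and an integer offset per record, so memory use is proportional to the record count, not the data size.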
The whole process should take less than a couple of hours, which is probably less time than it would take to load the dataset into a DB; and certainly less time than sorting it completely using your system's sort utility.
23/09/2016 20:18 2,737,000,000 data1.txt
23/09/2016 20:18 2,737,000,000 data2.txt
23/09/2016 20:18 2,737,000,000 data3.txt
23/09/2016 20:18 2,737,000,000 data4.txt
4 File(s) 10,948,000,000 bytes
That vaguely approximated your described data: C:\test\10gb>head data1.txt
@NS500278
GCGCCGGCGGCCATAGGGGCTGGTCAGTAAGAAGTAATGTCCCCAGCTGATTCGGATGGTGCGAATAAGGTCTTATCACTTACCCAAAGTATCTGGCTCTATATTGAGATACCGGAAGCCCCTTGTTGGTACTATGTGATCGATATTTCT
+
=G=C=TATAAA=TGGTTG==C=TAT=GT=C==TAA==CCTA=AAACTTAAG==T=TTTGTCCAGGTAGCGTC=CCA=TG=TA=CT=TT=CC=AA==TCA=ACCGTCGTGGCC=A==GAA=AT=TCAGGCC=CC=GCTAGTG=CCA===TA 1
@NS500278
GTATACCGGTTGAAACAACACGAGAACGAAAGGGTCGCGACTCCATCATAAGGCTAGAAAACCAATTGTATGGATCTGAACATGTTGTGGCGTTACGCGGAACTCCCTGGCTAAAGTGAGACGATTATCAATAGAAAGCAAACTCTACGT
+
AAAC=A=CTC=ATCAA==TGATACCGACCT=CCG=CCAG=CC=G=TC=TGGTG=GAGGTAGGT=AAC=GCACA=CC==GTGGACCCTA=C=C=CG==TA=G=CCA=G==CTGGCGTT=CCGAACAGGC===ATCGATC==TTTCTTTTTT 2
I implemented the code I described under "How would I attempt the task as now described?" in 55 lines, and ran it: C:\test\10gb>..\1172352 *.txt > combined.out
Pass one found 1048269 groups of records in 4 files in 361.736613 seconds
Total time:4826.534832 seconds
So, 1 hour 20 minutes. My "couple of hours" guesstimate wasn't too far wrong.