|Pathologically Eclectic Rubbish Lister|
Sorting data that don't fit in memoryby jeroenes (Priest)
|on Jan 24, 2001 at 13:07 UTC||Need Help??|
jeroenes has asked for the
wisdom of the Perl Monks concerning the following question:
Dear fellow monks,
I have a design/ tactical problem for you.
Intro (skippable)I'm analysing some large stretches of data at the moment. I have been writing some scripts to fetch data from a mere 2 gigabyte of text-files, writing needed info to a 220 meg binary file, and chopped that one in two. All sweet, but than I thought, well, it must be more efficient to make a sorted file with perl, and access the data via perl from matlab than to let matlab sort 'm. Just because the matrix is too large to fit in real memory, and I'm not allowed to buy more megs for my box.
ProblemSo I want to sort some 105 megs of binary data on a 16 bits integer. Record length is 8 bytes ('SLS'). When I just read it in, perl dies with 'Out of memory'. Rather strange, because I only have 13 M items, and 256M physical memory and 128 M swap. I guess the sort expands the int's to 8 byte floats, but even that should fit. However, I just accept that.
Solution 1: Changing the perl commandsHere is the code I use to sort, I hope you can pinpoint me a problem:
Solution 2: Sort in fileThis could be very inefficient, because you will have to read and write 100 meg chunks many times again.
Solution 2: Seek in fileMaybe I could use Search::binary to find ints of a certain value, and repeat that for all values of int in the file.
Solution 3: Use known distribution of intsI know how by approximation how the ints are distributed. So I can start a brand new file, fill in the expected boundaries, and move them as seems fit.
Solution 4: External module/ (C) libraryMaybe I'd better use some other library to sort my ints. They could be more memory efficient. But I'm really lost here. Couldn't find any reference to binary-sorting in CPAN or at perlmonks.org.
I hope you can appreciate this problem, and I'm curious to any idea's, suggestions, etc. Please point me some nice direction!
In high regard of you all,