Re^2: Working on huge (GB sized) files

by sundialsvc4 (Monsignor)
on May 12, 2011 at 16:47 UTC


in reply to Re: Working on huge (GB sized) files
in thread Working on huge (GB sized) files

... yes, and if the number of records is so huge that the memory consumption of in-memory data structures becomes problematic (highly unlikely, these days...), you can also throw the keys into separate files, sort the two files identically using an external disk sort, and process the two streams in a single sequential pass ... a classic “merge” operation. (What worked in the days of COBOL still works today.)
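
For illustration, a minimal sketch of that final sequential pass (assuming both key files have already been externally sorted, one key per line; the file names are placeholders) might look like this:

    use strict;
    use warnings;

    # Single-pass merge of two pre-sorted key files, printing the
    # keys that appear in both streams.
    open my $fa, '<', 'keys_a.sorted' or die $!;
    open my $fb, '<', 'keys_b.sorted' or die $!;

    my ($ka, $kb) = (scalar <$fa>, scalar <$fb>);
    while (defined $ka and defined $kb) {
        chomp(my $x = $ka);
        chomp(my $y = $kb);
        if    ($x lt $y) { $ka = <$fa> }   # advance whichever
        elsif ($x gt $y) { $kb = <$fb> }   # stream is behind
        else {
            print "$x\n";                  # key present in both
            $ka = <$fa>;
            $kb = <$fb>;
        }
    }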

If the “huge files” happen to be XML files, modules such as XML::Twig are designed to work with files even of that size, without overwhelming memory.
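
For example, XML::Twig can fire a handler per element and purge what it has already processed, so memory use stays flat regardless of file size. A minimal sketch (the <record> element name and 'id' field are placeholders for whatever your document actually contains):

    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            record => sub {
                my ($t, $elt) = @_;
                print $elt->field('id'), "\n";   # do the real work here
                $t->purge;                       # discard everything parsed so far
            },
        },
    );
    $twig->parsefile('huge.xml');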


Re^3: Working on huge (GB sized) files
by BrowserUk (Pope) on May 12, 2011 at 18:40 UTC
    you can also throw the keys into separate files, sort the two files identically using an external disk sort, and process the two streams in a single sequential pass ... a classic “merge” operation. (What worked in the days of COBOL still works today.)

    "Works" yes. Efficient? No.

    That is exactly the "2 x O(NlogN) sorts + an O(N) merge" process I was decrying.

    Let's say that each file consists of 100 million records.

    1. 2 * ( 1e8 * log10( 1e8 ) ) + 1e8 = 1.7 billion operations.
    2. 3 * 1e8                           = 0.3 billion operations.

    So, theoretically, about 6 times longer (1.7 / 0.3 ≈ 5.7). And in reality, usually 15 or 20 times longer, because the two sorts themselves involve multiple passes and a merge phase of their own, which the simplistic equations above do not take into account.

    And given that many of those operations are I/O, and therefore still measured in milliseconds (little different from the '70s), whereas CPU operations are now measured in nanoseconds (rather than the tens or hundreds of microseconds they took back in the '70s), it is easy to see why it is worth expending a lot of effort (human ingenuity and CPU cycles) to avoid moving to the disk-based approach.

    If you perform the merge both ways (hash & memory vs. disk-sort & merge) on two files large enough that the former just fits within your given memory limit, then typically the latter will take close to 20 times longer. So when one more record would push you beyond the memory limit, it is worth expending considerable effort to fit that extra record in memory and avoid the 20x penalty.
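
    For concreteness, the hash & memory side of that comparison is just: load the smaller file's keys into a hash, then stream the larger file past it, one O(N) pass per file. A minimal sketch, with placeholder file names:

        use strict;
        use warnings;

        # Pass 1: load the smaller file's keys into a hash.
        open my $small, '<', 'keys_small.txt' or die $!;
        my %seen;
        while (<$small>) {
            chomp;
            $seen{$_} = undef;   # store no value; the keys are all we need
        }

        # Pass 2: stream the larger file, probing the hash.
        open my $large, '<', 'keys_large.txt' or die $!;
        while (<$large>) {
            chomp;
            print "$_\n" if exists $seen{$_};   # key present in both files
        }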

    For example, Perl's hashes are designed for very general-purpose usage, and as such are quite memory hungry. It is quite easy to come up with a hash-like structure that supports only the subset of operations required for this particular purpose whilst using substantially less memory. And even though a pure-Perl implementation would be considerably slower than Perl's built-in hashes, if it lets you fit those extra records (and fitting 10 times as many is possible), you will still have saved time by avoiding the 20x slowdown.
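
    One illustrative possibility (a sketch of the idea only, not production code): store a fixed-width digest of each key, packed end-to-end into a few long bucket strings, and support just the add/lookup subset that the merge needs. Per key this costs roughly the 16 digest bytes, versus well over 100 bytes for a general-purpose hash entry.

        package PackedSet;
        use strict;
        use warnings;
        use Digest::MD5 'md5';

        # Membership-only "hash": 16-byte MD5 digests concatenated into
        # up to 256 bucket strings, keyed by the digest's first byte.
        # A false positive requires an MD5 collision, which is negligible
        # for this purpose.
        sub new { bless { b => {} }, shift }

        sub add {
            my ($self, $key) = @_;
            my $d = md5($key);
            $self->{b}{ substr $d, 0, 1 } .= $d;
        }

        sub has {
            my ($self, $key) = @_;
            my $d = md5($key);
            my $s = $self->{b}{ substr $d, 0, 1 };
            return 0 unless defined $s;
            my $pos = 0;
            while (($pos = index $s, $d, $pos) >= 0) {
                return 1 if $pos % 16 == 0;   # match must be 16-byte aligned
                $pos++;
            }
            return 0;
        }

        # Usage (hypothetical key lists):
        #   my $set = PackedSet->new;
        #   $set->add($_) for @keys_from_file_a;
        #   print "$_\n" for grep { $set->has($_) } @keys_from_file_b;

    Lookups scan a bucket linearly, so for very large sets you would widen to more buckets (a two-byte prefix gives 65536) or keep each bucket sorted; the point is only that trading generality for memory is straightforward.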

    Those old COBOL techniques worked, and still do. But they were used back then because there was no feasible alternative, not because they were a good solution. At a pinch, when all else fails, they are a fall-back position. A last resort, not a first.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
