Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

DavidO’s suggestion seems the most-appropriate one to me ... such that “out of memory” ought never be an issue here.

If the file is known to be sorted by filename, then the entire set of records for any given name is known to be adjacent, and you only need to remember “the last name seen” and a hash to accumulate and/or to count extensions.   When the name changes, the results for the previous current-name are output, the new current-name is captured, and the hash is reset ... taking proper care to ensure that nothing is lost from either group.

As an added bonus, the output file will always emerge sorted the same way.   So, you can devise an entire processing sequence that relies upon sort-order, and every single one of them might be “very fast and virtually RAM-free.”   You can furthermore rely upon the fact that, for any gaps that you detect in the sequence, there will be no records anywhere in the file that fall within that gap.

This is a very powerful, and very old, technique that was used as far back as Herman Hollerith’s tabulation of the US Census using punched cards.   Especially when large amounts of data had to be processed on machines whose memory capacity was all but non-existent, the so-called “unit-record processing” was the only way such jobs could be done.   Even today, the assumption (and maintaining) of a certain known sort-order is a powerful technique that is ideal for an application such as this.   If you know that the file is already sorted, you don’t need much memory at all.   (If you knew that it was sorted both by name and extension ... and you should check for this, because there’s a pretty good chance that it’s so!! ... your problem might require almost-no memory at all:   the files are adjacent, and within each file, so are the extensions.)

When you are dealing with data files of this magnitude, and especially if you need to do several things to the data in successive steps, it might (or, might not) be justifiable to sort the data, just to pick-up an accumulation of efficiencies downstream.   (The only way to establish that is to empirically test it, on your own hardware and network.)   In any case, if the producer of that file gave it to you with a guaranteed sort-order, they just dealt you a huge favor.

In any case, your code must be defensive:   if your algorithm fundamentally relies upon the proposition that the data is sorted, you must not assume.   You’re looking for the key-values to change, and that the difference between old and new is “greater than.”   If it is not, your code must detect it and die ... because the results won’t be reliable and no one else is in the position to know.

Note: On a Unix-ish system, the shell command sort -c filename can be used to quickly discover, once and for all, whether the entire file is or is not sorted.   (It can also be used as a quick “sanity-check” in a script that invokes this Perl program, since it does understand space-delimited files and can be instructed to consider the entire file or only a particular key or keys.   “Sniff” the data at the start of the job, and sound the alarm before eating spoiled meat, because any human can make a mistake.)

In reply to Re: Large file, multi dimensional hash - out of memory by sundialsvc4
in thread Large file, multi dimensional hash - out of memory by Anonymous Monk

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others drinking their drinks and smoking their pipes about the Monastery: (6)
    As of 2018-03-21 22:50 GMT
    Find Nodes?
      Voting Booth?
      When I think of a mole I think of:

      Results (272 votes). Check out past polls.