Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

First of all, what is the nature of the text file? Does it have repeated keys inside the same file?

Using hashes does consumes a lot of memory... but you can allways divide to conquer. ;-)

Suposing that you have two different files that you want to merge, you could try to reduce the file of each one looking for repeated keys and summing the numeric data.

Of course, this depends on the possibility to reduce the size of each file to an acceptable size. If this is not possible, you could consider working with slices of the file or using a database, since it will hold the data on the disc this should work. You can use any database, but DBD::DBM looks like ideal to your needs

What do you mean by saying "The cols are different in each file (but not always ...)"? This means the columns are in different places? Is easier to normalize that (put all columns and values in a previous defined position) and start working. For instance, once the files that the columns in the correct position, you can even forget about jumping over the first line: the program can easialy print the columns names in the output file later.

You're using hash references inside hashes... the more complicate structures you start using, the more memory the program will required.

Some other tips:

  1. If you're in a UNIX OS, then change the DOS new line characters (\r\n) to UNIX new lines (\n) before start processing the files. You can avoid using a regular expression and start using chomp to remove the new line
  2. Do not initialize an variable with my everytime the program enter in a loop. Initialize the variable before the loop and them, inside the loop, once the variable will not be used anymore, just clean up the value it holds. This is faster.
  3. Use Benchmark.
Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill

In reply to Re: How to deal with Huge data by glasswalk3r
in thread How to deal with Huge data by Marsel

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chilling in the Monastery: (4)
    As of 2014-07-26 09:13 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite superfluous repetitious redundant duplicative phrase is:









      Results (175 votes), past polls