Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

First of all, what is the nature of the text file? Does it have repeated keys inside the same file?

Using hashes does consumes a lot of memory... but you can allways divide to conquer. ;-)

Suposing that you have two different files that you want to merge, you could try to reduce the file of each one looking for repeated keys and summing the numeric data.

Of course, this depends on the possibility to reduce the size of each file to an acceptable size. If this is not possible, you could consider working with slices of the file or using a database, since it will hold the data on the disc this should work. You can use any database, but DBD::DBM looks like ideal to your needs

What do you mean by saying "The cols are different in each file (but not always ...)"? This means the columns are in different places? Is easier to normalize that (put all columns and values in a previous defined position) and start working. For instance, once the files that the columns in the correct position, you can even forget about jumping over the first line: the program can easialy print the columns names in the output file later.

You're using hash references inside hashes... the more complicate structures you start using, the more memory the program will required.

Some other tips:

  1. If you're in a UNIX OS, then change the DOS new line characters (\r\n) to UNIX new lines (\n) before start processing the files. You can avoid using a regular expression and start using chomp to remove the new line
  2. Do not initialize an variable with my everytime the program enter in a loop. Initialize the variable before the loop and them, inside the loop, once the variable will not be used anymore, just clean up the value it holds. This is faster.
  3. Use Benchmark.
Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill

In reply to Re: How to deal with Huge data by glasswalk3r
in thread How to deal with Huge data by Marsel

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    [Eily]: BTW LanX, you should try typing a few random chars at the beginning of each message. This will prevent expansion :P
    [LanX]: qwiud you sthink so?
    [LanX]: zxwqbd good idea! :)
    LanX embraces his new habit spqopiwjdnq
    [ambrus]: qQUkZTmHTuKxStGT- BzTIK9gdudif7TkTLI t3mnF144UaAZjkknXY 8nN-QM19wHBsTrp5vB lEYU_Kksa7X1RIBB4x EWLD5X7SW3jGX5ryfN OMn_yL5FTdQxzjhtyX mKN9sjUCzBNHK5Rrp0 S2WMUvIb1i9aZFgjtq VR0GH1bjPMvm1G16iz hBqc1U6toPd4FbJOFj VsOeT745AN1_pO88rD SRAYKtBZwCZedESZmN mvutrOTHiSNwflB- pRfn_k
    [Eily]: so far it seems to work
    Your Mother reminds the monks they should be grateful not to share an office, lest they be subjugated to constant inanities like, "Czech please!"
    [LanX]: what's strange is that the "Cowboy you said this already" message is missing #dqiwd
    [LanX]: YM: BTW learn to mute your humanity
    [Your Mother]: Cumin? Now I want tacos...

    How do I use this? | Other CB clients
    Other Users?
    Others taking refuge in the Monastery: (17)
    As of 2017-03-27 16:53 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      Should Pluto Get Its Planethood Back?



      Results (320 votes). Check out past polls.