Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Yeah, I had a similar problem recently: first remove duplicates from two files A and B, and then comparing the two files to output four files: data in A and not in B, data in B and not in A and two files with common data (well records having the same key from each file). Often, both operations can be done very efficiently using hashes (or using CPAN modules that use hashes). But my specific problem here is that the files were simply too big to fit in memory (about 15 gigabytes each), even only storing only the comparison keys did not work.

I tried various solutions using the various tied hashes and DBM modules, but it turned out that loading the data into DBM files was awfully slow.

I finally decided to sort the files (using the Unix utility) and process sequentially each file to remove the duplicates and then reading the two files in parallel to detect the "orphan" records and the common records. This ended up to be quite efficient (about 30 minutes for sorting the data and removing duplicates, and another 20 minutes to dispatch the records where they should go. I have of course no idea whether the same can apply to the OP's requirement.

This is getting slightly off-topic, but I actually made (or started to make) a relatively generic module to do this, because this is something that we have to do quite frequently, but each time with different data format (although usually CSV type of format) and different comparison keys. This module is still under development, as I am adding new functionalities into it (like, once I have the "common" files, i.e. records with the same comparison key, comparing the data). The good thing with this approach is that the files need to be sorted only once; once they are sorted, all the various operations can be done one after the other, the records remain sorted. Once this module is finished, I am hoping to make is available on the CPAN, but I will first need to learn how to build a CPAN distribution, as I have never done that yet (even though I am using and testing this module at work, I am developing it entirely at home on my free time, so that I can be the only owner of the code and so that my company cannot object to open source distribution).

Now, reading this thread, I realize that it could be that an SQLlite or MySQL solution might work as well or possibly even better. But then, I am not sure that it will be easy to obtain the installation of these products on our production servers (especially that I cannot prove that this would be useful without first trying).


In reply to Re^2: Parallel::Forkmanager and large hash, running out of memory by Laurent_R
in thread Parallel::Forkmanager and large hash, running out of memory by mabossert

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (5)
As of 2024-04-19 04:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found