
advice needed for processing largish data in multiple files

by jimbus (Friar)
on Aug 18, 2006 at 21:09 UTC ( #568264=perlquestion )

jimbus has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for advice on how to merge files when the data in the secondary files is row oriented rather than column oriented like the first. In the first file, the phone number is the key and the additional info is in columns on the same line, so it's real simple to dump into a DB; there are about 1.5 million records. The next three files each have basically two columns: the phone number and a legal value drawn from a list of about 25 values across the three files. The phone number and legal value together form the key. In the end I need a single table keyed on phone number, with all the info and options as columns.

My thought was that the first file could be parsed and inserted into MySQL pretty simply, or maybe even run through awk to print out the columns I need and bulk-loaded into the DB. For the other files I was thinking I would read each line and build a HoH with the phone number and legal value as the two keys, then update the DB, but I'm concerned about speed and memory usage with that many records.
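A minimal sketch of the HoH idea (the tab delimiter and file layout are assumptions, since the actual record format isn't shown): read each option file as phone/value pairs, collect them per phone number, then emit one row per phone with its options as extra columns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build phone => { option => 1, ... } from the row-oriented files.
# Assumes each line is "phone<TAB>option"; adjust the split to match
# the real delimiter.
sub collect_options {
    my %opts;
    for my $file (@_) {
        open my $fh, '<', $file or die "open $file: $!";
        while ( my $line = <$fh> ) {
            chomp $line;
            my ( $phone, $option ) = split /\t/, $line, 2;
            $opts{$phone}{$option} = 1;
        }
        close $fh;
    }
    return \%opts;
}

# One tab-delimited row per phone: the number, then its options sorted.
sub merged_rows {
    my ($opts) = @_;
    return map { join "\t", $_, sort keys %{ $opts->{$_} } }
        sort keys %$opts;
}
```

Note that memory grows with the number of distinct phone numbers, not with the total number of input lines, since repeated phones land in the same hash entry.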

    Your input on this is appreciated.

--Jimbus aka Jim Babcock
Wireless Data Engineer and Geek Wannabe

Replies are listed 'Best First'.
Re: advice needed for processing largish data in multiple files
by talexb (Canon) on Aug 18, 2006 at 21:23 UTC

    While it may seem goofy, I feel your best bet is to go through a database, and let it do all of the heavy lifting, or sorting, of data. You can run out of memory a lot faster than a database will run out of disk space.

    You may have to write scripts to load the data, or you may be able to get the database to do that .. I've used bcp on Ingres, and LOAD DATA INFILE in MySQL. Again, my preference would be to go with the database-provided tool, but a quick script will probably do the job too.
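The awk-style preprocessing step can be done in Perl just as easily (a sketch; the column positions, filename, and table name are all assumptions): strip the input down to just the columns the loader needs, then hand the result to MySQL's bulk loader.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# awk-style column extraction: keep only the columns MySQL needs,
# tab-delimited, ready for the bulk loader. The column positions
# passed in (e.g. 0, 2, 3) must match the real record layout.
sub extract_columns {
    my ( $line, @cols ) = @_;
    chomp $line;
    my @f = split ' ', $line;
    return join "\t", @f[@cols];
}

# Afterwards, in the mysql client (file and table names are made up):
#   LOAD DATA LOCAL INFILE 'subscribers.tsv'
#   INTO TABLE subscribers
#   FIELDS TERMINATED BY '\t';
```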

    And I highly recommend a few test runs with just 100 or so records to make sure you're doing it right before unleashing millions of records on the database.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: advice needed for processing largish data in multiple files
by BrowserUk (Pope) on Aug 18, 2006 at 21:25 UTC

    A hash with 1.5e6 keys requires around 300 MB if the value is a simple scalar, or roughly double that if the value is an array of 10 elements. Unless you are very low on memory, your initial instinct is probably a good one.
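A rough way to check that estimate on your own box (a Linux-only sketch, since it reads VmRSS from /proc): build a scaled-down hash and extrapolate the per-entry cost to 1.5 million keys.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Resident set size in KB, from /proc (Linux only).
sub vm_rss_kb {
    open my $fh, '<', '/proc/self/status' or die "no /proc: $!";
    while (<$fh>) { return $1 if /^VmRSS:\s+(\d+)\s+kB/ }
    die "VmRSS not found";
}

my $before = vm_rss_kb();
my %h;
my $n = 100_000;    # 100k keys, then scale the estimate up 15x
$h{ sprintf '555%07d', $_ } = 1 for 1 .. $n;
my $per_key = ( vm_rss_kb() - $before ) * 1024 / $n;
printf "~%.0f bytes/entry, ~%.0f MB for 1.5e6 keys\n",
    $per_key, $per_key * 1.5e6 / 2**20;
```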

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: advice needed for processing largish data in multiple files
by shmem (Chancellor) on Aug 18, 2006 at 21:56 UTC
    I'd advocate using DB_File (or BerkeleyDB) and tied hashes. That should be faster than DB insert / query processing.
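The tie interface makes that nearly a drop-in change (a sketch; DB_File needs the Berkeley DB library, and the filename and record format are arbitrary): the hash lives on disk, so 1.5 million entries never have to fit in RAM.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# Tie the hash to an on-disk B-tree; reads and writes go straight
# to the file instead of process memory.
tie my %phone, 'DB_File', 'phones.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
    or die "tie: $!";

$phone{'5551234'} = "plan_a\tSMS\tMMS";    # store one joined record
print $phone{'5551234'}, "\n";

untie %phone;
```

The same code with `%phone` as a plain in-memory hash behaves identically, which is what makes the tied version easy to retrofit.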


    _($_=" "x(1<<5)."?\n".q/)Oo.  G\        /
                                  /\_/(q    /
    ----------------------------  \__(m.====.(_("always off the crowd"))."
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      The performance difference between DB_File and DBD::SQLite is minimal. Besides, with this data volume, either will be very fast. The more important consideration is whether the programming problem is one which is naturally expressed with SQL.
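For comparison, the SQL-natural expression of the merge (a sketch assuming DBD::SQLite is installed; the table and column names are made up): a LEFT JOIN plus GROUP_CONCAT collapses the row-oriented options back into one column per phone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1 } );

$dbh->do('CREATE TABLE subscribers (phone TEXT PRIMARY KEY, plan TEXT)');
$dbh->do('CREATE TABLE options (phone TEXT, option TEXT,
          PRIMARY KEY (phone, option))');

$dbh->do(q{INSERT INTO subscribers VALUES ('5551234', 'plan_a')});
$dbh->do(q{INSERT INTO options VALUES ('5551234', 'SMS')});
$dbh->do(q{INSERT INTO options VALUES ('5551234', 'MMS')});

# One row per phone, options rolled up into a single column.
my $rows = $dbh->selectall_arrayref(q{
    SELECT s.phone, s.plan, GROUP_CONCAT(o.option) AS opts
    FROM subscribers s LEFT JOIN options o ON o.phone = s.phone
    GROUP BY s.phone
});
print join( "\t", @$_ ), "\n" for @$rows;
```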
Re: advice needed for processing largish data in multiple files
by TedPride (Priest) on Aug 19, 2006 at 07:20 UTC
    I have no idea what the OP is trying to do here; perhaps a simplified example of the input/output data would help? In general, though, if you don't have enough memory, the best thing to do is process a section of your data at a time, which in this case means handling only a smaller range of the phone numbers on each pass. Do N passes through all the files, ignoring data for phone numbers outside the current range, dump the results of each pass into a file, then merge those files into a single file. Perhaps not the most efficient way to solve the problem, but rather simple to implement.
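A sketch of that multi-pass scheme (partitioning by modulo buckets rather than literal number ranges, and assuming tab-delimited lines — both are my choices, not the poster's): each pass keeps only the phones in the current bucket, so peak memory is roughly total divided by the number of passes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assign each phone number to one of N buckets by its digits;
# pass $p processes only bucket $p.
sub bucket_of {
    my ( $phone, $n_buckets ) = @_;
    my ($digits) = $phone =~ /(\d+)/;
    return ( $digits // 0 ) % $n_buckets;
}

# One pass over the input lines: keep only this pass's bucket.
sub filter_pass {
    my ( $lines, $pass, $n_buckets ) = @_;
    return grep {
        my ($phone) = split /\t/;
        bucket_of( $phone, $n_buckets ) == $pass;
    } @$lines;
}
```

Running passes 0 .. N-1 over all the files, merging each pass's survivors, and concatenating the per-pass output gives the full result at 1/N the memory, at the cost of N full scans.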

Node Type: perlquestion [id://568264]
Approved by Arunbear