PerlMonks  

Re: Re: Merge Purge

by krazken (Scribe)
on Mar 22, 2002 at 15:09 UTC ( [id://153570]=note )


in reply to Re: Merge Purge
in thread Merge Purge

I would use a database for this, but the data is already in a flat file, and the program that assigns the matchkey runs on a flat file as well. So instead of spending time loading millions of records into a database, I just work on the flat file. Plus, the file is already sorted on the matchkey when it comes out of the previous program. I should probably take advantage of that: read until the matchkey changes, process that matchgroup, and then read the next one. But there are times when I /try/ to write flexible code, so that it wouldn't matter whether the file was sorted or not. I would like it to work either way. Make sense?
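
For reference, the read-until-the-matchkey-changes approach might look roughly like the sketch below. The pipe-delimited layout with the matchkey in the first field, the sample records, and the name for_each_matchgroup are all assumptions made for illustration; none of them come from the actual file or program.

```perl
use strict;
use warnings;

# Hypothetical record layout: the matchkey is the first pipe-delimited field.
# Adjust the split to the real file format.
sub for_each_matchgroup {
    my ($fh, $callback) = @_;
    my ($current_key, @group);
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;              # skip blank lines
        my ($key) = split /\|/, $line, 2;
        # A key change closes the group collected so far.
        if (defined $current_key && $key ne $current_key) {
            $callback->($current_key, [@group]);
            @group = ();
        }
        $current_key = $key;
        push @group, $line;
    }
    # Flush the final group, which no key change follows.
    $callback->($current_key, [@group]) if @group;
}

# Demo on an in-memory file handle with made-up records.
my $data = "A1|Smith\nA1|Smyth\nB7|Jones\nC3|Brown\nC3|Browne\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @summary;
for_each_matchgroup($fh, sub {
    my ($key, $records) = @_;
    push @summary, "$key:" . scalar @$records;
});
close $fh;
print join(', ', @summary), "\n";
```

Only one matchgroup is held in memory at a time. On an unsorted file the same code still runs, but a matchkey that reappears later arrives as a separate group, so the callback would have to tolerate that or the file would need sorting first.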

Re: (3): Merge Purge
by shotgunefx (Parson) on Mar 22, 2002 at 19:46 UTC
    I think with your DB_File approach, the biggest problem is one read/write for every record. I had a similar problem with a search index for 5,000,000 books. The thing took around 18 hours to finish. Taking advantage of sorting it and working only with the current record cut it down to 17 minutes.

    One thing I thought of (I don't know if someone else has done it; I couldn't find it at the time) was to subclass the DB_File tie and make a hash that wouldn't read and write on every access. It would have an intermediate cache. If you implemented caching behavior like this, it would probably speed things up by an order of magnitude when the data was fairly sorted, and it would still work about the same in the general case, all while being nice and generic.
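
A minimal sketch of that caching idea, under stated assumptions: rather than subclassing DB_File itself, the hypothetical CachedHash class below wraps any hash reference, which could be a hash tied to DB_File, and puts a small write-back cache in front of it. The class name, the flush-everything eviction policy, and the read/write counters are illustrative choices and not an existing module. Key iteration (FIRSTKEY/NEXTKEY) is left out to keep it short.

```perl
package CachedHash;
use strict;
use warnings;

# tie %h, 'CachedHash', $backend_hashref, $max_cached_keys
sub TIEHASH {
    my ($class, $backend, $max) = @_;
    return bless {
        backend => $backend,      # any hash ref, e.g. one tied to DB_File
        cache   => {},            # recently used key/value pairs
        dirty   => {},            # cached keys not yet written back
        max     => $max || 1000,
        reads   => 0,             # backend reads, for illustration
        writes  => 0,             # backend writes, for illustration
    }, $class;
}

# Cache a value, flushing first if a new key would exceed the limit.
sub _remember {
    my ($self, $key, $value) = @_;
    $self->flush
        if !exists $self->{cache}{$key}
        && keys %{ $self->{cache} } >= $self->{max};
    $self->{cache}{$key} = $value;
}

sub FETCH {
    my ($self, $key) = @_;
    return $self->{cache}{$key} if exists $self->{cache}{$key};
    $self->{reads}++;
    my $value = $self->{backend}{$key};
    $self->_remember($key, $value) if defined $value;
    return $value;
}

sub STORE {
    my ($self, $key, $value) = @_;
    $self->_remember($key, $value);
    $self->{dirty}{$key} = 1;
}

sub EXISTS {
    my ($self, $key) = @_;
    return exists $self->{cache}{$key} || exists $self->{backend}{$key};
}

sub DELETE {
    my ($self, $key) = @_;
    delete $self->{cache}{$key};
    delete $self->{dirty}{$key};
    return delete $self->{backend}{$key};
}

# Write every dirty entry to the backend, then empty the cache.
sub flush {
    my ($self) = @_;
    for my $key (keys %{ $self->{dirty} }) {
        $self->{backend}{$key} = $self->{cache}{$key};
        $self->{writes}++;
    }
    $self->{dirty} = {};
    $self->{cache} = {};
}

sub DESTROY { $_[0]->flush }

package main;

my %store;    # stand-in backend; assumption: a DB_File-tied hash in practice
tie my %count, 'CachedHash', \%store, 2;

# Mostly sorted keys: repeated hits on the same key stay in the cache.
$count{$_}++ for qw(A1 A1 A1 B7 B7 C3 C3 C3 C3);
my $cache = tied %count;
$cache->flush;
printf "backend reads: %d, backend writes: %d\n",
    $cache->{reads}, $cache->{writes};
```

With the nine sorted updates in the demo, the backend sees three reads and three writes instead of nine of each. On unsorted keys the cache would simply flush more often, which matches the "about the same in the general case" expectation above.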

    -Lee

    "To be civilized is to deny one's nature."
