Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Greetings, esteemed monks!

Interesting problem. Several very good and interesting suggestions. The only thing I would add is that if the kill list is big, it might make the process faster if, as you find and delete the dead serial numbers from the big file, you delete them from the "kill list" also. This might make tests of subsequent lines faster. I believe that reasonablekeith's suggestion is similar, but it's contingent on the files being sorted.

I am not sure how well this would work; would he spend more time adding the code to delete the hash key (assuming he goes that route) than he would save? It would just be one extra line in the loop, and only executed if a deletion happened.

I would NOT use a regex with more than a handful of alternations--now THAT I am pretty sure would be significantly slower than a simple hash key lookup.

It's possible that time spent sorting both lists beforehand would be less than the time saved by the sort (ie sorting would be a good thing). A possible additional benefit of sorting would be that you could use an array to store the kill list (as opposed to a hash) and just increment the array index whenever you delete the currently indexed serial number (or (for a more robust approach if you might have numbers in the kill list that aren't in the big file) when the serial number read from the file is greater than the currently indexed serial number to kill.

Also, if we're talking about spending time preparing the data to make the actual update process faster, the gzip idea might be of benefit, but I am less sure of that, especially if the big file is read in one line at a time.

_________________________________________________________________________________

I like computer programming because it's like Legos for the mind.


In reply to Re: 15 billion row text file and row deletes - Best Practice? by OfficeLinebacker
in thread 15 billion row text file and row deletes - Best Practice? by awohld

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (6)
As of 2024-04-19 20:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found