PerlMonks
Greetings, esteemed monks! Interesting problem, and several very good and interesting suggestions already.

The only thing I would add: if the kill list is big, the process might go faster if, as you find and delete the dead serial numbers from the big file, you also delete them from the kill list. That could make the tests of subsequent lines faster. I believe reasonablekeith's suggestion is similar, but it's contingent on the files being sorted. I am not sure how well this would work in practice; would you spend more time executing the code that deletes the hash key (assuming you go that route) than you would save? It would be just one extra line in the loop, executed only when a deletion actually happens.

I would NOT use a regex with more than a handful of alternations; now THAT, I am pretty sure, would be significantly slower than a simple hash key lookup.

It's also possible that the time spent sorting both lists beforehand would be less than the time it saves during processing (i.e., sorting would be a net win). An additional benefit of sorting is that you could store the kill list in an array instead of a hash, and just increment the array index whenever you delete the currently indexed serial number (or, for a more robust approach if the kill list might contain numbers that aren't in the big file, whenever the serial number read from the file is greater than the currently indexed serial number to kill).

Also, if we're talking about spending time preparing the data to make the actual update faster, the gzip idea might be of benefit, but I am less sure of that, especially if the big file is read in one line at a time.

_________________________________________________________________________________
I like computer programming because it's like Legos for the mind.

In reply to Re: 15 billion row text file and row deletes - Best Practice?
by OfficeLinebacker
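
A minimal sketch of the hash-based kill list with in-loop key deletion, as described above. The file contents and serial-number format here are made up for illustration; a real script would read the big file line by line rather than from an in-memory array:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative kill list: serials to remove from the big file.
my %kill = map { $_ => 1 } qw(1002 1005 1009);

# Stand-in for reading the big file one line at a time.
my @big_file = ("1001\n", "1002\n", "1003\n", "1005\n", "1009\n", "1010\n");
my @kept;

for my $line (@big_file) {
    my ($serial) = $line =~ /^(\d+)/;
    if (%kill && exists $kill{$serial}) {
        delete $kill{$serial};   # the one extra line: shrink the kill list
        next;                    # drop this row
    }
    push @kept, $line;
    # Once %kill is empty, no further lookup can match; a real script
    # could switch to a plain copy loop at that point.
}

print @kept;
```

The `delete` costs one hash operation per dropped row, so the question is whether the cheaper lookups on the remaining 15 billion rows outweigh it.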
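
And a sketch of the sorted-array variant: with both the big file and the kill list sorted numerically, a single incrementing index replaces the hash entirely, including the robustness tweak for kill entries that never appear in the big file. Again, the data is invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @kill = (1002, 1005, 1008, 1009);              # sorted kill list
my @rows = (1001, 1002, 1003, 1005, 1009, 1010);  # sorted big-file serials

my $k = 0;   # index of the next serial number to kill
my @kept;

for my $serial (@rows) {
    # Robustness: skip kill entries smaller than the current row
    # (e.g. 1008 here, which isn't in the big file at all).
    $k++ while $k < @kill && $kill[$k] < $serial;
    if ($k < @kill && $kill[$k] == $serial) {
        $k++;    # this kill entry is consumed; drop the row
        next;
    }
    push @kept, $serial;
}

print join(",", @kept), "\n";
```

This does one integer comparison or two per row instead of a hash lookup, at the cost of sorting both inputs first.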