PerlMonks  

Re: Needed Performance improvement in reading and fetching from a file

by aufflick (Deacon)
on Oct 11, 2008 at 10:09 UTC [id://716579]


in reply to Needed Performance improvement in reading and fetching from a file

So what you are basically doing is:

1. Checking if column 2 has been seen already - if so, next line
2. Else, do some processing on the row and record the value of column 2 as seen

This is a pretty common task and can be super fast. As already pointed out, the easiest win is to use a hash to record which column-2 values have been seen; that makes each duplicate check roughly O(1) instead of the O(N) scan you get from searching a list.
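A minimal sketch of that hash-based check (the tab delimiter and the "just print the row" processing are assumptions; adapt the `split` and the body to your data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;    # column-2 values we have already handled
while ( my $line = <STDIN> ) {
    chomp $line;
    my @cols = split /\t/, $line;    # assuming tab-separated columns
    next if $seen{ $cols[1] }++;     # O(1) hash lookup on column 2
    print "$line\n";                 # do your real processing here
}
```

`$seen{ $cols[1] }++` both tests and records the value in one step: it returns the old count (false on first sight), then increments it.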

20k isn't a lot - this is all you should have to do. If you find yourself dealing with a LOT of records (say, half a million), you can get really cheap use of multiple CPUs/cores (assuming you have them) by writing two scripts: the first strips out all lines with duplicated col 2 values, so the second can skip that step. Pipe the output of one script into the input of the other and Unix will run the two processes in parallel for you - assuming you are on a Unix OS, that is. Something like:

cat the_file.txt | remove_duplicates.pl | process_data.pl
