PerlMonks  

Re^2: How to remove duplicates from a large set of keys

by nite_man (Deacon)
on Feb 10, 2005 at 08:44 UTC ( #429624 )


in reply to Re: How to remove duplicates from a large set of keys
in thread How to remove duplicates from a large set of keys

Thanks for your reply, Corion.

In the end, you will still need to have all keys in memory, or at least accessible
Why? In the case of a database, I can just try to insert the new value. If that value already exists in the table, I'll get a 'Cannot insert a duplicate value' exception; otherwise, the new value is inserted into the table.
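The insert-and-catch-the-exception approach described above can be sketched roughly as follows. This is only an illustration, not the poster's actual code: the table name `seen`, the column `k`, and the choice of an in-memory SQLite database (assuming DBD::SQLite is installed) are all assumptions made to keep the sketch self-contained; any RDBMS with a UNIQUE or PRIMARY KEY constraint behaves the same way.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database for illustration only (assumes DBD::SQLite).
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, PrintError => 0 });

# The PRIMARY KEY constraint is what rejects duplicates for us.
$dbh->do('CREATE TABLE seen (k TEXT PRIMARY KEY)');

my $ins = $dbh->prepare('INSERT INTO seen (k) VALUES (?)');

sub is_new {
    my ($key) = @_;
    # A duplicate insert dies (RaiseError), which we trap with eval:
    # success means the key was new, failure means it was a duplicate.
    return eval { $ins->execute($key); 1 } ? 1 : 0;
}

print is_new('perl') ? "new\n" : "dup\n";   # prints "new"
print is_new('perl') ? "new\n" : "dup\n";   # prints "dup"
```

Note that letting the constraint do the check avoids a separate SELECT-then-INSERT round trip, and it is race-free when several clients write to the same table.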

a million keys shouldn't eat too much memory
The most important criterion for me is the speed of processing new values. I haven't tried the database approach yet, but with a hash, processing one value takes about 40 seconds with 1 million hash keys. And the number of keys keeps growing, so the time grows too.
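For comparison, the in-memory hash approach the thread is discussing looks something like this. The names `%seen` and `is_new` are illustrative, not from the original post; the point is that `exists` on a hash does not get slower as the hash grows.

```perl
use strict;
use warnings;

# Duplicate check with a hash: exists() is O(1) on average,
# independent of how many keys are already stored.
my %seen;

sub is_new {
    my ($key) = @_;
    return 0 if exists $seen{$key};
    $seen{$key} = 1;
    return 1;
}

print is_new('foo') ? "new\n" : "dup\n";   # prints "new"
print is_new('foo') ? "new\n" : "dup\n";   # prints "dup"
```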

---
Michael Stepanov aka nite_man

It's only my opinion and it doesn't have pretensions of absoluteness!


Replies are listed 'Best First'.
Re^3: How to remove duplicates from a large set of keys
by Tanktalus (Canon) on Feb 10, 2005 at 15:29 UTC

    Whether you have your million records in memory (fast) or on disk in a database (slow), you have to take the time to insert your new data. Looking up existing data is different: as explained, looking up in a hash is O(1). You take the key, perform a calculation on it (which is dependent on the length of the key, not the size of the hash), and go directly to that entry in the (associative) array. A database lookup cannot be any faster than O(1), and it can be as bad as O(log N) (I can't imagine any database doing an index lookup slower than a binary search), which is dependent on the number of data points you're comparing against.

    The only way that a database could be faster is if it's a big honkin' box with lots of RAM, and that's a different box from your perl client.

    This problem is one of the primary reasons to use a hash. (Not the only one, but one of them nonetheless.)
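The two lookup costs Tanktalus contrasts can be made concrete side by side. This is a sketch only: the hash lookup is the O(1) case, and the hand-rolled binary search over a sorted list stands in for the O(log N) index lookup a database would do at best. Both functions and the 1..1000 test data are invented for illustration.

```perl
use strict;
use warnings;

# O(1) average case: the key is hashed and the bucket visited directly.
my %hash = map { $_ => 1 } 1 .. 1000;

sub hash_has { exists $hash{ $_[0] } }

# O(log N): binary search over a sorted list, the best case for a
# database index lookup as described above.
my @sorted = sort { $a <=> $b } keys %hash;

sub bsearch_has {
    my ($key) = @_;
    my ($lo, $hi) = (0, $#sorted);
    while ($lo <= $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        if    ($sorted[$mid] < $key) { $lo = $mid + 1 }
        elsif ($sorted[$mid] > $key) { $hi = $mid - 1 }
        else                         { return 1 }
    }
    return 0;
}
```

Both return the same answers; the difference is that the binary search touches about log2(N) entries per lookup while the hash touches one bucket regardless of N.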
