Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re: How to remove duplicates from a large set of keys

by Corion (Pope)
on Feb 10, 2005 at 08:16 UTC ( #429621=note: print w/replies, xml ) Need Help??


in reply to How to remove duplicates from a large set of keys

In the end, you will still need to have all keys in memory, or at least accessible, that means, you will need some kind of hash, either a plain (in memory) hash, or a tied hash, that you tie to a dbm file for example.

You can possibly save on the side of the keys, by generating the checksums for the keys yourself, by using Digest::MD5 or something comparable, but that will only help you as long as your keys are on average longer than their MD5 length. You can also consider building a trie of your keys by building a linked list of keys with a common start, either a letter or a string. This increases the number of lookups you need to make, but can reduce the amount of memory you need, of your keys are long enough and have enough common prefixes. Still, a million keys shouldn't eat too much memory - about 1 million*32 bytes for the hash entries, plus the length of the keys in bytes.

  • Comment on Re: How to remove duplicates from a large set of keys

Replies are listed 'Best First'.
Re^2: How to remove duplicates from a large set of keys
by nite_man (Deacon) on Feb 10, 2005 at 08:44 UTC

    Thanks for your repaly, Corion.

    In the end, you will still need to have all keys in memory, or at least accessible
    Why, in case of using a database I can just try to insert a new value. If that value is already exists in the table I'll get an exception 'Cannot insert a duplicated value bla-bla-bla'. But otherwise a new value will be inserted the the table.

    a million keys shouldn't eat too much memory
    The most important criterion for me is a speed of processing of new values. I haven't use databse approach yet but in case of using a hash a processing of one value takes about 40 seconds with 1 million hash keys. But the number of keys is increased and the time increased too.

    ---
    Michael Stepanov aka nite_man

    It's only my opinion and it doesn't have pretensions of absoluteness!

      Whether you have your million records in memory (fast) or on disk in a database (slow), you have to take the time to insert your new data. Looking up existing data is different - as explained, looking up in a hash is O(1): you take the key, perform a calculation on it (which is dependant on the length of the key, not the size of the hash), and go to that entry in the (associative) array. Looking up in a database cannot be any faster than O(1). It can be as bad as O(log N) (I can't imagine any database doing an index lookup any slower than a binary search), which is dependant on the number of data points you're comparing to.

      The only way that a database could be faster is if it's a big honkin' box with lots of RAM, and that's a different box from your perl client.

      This problem is one of the primary reasons to use a hash. (Not the only one, but one of them nonetheless.)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://429621]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (2)
As of 2021-07-24 06:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?