PerlMonks  

Re^2: PSQL and many queries

by jfroebe (Vicar)
on Mar 24, 2009 at 02:40 UTC ( #752773=note )


in reply to Re: PSQL and many queries
in thread PSQL and many queries

That's assuming the key/value pairs are relatively small and unique. Unless I misread the OP, we don't know how much data is really involved or the complexity of it. (Remember that names can have 'special characters' or even a different character set.)

I agree with the other posters about performing a bulk load and using appropriate indexes. The optimizer in PostgreSQL (or any modern DBMS) is better suited for these types of data matches IMHO.

Jason L. Froebe

Blog, Tech Blog


Re^3: PSQL and many queries
by BrowserUk (Pope) on Mar 24, 2009 at 04:55 UTC
    That's assuming the key/value pairs are relatively small and unique.

    No values are required for lookup purposes. And the length of the keys makes surprisingly little difference to the size of the hash.

    For example, for a hash with 11,881,376 keys:

    • From 5-byte keys (the minimum required) to 14-byte keys, the memory required remains almost static at 1.7 GB.
    • From 15 bytes through 29 bytes, it goes to 1.9 GB.
    • From 30 bytes through 46 bytes, it goes to 2.1 GB.
    • From 47 bytes onward, it goes to 2.3 GB.
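    Figures like these can be measured directly. A minimal sketch, assuming the CPAN module Devel::Size is installed (it is not core Perl), and scaled down to 1,000,000 keys so it runs in reasonable time; Perl's magic string increment keeps every key at a fixed length:

```perl
use strict;
use warnings;
use Devel::Size qw(total_size);    # CPAN module, assumed installed

for my $len ( 5, 15, 30, 47 ) {
    my %h;
    my $key = 'a' x $len;          # 'aaaaa', 'aaaab', ... stays $len chars
    $h{ $key++ } = undef for 1 .. 1_000_000;
    printf "%2d-byte keys: %4.0f MB\n", $len, total_size( \%h ) / 2**20;
}
```

    The values are all undef, so almost everything measured is keys plus the hash's own bookkeeping, which is why the totals move so little as the keys grow.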

    Given the OP's description of the application: "I have a list of names in a plain text file, one per line. I also have names in a PSQL database field. I need to filter out name from the file against the database.", I'm not sure where you think "uniqueness" comes in?

    And given the OP's select statement, if duplicate names exist in the DB, they will be ignored, as there would be no way to determine which was matched.
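    The filtering the OP describes boils down to one DB pass to populate a hash, then one pass over the file. A minimal sketch, assuming DBI/DBD::Pg and hypothetical table, column, and file names (people, name, names.txt):

```perl
use strict;
use warnings;
use DBI;

# Hypothetical connection details.
my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'secret',
    { RaiseError => 1 } );

# Key a hash on every name in the DB; the values are never used.
my %in_db;
my $sth = $dbh->prepare('SELECT name FROM people');
$sth->execute;
while ( my $row = $sth->fetchrow_arrayref ) {
    $in_db{ $row->[0] } = undef;
}

# Stream the file and keep only the names not found in the DB.
open my $fh, '<', 'names.txt' or die "names.txt: $!";
while ( my $name = <$fh> ) {
    chomp $name;
    print "$name\n" unless exists $in_db{$name};
}
```

    Swap the `unless` for `if` to get the intersection instead of the difference; the hash lookup itself is the same either way.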

    (Remember that names can have 'special characters' or even a different character set.)

    As far as I am aware, Perl's hashes handle unicode keys with aplomb.
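    A quick demonstration that hash keys can be arbitrary Unicode strings (the names here are made up for illustration):

```perl
use strict;
use warnings;
use utf8;                              # source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

my %seen = ( 'Jürgen' => undef, '小林' => undef, "O'Brien" => undef );
print exists $seen{'Jürgen'} ? "found\n" : "missing\n";    # prints "found"
```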

    I agree with the other posters about performing a bulk load and using appropriate indexes. The optimizer in PostgreSQL (or any modern DBMS) is better suited for these types of data matches IMHO.

    In the time it will take you to create the temporary table and bulk load the data, Perl will have finished.

    And that's long before you will have:

    1. built your "appropriate indexes";
    2. done your join and extracted the union or intersection of the two datasets (the OP doesn't say which he is after);
    3. and either a) transported them back to the calling program; or b) output them to a file.
    4. cleaned up (discarded) the temporary tables and indexes you constructed.

    Don't believe me? I'll do it my way if you'll do it yours and we can compare. I'll let you write the test-set generator.

    It might be a different story if, for example, the requirement was for the union of the two datasets to end up in the DB. But in that case, the simplest thing would be to just insert all the records from the file into the DB. There would be little point in individually testing whether they already existed before adding them.

    Modern RDBMSs are very good for those things for which they are designed, but this kind of simple lookup isn't it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      No values are required for lookup purposes. And the length of the keys makes surprisingly little difference to the size of the hash.

      For example, for a hash with 11,881,376 keys:

      • From 5-byte keys (the minimum required) to 14-byte keys, the memory required remains almost static at 1.7 GB.
      • From 15 bytes through 29 bytes, it goes to 1.9 GB.
      • From 30 bytes through 46 bytes, it goes to 2.1 GB.
      • From 47 bytes onward, it goes to 2.3 GB.

      So if I have 11,881,376 keys with an average key size of, say, 250 bytes, it would only be 2.3 GB? Same with 400 bytes? Or 500 bytes?
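      The quoted bands stop at 47 bytes, so any answer beyond that is extrapolation. Still, they grow by roughly 0.2 GB per ~16 extra bytes of key (about what 11,881,376 × 16 bytes works out to), and a back-of-envelope sketch on that assumption suggests much larger keys cost considerably more than 2.3 GB:

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Assumption: 1.7 GB up to 14-byte keys, then ~0.2 GB per further
# 16-byte step, extrapolated well past the measured 47-byte band.
sub est_gb {
    my $key_len = shift;
    return 1.7 if $key_len <= 14;
    return 1.7 + 0.2 * ceil( ( $key_len - 14 ) / 16 );
}

printf "%3d bytes: ~%.1f GB\n", $_, est_gb($_) for 250, 400, 500;
```

      On that (unverified) assumption, 250-byte keys would land nearer 4.7 GB than 2.3 GB, with 400 and 500 bytes higher still.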

      Jason L. Froebe

      Blog, Tech Blog
