Re: PSQL and many queries

by BrowserUk (Pope)
on Mar 24, 2009 at 00:53 UTC ( #752760=note )


in reply to PSQL and many queries

It would almost certainly be considerably quicker to query the names from the DB in bulk and build a hash in Perl, and then just process your file against the hash.

A 10-million-key hash should take 1.5 GB. And looking up 2 million keys against a 10-million-key hash takes around 5 seconds.
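
A minimal sketch of that approach, assuming DBI/DBD::Pg; the DSN, table name (names), column name (name), and file name (names.txt) are placeholders, not the OP's actual schema:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
        { RaiseError => 1 } );

    # Pull every name from the DB in one pass and build a lookup hash.
    my %inDb;
    my $sth = $dbh->prepare( 'SELECT name FROM names' );
    $sth->execute;
    while( my( $name ) = $sth->fetchrow_array ) {
        $inDb{ $name } = undef;         # existence is all that matters
    }

    # Then filter the file against the hash (printing the matches here).
    open my $fh, '<', 'names.txt' or die $!;
    while( my $name = <$fh> ) {
        chomp $name;
        print $name, "\n" if exists $inDb{ $name };
    }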


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: PSQL and many queries
by jfroebe (Vicar) on Mar 24, 2009 at 02:40 UTC

    That's assuming the key/value pairs are relatively small and unique. Unless I misread the OP, we don't know how much data is really involved, or how complex it is. (Remember that names can have 'special characters' or even a different character set.)

    I agree with the other posters about performing a bulk load and using appropriate indexes. The optimizer in PostgreSQL (or any modern DBMS) is better suited for these types of data matches IMHO.

    Jason L. Froebe

    Blog, Tech Blog

      That's assuming the key/value pairs are relatively small and unique.

      No values are required for lookup purposes. And the length of the keys makes surprisingly little difference to the size of the hash.

      For example, for a hash with 11,881,376 keys:

      • From 5-byte keys (the minimum required) to 14-byte keys, the memory required remains almost static at 1.7 GB.
      • From 15-byte through 29-byte keys, it goes to 1.9 GB.
      • From 30-byte through 46-byte keys, it goes to 2.1 GB.
      • From 47-byte keys upward, it goes to 2.3 GB.
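
      (For the curious: 11,881,376 is exactly 26**5, so figures like those can presumably be reproduced with a throwaway script along these lines, using 'aaaaa' .. 'zzzzz' as the keys; the padding scheme here is an assumption.)

          #!/usr/bin/perl
          # Build 26**5 = 11,881,376 keys of a chosen length, then hold
          # the process so its size can be read from ps/top.
          use strict;
          use warnings;

          my $pad = 'x' x ( shift || 0 );   # bytes beyond the 5-byte minimum

          my %hash;
          $hash{ $_ . $pad } = undef for 'aaaaa' .. 'zzzzz';

          printf "%d keys of %d bytes each; inspect memory now\n",
              scalar keys %hash, 5 + length $pad;
          <STDIN>;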

      Given the OP's description of the application: "I have a list of names in a plain text file, one per line. I also have names in a PSQL database field. I need to filter out name from the file against the database. ", I'm not sure where you think "uniqueness" comes in?

      And given the OP's select statement, if duplicate names exist in the DB, they will be ignored, as there would be no way to determine which was matched.

      (Remember that names can have 'special characters' or even a different character set.)

      As far as I am aware, Perl's hashes handle Unicode keys with aplomb.
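
      A trivial demonstration (made-up names); the only care needed is decoding the file with the right layer so its strings compare equal to the DB's:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use utf8;                             # this source is UTF-8
          binmode STDOUT, ':encoding(UTF-8)';

          my %seen = map { $_ => undef } 'Björk', 'Renée', '田中';
          print "found\n" if exists $seen{ 'Björk' };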

      I agree with the other posters about performing a bulk load and using appropriate indexes. The optimizer in PostgreSQL (or any modern DBMS) is better suited for these types of data matches IMHO.

      In the time it will take you to create the temporary table and bulk load the data, Perl will have finished.

      And that's long before you will have:

      1. built your "appropriate indexes";
      2. done your join and extracted the union or intersection of the two datasets (the OP doesn't say which he is after);
      3. either a) transported them back to the calling program, or b) output them to a file;
      4. and cleaned up (discarded) the temporary tables and indexes you constructed (all sketched below).
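
      For reference, roughly what that route involves via DBD::Pg's COPY support (DSN, table and column names again placeholders):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
              { RaiseError => 1 } );

          # temporary table + bulk load
          $dbh->do( 'CREATE TEMPORARY TABLE file_names ( name text )' );
          $dbh->do( 'COPY file_names FROM STDIN' );
          open my $fh, '<', 'names.txt' or die $!;
          $dbh->pg_putcopydata( $_ ) while <$fh>;
          $dbh->pg_putcopyend;

          # "appropriate indexes"
          $dbh->do( 'CREATE INDEX file_names_idx ON file_names ( name )' );

          # the join (intersection shown; EXCEPT would give the difference)
          my $sth = $dbh->prepare(
              'SELECT d.name FROM names d JOIN file_names f ON f.name = d.name' );
          $sth->execute;
          while( my $row = $sth->fetchrow_arrayref ) {
              print $row->[0], "\n";
          }

          # the temp table (and its index) vanish at disconnect
          $dbh->disconnect;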

      Don't believe me? I'll do it my way if you'll do it yours and we can compare. I'll let you write the test-set generator.

      It might be a different story if, for example, the requirement were for the union of the two datasets to end up in the DB. But in that case, the simplest thing would be to just insert all the records from the file into the DB. There would be little point in individually testing whether they already existed before adding them.

      Modern RDBMSs are very good at the things for which they are designed, but this kind of simple lookup isn't one of them.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        No values are required for lookup purposes. And the length of the keys makes surprisingly little difference to the size of the hash.

        For example, for a hash with 11,881,376 keys:

        • From 5-byte keys (the minimum required) to 14-byte keys, the memory required remains almost static at 1.7 GB.
        • From 15-byte through 29-byte keys, it goes to 1.9 GB.
        • From 30-byte through 46-byte keys, it goes to 2.1 GB.
        • From 47-byte keys upward, it goes to 2.3 GB.

        So if I have 11,881,376 keys with an average key size of, say, 250 bytes, it would only be 2.3 GB? Same with 400 bytes? Or 500 bytes?

        Jason L. Froebe

        Blog, Tech Blog

Re^2: PSQL and many queries
by tilly (Archbishop) on Mar 24, 2009 at 14:36 UTC
    That's fine for a 10-million-key hash. But what happens when he needs to handle a 20-million-key hash and passes the addressing limit?

      When (if?) that happens, a slower and more complex solution will be required.

      Even then, the simple expedient of splitting the task into two passes would double the life of the solution and would still work out faster and simpler than moving the processing into an RDBMS (see the sketch below).
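
      A sketch of that expedient, splitting the key space at a (hypothetical) midpoint letter so that each pass hashes only half the DB's keys; DSN, table, column, and file names are again placeholders:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
              { RaiseError => 1 } );

          for my $clause ( "name < 'n'", "name >= 'n'" ) {
              # Build the hash for this half of the key space only.
              my %half;
              my $sth = $dbh->prepare( "SELECT name FROM names WHERE $clause" );
              $sth->execute;
              while( my( $name ) = $sth->fetchrow_array ) {
                  $half{ $name } = undef;
              }

              # Scan the file; names outside this range miss now and
              # are caught by the other pass.
              open my $fh, '<', 'names.txt' or die $!;
              while( my $name = <$fh> ) {
                  chomp $name;
                  print $name, "\n" if exists $half{ $name };
              }
              close $fh;
          }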

      But given the spread of the OP's stated range--10k to 10m--it seems likely that 10 million is an extreme, future-proofing upper bound already.

      The OP (presumably) knows his problem space. For us to make assumptions to the contrary and offer more complex solutions, on the basis that they might allow for some unpredicted future growth, would be somewhat patronising.

      And given the simplicity of the solution, it's not as if he would have to throw away some huge amount of development effort if he did reach its limits at some point in the future.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
