PerlMonks  

That's assuming the key/value pairs are relatively small and unique.

No values are required for lookup purposes. And the length of the keys makes surprisingly little difference to the size of the hash.

For example, for a hash with 11,881,376 keys:

  • From 5-byte keys (the minimum required) to 14-byte keys, the memory required remains almost static at 1.7 GB.
  • From 15 bytes through 29 bytes, it goes to 1.9 GB.
  • From 30 bytes through 46 bytes, it goes to 2.1 GB.
  • From 47 bytes ...., it goes to 2.3 GB.
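
The key count 11,881,376 equals 26**5, which suggests (my assumption; the post does not say so) that the test keys were the five-letter strings 'aaaaa' .. 'zzzzz', padded out to the length under test. A minimal sketch of how such a hash might be built follows. The name build_hash, its parameters and the 'x' padding character are hypothetical, and the memory figures would have to be observed externally, for instance with the OS process monitor.

```perl
use strict;
use warnings;

# Build a hash whose keys are all lowercase strings of $letters letters,
# each right-padded with 'x' to $keylen bytes. The values are undef,
# since only key existence matters for lookups.
sub build_hash {
    my ( $letters, $keylen ) = @_;
    die "keylen must be >= letters\n" if $keylen < $letters;
    my $pad = 'x' x ( $keylen - $letters );
    my %seen;
    # Perl's magical string increment makes 'aaa' .. 'zzz' enumerate
    # every combination in order.
    $seen{ $_ . $pad } = undef for ( 'a' x $letters ) .. ( 'z' x $letters );
    return \%seen;
}

# Small demonstration: 3 letters gives 26**3 = 17,576 keys.
# build_hash( 5, $len ) would give the 11,881,376-key case,
# which needs a couple of GB of RAM.
my $h = build_hash( 3, 14 );
printf "%d keys of length %d\n", scalar( keys %$h ), length( ( keys %$h )[0] );
```

Running it with different second arguments while watching the process size is one way to check the step-wise growth described above.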

Given the OP's description of the application: "I have a list of names in a plain text file, one per line. I also have names in a PSQL database field. I need to filter out name from the file against the database. ", I'm not sure where you think "uniqueness" comes in?

And given the OP's select statement, if duplicate names exist in the DB, they will be ignored, as there would be no way to determine which was matched.
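
As a rough sketch of the hash-based lookup being argued for: the names from the DB become hash keys (so duplicates simply collapse), and each name from the file is tested with exists. The function name filter_names and the sample data are made up for illustration. In practice the first list would come from a single SELECT and the second from reading the file line by line.

```perl
use strict;
use warnings;

# Return the file names that are NOT present among the database names.
# The OP does not say which set is wanted; drop the '!' in the grep
# to get the names present in both instead.
sub filter_names {
    my ( $db_names, $file_names ) = @_;
    my %in_db;
    @in_db{@$db_names} = ();    # keys only, no values; duplicates collapse
    return [ grep { !exists $in_db{$_} } @$file_names ];
}

# Stand-in data (hypothetical).
my @db   = ( 'alice', 'bob', 'bob', 'carol' );
my @file = ( 'bob', 'dave', 'alice', 'erin' );
print "$_\n" for @{ filter_names( \@db, \@file ) };
```

With the sample data this prints dave and erin, the two names that appear in the file but not in the DB list.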

(Remember that names can have 'special characters' or even a different character set.)

As far as I am aware, Perl's hashes handle Unicode keys with aplomb.
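
A small illustration of that, with made-up names: non-ASCII strings work as hash keys like any others. The one thing to watch is that both sides of the comparison are decoded the same way (and normalized the same way, if that matters for the data). This sketch assumes the source file is saved as UTF-8, and known_name is a hypothetical helper.

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 in the source
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical names, stored as keys only.
my %names;
@names{ 'Zoë', 'Łukasz', '山田', "O'Brien" } = ();

# True if $name is a key of the hash referenced by $set.
sub known_name {
    my ( $set, $name ) = @_;
    return exists $set->{$name} ? 1 : 0;
}

for my $candidate ( 'Zoë', 'Zoe', '山田' ) {
    printf "%s: %s\n", $candidate,
        known_name( \%names, $candidate ) ? 'found' : 'missing';
}
```

Note that 'Zoe' without the diaeresis is a different key and is reported as missing.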

I agree with the other posters about performing a bulk load and using appropriate indexes. The optimizer in PostgreSQL (or any modern DBMS) is better suited for these types of data matches IMHO.

In the time it will take you to create the temporary table and bulk load the data, Perl will have finished.

And that's long before you will have:

  1. built your "appropriate indexes";
  2. done your join and extracted the union or intersection of the two datasets (the OP doesn't say which he is after);
  3. either a) transported them back to the calling program, or b) output them to a file;
  4. and cleaned up (discarded) the temporary tables and indexes you constructed.

Don't believe me? I'll do it my way if you'll do it yours and we can compare. I'll let you write the test-set generator.

It might be a different story if, for example, the requirement was for the union of the two datasets to end up in the DB. But in that case, the simplest thing would be to just insert all the records from the file into the DB. There would be little point in individually testing whether they already existed before adding them.

Modern RDBMSs are very good at the things for which they are designed, but this kind of simple lookup isn't one of them.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^3: PSQL and many queries by BrowserUk
in thread PSQL and many queries by citycrew
