First, your CPU problem is due to a basic fact about databases: every time you issue a prepare statement, the database tries to optimize it. Think of it as running a compiler for each query; that takes a lot of CPU. So stop making it do that. This is easy:

    my $statement = "select name_field from table where name_field = ? limit 1";
    my $sth = $dbh->prepare_cached($statement);
    $sth->execute($NAME) or die( $sth->errstr );
    my @row_ary = $sth->fetchrow_array;
    if( @row_ary ){
        $sth->finish;
        return 1;
    }
    else {
        $sth->finish;
        return 0;
    }
Anyone who wants a site to scale should learn this trick and use it, because preparing queries more often than you need to is one of the top causes of scalability problems in databases.

After you've made this change, your CPU usage should be more reasonable. But it will probably still be slow. There are several causes. First, there is network latency: for each row you issue a request and wait for an answer, then the database waits for you to send the next request, and all of that waiting wastes time. Secondly, if you ask for a set of records at once, the database has ways to optimize that request better than if you ask for one row at a time. Thirdly, you have disk latency. Suppose your disk spins at 6000 rpm, which is 100 times per second. Every time you want a piece of data off of disk, you have to wait on average for half a revolution before you can get it. You therefore get about 200 disk seeks per second. If you need to sequentially seek to disk a million times, that takes 5000 seconds minimum. (Databases know about these things, which is part of the reason why bulk operations are faster.)
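To make the second point concrete, here is a minimal sketch of batching lookups into one round trip instead of one query per name. The table and column names follow the snippet above; the `build_in_query` helper and the sample names are made up for illustration.

```perl
use strict;
use warnings;

# Build one query that checks a whole batch of names at once,
# instead of issuing a separate single-row query per name.
# build_in_query is a made-up helper for this sketch.
sub build_in_query {
    my (@names) = @_;
    my $placeholders = join ", ", ("?") x @names;
    return "select name_field from table where name_field in ($placeholders)";
}

my @batch = qw(alice bob carol);
my $sql   = build_in_query(@batch);
print "$sql\n";
# With DBI you would then do roughly:
#   my $sth = $dbh->prepare_cached($sql);
#   $sth->execute(@batch);
```

If you combine this with `prepare_cached`, keep the batch size fixed (pad the last batch by repeating a value), since every distinct placeholder count caches a separate statement.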

If the first change makes your program fast enough, then great. If not, then you've got more work to do. I have used all of the following approaches, and would do so again in the right circumstances. I've listed them in terms of how much work it will be to get from where you are to what they can do:

  1. Use naive parallelism - have several copies of your script running, each working on part of the data set. If done properly this can compensate for network and disk latency, and should make the bottleneck be the database's CPU again.
  2. Make it the database's job. Bulk load your local data into the database and do a join there. (Tip: create a table, upload, and then create the index. If the index is there as you upload, it will be a far slower operation.) I prefer this solution when it is workable because you're less likely to screw up. That said, databases sometimes lack resources for doing larger jobs properly.
  3. Load your local file into a hash and select the whole table from the database. Be warned that PostgreSQL likes to send the whole dataset up front. Look for the section on cursors in DBD::Pg for advice on how to solve that problem.
  4. Sort your local file and select the whole table from the database in sorted order. Then walk the two in parallel as kind of a "zipping" operation. Be sure both are sorted in the same collation order, and again beware of PostgreSQL sending you all your data up front. (Cursors are still your solution.) Note that disk drives are good at accessing lots of data sequentially, so you can sort data on disk (eg with the standard Unix sort utility) fairly efficiently.
  5. Bulk download the database table to a local file then sort and merge.
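A minimal sketch of approach 3, with the local file and the database table both simulated by arrays (an assumption for the sake of a runnable example; in real code the table side would be a `while ( my ($name) = $sth->fetchrow_array ) { ... }` loop):

```perl
use strict;
use warnings;

# Load the local names into a hash, then stream the table rows past
# it and test membership. Hash lookup is O(1) per row, so one pass
# over the table replaces a million individual queries.
my @local_names = qw(alice bob carol);        # stand-in for your local file
my %seen        = map { $_ => 1 } @local_names;

my @table_rows = qw(bob dave carol erin);     # stand-in for the database table
my @matches    = grep { $seen{$_} } @table_rows;
print "@matches\n";                           # bob carol
```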
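The "zipping" walk in approach 4 can be sketched like this; the two sorted arrays here are stand-ins for the sorted local file and a cursor over an `order by` query:

```perl
use strict;
use warnings;

# Walk two sorted lists in parallel, emitting values present in both.
# Each list is read exactly once, in order, which is what makes this
# friendly to sequential disk access.
sub sorted_intersect {
    my ($a_ref, $b_ref) = @_;
    my @out;
    my ($i, $j) = (0, 0);
    while ( $i < @$a_ref and $j < @$b_ref ) {
        my $cmp = $a_ref->[$i] cmp $b_ref->[$j];
        if    ( $cmp < 0 ) { $i++ }             # only in the local list
        elsif ( $cmp > 0 ) { $j++ }             # only in the table
        else               { push @out, $a_ref->[$i]; $i++; $j++ }
    }
    return @out;
}

my @hits = sorted_intersect( [qw(alice bob carol)], [qw(bob carol dave)] );
print "@hits\n";                                # bob carol
```

Note the use of `cmp`: both sides must be sorted with the same string collation, which is exactly the caveat in the list above.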
Note that several of the solutions relied on sorting. How fast is sorting? Here is a pessimistic back-of-the-envelope estimate. Suppose you have 10 million rows of 100 bytes each; that's 1 GB. In a naively written merge sort you'd have to read and write the whole dataset about 25 times each, for 50 GB of total data movement. Suppose your disk drive streams 20 MB of data per second. That will take 2500 seconds, which is under an hour. (Remember that a million disk seeks took twice as long.)

Now I said pessimistic. That is because disk drives are somewhat faster than that, multi-way sorts need fewer passes than that, and properly written sorts replace multiple passes on disk with subsorts in RAM. That sort could well take 5-15 minutes. (Less if you have enough RAM.)
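The arithmetic in that estimate is easy to check; the numbers below are the ones from the text:

```perl
use strict;
use warnings;

# Back-of-the-envelope numbers from the estimate above.
my $rows      = 10_000_000;
my $row_bytes = 100;
my $bytes     = $rows * $row_bytes;        # 1e9 bytes, about 1 GB
my $passes    = 25;                        # naive merge sort: ~25 reads + 25 writes
my $moved     = $bytes * 2 * $passes;      # 50 GB of total data movement
my $stream    = 20_000_000;                # 20 MB/s streaming rate
my $sort_secs = $moved / $stream;          # 2500 seconds, under an hour

my $seeks     = 1_000_000;
my $seek_secs = $seeks / 200;              # 200 seeks/s -> 5000 seconds
print "sort: $sort_secs s, seeks: $seek_secs s\n";
```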

In reply to Re: PSQL and many queries by tilly
in thread PSQL and many queries by citycrew
