http://www.perlmonks.org?node_id=752740

citycrew has asked for the wisdom of the Perl Monks concerning the following question:

SITUATION:
I have to check many lines of a file against records in a database. As an example, I have a list of names in a plain text file, one per line. I also have names in a PSQL database field. I need to filter out names from the file against the database. There may be anywhere from 10,000 to 10,000,000 names in the file, and anywhere from 1,000 to 2,000,000 names in the database.

CURRENT SETUP:
I am currently doing the following SQL call inside the file loop:

my $statement = "select name_field from table where name_field = '$NAME' limit 1";
my $sth = $dbh->prepare($statement);
$sth->execute or die( $sth->errstr );
my @row_ary = $sth->fetchrow_array;
if( @row_ary ){
    $sth->finish;
    return 1;
}
else {
    $sth->finish;
    return 0;
}

PROBLEM:
It is very slow and uses a lot of CPU. I have read over the DBD-Pg documentation to find a better way to use prepare/bind/execute statements, but couldn't fully understand it. I know I can clean up the current code (and will) but would like some direction before doing so.

Any help is much appreciated.

Replies are listed 'Best First'.
Re: PSQL and many queries
by tilly (Archbishop) on Mar 24, 2009 at 00:58 UTC
    First your CPU problem is due to a basic fact about databases. Every time you issue a prepare statement, the database tries to optimize it. Think of it as running a compiler for each query. That takes a lot of CPU. So stop making it do that. This is easy:
    my $statement = "select name_field from table where name_field = ? limit 1";
    my $sth = $dbh->prepare_cached($statement);
    $sth->execute($NAME) or die( $sth->errstr );
    my @row_ary = $sth->fetchrow_array;
    if( @row_ary ){
        $sth->finish;
        return 1;
    }
    else {
        $sth->finish;
        return 0;
    }
    Anyone who wants a site to scale should learn this trick and use it. Because preparing queries more often than you need to is one of the top causes of scalability issues in databases.

    After you've made this change, your CPU usage should be more reasonable. But it will probably still be slow. There are many causes. First, there is network latency. For each row you issue a request, wait for an answer, then the database waits for you to send another request. All of that waiting wastes time. Secondly, if you're asking for a set of records at once, the database has ways to optimize that request better than if you're asking for one row at a time. Thirdly, you have disk latency. You have a disk spinning at 6000 rpm, which is 100 times per second. Every time you want a piece of data off disk, you have to wait on average for half a revolution before you can get it. You therefore get about 200 disk seeks per second. If you need to seek to disk a million times, one after another, that takes 5000 seconds minimum. (Databases know about these things, which is part of the reason why bulk operations are faster.)

    If the first change makes your program fast enough, then great. If not, then you've got more work to do. I have used all of the following approaches, and would do so again in the right circumstances. I've listed them in terms of how much work it will be to get from where you are to what they can do:

    1. Use naive parallelism - have several copies of your script running, each working on part of the data set. If done properly this can compensate for network and disk latency, and should make the bottleneck be the database's CPU again.
    2. Make it the database's job. Bulk load your local data into the database and do a join there. (Tip: create a table, upload, and then create the index. If the index is there as you upload, it will be a far slower operation.) I prefer this solution when it is workable because you're less likely to screw up. That said, databases sometimes lack resources for doing larger jobs properly.
    3. Load your local file into a hash and select the whole table from the database. Be warned that PostgreSQL likes to send the whole dataset up front. Look for the section on cursors in DBD::Pg for advice on how to solve that problem. (See the sketch after this list.)
    4. Sort your local file and select the whole table from the database in sorted order. Then walk the two in parallel as kind of a "zipping" operation. Be sure both are sorted in the same collation order, and again beware of PostgreSQL sending you all your data up front. (Cursors are still your solution.) Note that disk drives are good at accessing lots of data sequentially, so you can sort data on disk (eg with the standard Unix sort utility) fairly efficiently.
    5. Bulk download the database table to a local file then sort and merge.
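    For illustration, a minimal sketch of approach 3 might look like the following. It assumes one name per line in the file; the connection details, names_file.txt, names_table and the FETCH batch size are all placeholders, and name_field is the column name from the OP's example. It ends up printing the file names that are not in the database (flip that if you want the matches), and the cursor is what keeps DBD::Pg from handing you the whole table in one go:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
        { RaiseError => 1, AutoCommit => 1 } );

    # Load the flat file into a hash first (approach 3).
    my %from_file;
    open my $fh, '<', 'names_file.txt' or die "names_file.txt: $!";
    while ( my $name = <$fh> ) {
        chomp $name;
        $from_file{$name} = 1;
    }
    close $fh;

    # Stream the name column through a server-side cursor so DBD::Pg
    # does not buffer the entire table on the client.
    $dbh->begin_work;    # a cursor (without WITH HOLD) needs a transaction
    $dbh->do('DECLARE name_csr NO SCROLL CURSOR FOR SELECT name_field FROM names_table');
    while (1) {
        my $batch = $dbh->selectcol_arrayref('FETCH 10000 FROM name_csr');
        last unless @$batch;
        delete @from_file{@$batch};    # drop file names that are in the DB
    }
    $dbh->do('CLOSE name_csr');
    $dbh->commit;
    $dbh->disconnect;

    # What's left are the file names with no match in the database.
    print "$_\n" for sort keys %from_file;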
    Note that several of the solutions relied on sorting. How fast is sorting? Here is a pessimistic back-of-the-envelope estimate. Suppose you have 10 million rows which are 100 bytes each. That's 1 GB. In a naively written merge sort you'd have to read and write the whole dataset about 25 times each, for 50 GB of total data movement. Suppose your disk drive streams 20 MB of data per second. That will take 2500 seconds, which is under an hour. (Remember that a million disk seeks took twice as long.)

    Now I said pessimistic. That is because disk drives are somewhat faster than that, multi-way sorts need fewer passes than that, and properly written sorts replace multiple passes on disk with subsorts in RAM. That sort could well take 5-15 minutes. (Less if you have enough RAM.)

Re: PSQL and many queries
by BrowserUk (Patriarch) on Mar 24, 2009 at 00:53 UTC

    It would almost certainly be considerably quicker to query the names from the DB in bulk and build a hash in perl. And then just process your file against the hash.

    A 10 million key hash should take 1.5 GB. And looking up 2 million keys against a 10 million key hash takes around 5 seconds.
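    A minimal sketch of that, reusing the placeholder names (names_table, name_field, names_file.txt, connection details) from the sketch under tilly's reply above; whether you want the matches or the non-matches is an assumption, so flip the test to taste:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
        { RaiseError => 1, AutoCommit => 1 } );

    # One bulk query; only the keys matter, so the values can stay undef.
    my %in_db;
    my $names = $dbh->selectcol_arrayref('SELECT name_field FROM names_table');
    @in_db{@$names} = ();

    # Then the file is processed line by line against the hash.
    open my $fh, '<', 'names_file.txt' or die "names_file.txt: $!";
    while ( my $name = <$fh> ) {
        chomp $name;
        print "$name\n" unless exists $in_db{$name};    # keep names not in the DB
    }
    close $fh;
    $dbh->disconnect;

    Note that selectcol_arrayref pulls the whole column to the client in one go, which is fine as long as it fits in memory; the cursor variant shown earlier is the workaround when it doesn't.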


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      That's assuming the key/value pairs are relatively small and unique. Unless I misread the OP, we don't know how much data is really involved or the complexity of it. (Remember that names can have 'special characters' or even a different character set.)

      I agree with the other posters about performing a bulk load and using appropriate indexes. The optimizer in PostgreSQL (or any modern DBMS) is better suited for these types of data matches IMHO.

      Jason L. Froebe

      Blog, Tech Blog

        That's assuming the key/value pairs are relatively small and unique.

        No values are required for lookup purposes. And the length of the keys makes surprisingly little difference to the size of the hash.

        For example, for a hash with 11,881,376 keys:

        • From 5-byte keys (the minimum required) to 14-byte keys, the memory required remains almost static at 1.7 GB.
        • From 15-byte through 29-byte keys, it goes to 1.9 GB.
        • From 30-byte through 46-byte keys, it goes to 2.1 GB.
        • From 47-byte keys upward, it goes to 2.3 GB.

        Given the OP's description of the application: "I have a list of names in a plain text file, one per line. I also have names in a PSQL database field. I need to filter out names from the file against the database.", I'm not sure where you think "uniqueness" comes in?

        And given the OP's select statement, if duplicate names exist in the DB, they will be ignored, as there would be no way to determine which was matched.

        (Remember that names can have 'special characters' or even a different character set.)

        As far as I am aware, Perl's hashes handle Unicode keys with aplomb.

        I agree with the other posters about performing a bulk load and using appropriate indexes. The optimizer in PostgreSQL (or any modern DBMS) is better suited for these types of data matches IMHO.

        In the time it will take you to create the temporary table and bulk load the data, Perl will have finished.

        And that's long before you will have:

        1. built your "appropriate indexes";
        2. done your join and extracted the union or intersection of the two datasets (the OP doesn't say which he is after);
        3. and either a) transported them back to the calling program; or b) output them to a file.
        4. cleaned up (discarded) the temporary tables and indexes you constructed.

        Don't believe me? I'll do it my way if you'll do it yours and we can compare. I'll let you write the test sets generator.

        It might be a different story if, for example, the requirement was for the union of the two datasets to end up in the DB. But in that case, the simplest thing would be to just insert all the records from the file into the DB. There would be little point in individually testing whether they already existed before adding them.

        Modern RDBMSs are very good for those things for which they are designed, but this kind of simple lookup isn't it.


      That's fine for a 10 million key hash. But what happens when he needs to handle a 20 million key hash and passes the addressing limit?

        When (if?) that happens, then a slower and more complex solution would be required.

        Even then, the simple expedient of splitting the task into two passes would double the life of the solution and would still work out faster and simpler than moving the processing into an RDBMS.
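        For illustration, a two-pass version of the hash approach might look something like this. bucket_of, names_table, names_file.txt and the FETCH size are placeholders, and the DB names are streamed with the same cursor trick as in the sketch under tilly's reply, so neither pass holds more than roughly half the names in memory:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
            { RaiseError => 1, AutoCommit => 1 } );

        # Any stable rule works, as long as the SAME rule is applied to the
        # DB names and to the file names (here: first byte, two buckets).
        sub bucket_of { ord( substr $_[0], 0, 1 ) % 2 }

        for my $bucket ( 0, 1 ) {
            my %in_db;

            $dbh->begin_work;
            $dbh->do('DECLARE name_csr NO SCROLL CURSOR FOR SELECT name_field FROM names_table');
            while (1) {
                my $batch = $dbh->selectcol_arrayref('FETCH 10000 FROM name_csr');
                last unless @$batch;
                $in_db{$_} = 1 for grep { bucket_of($_) == $bucket } @$batch;
            }
            $dbh->do('CLOSE name_csr');
            $dbh->commit;

            open my $fh, '<', 'names_file.txt' or die "names_file.txt: $!";
            while ( my $name = <$fh> ) {
                chomp $name;
                next unless bucket_of($name) == $bucket;    # this pass's share only
                print "$name\n" unless exists $in_db{$name};
            }
            close $fh;
        }
        $dbh->disconnect;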

        But given the spread of the OP's stated range--10K to 10M--it seems likely that 10 million is an extreme, future-proofing upper bound already.

        The OP (presumably) knows his problem space. For us to make assumptions to the contrary and offer more complex solutions on the basis that it might allow for some unpredicted future growth would be somewhat patronising.

        And given the simplicity of the solution, it's not as if he would have to throw away some huge amount of development effort if he did reach its limits at some point in the future.


Re: PSQL and many queries
by fzellinger (Acolyte) on Mar 23, 2009 at 23:40 UTC
    Having tackled problems like this before, I don't think that there will be a solution as simple as "prepare your SQL statement this way and this will run 200x faster".

    We have to look at the resources being used when running your code against the database: CPU, MEMORY, DISK, NETWORK. A good model to keep in your head is that NETWORK is 1000x slower than DISK, which is 1000x slower than MEMORY, which is 10x slower than CPU. Your performance will be dictated by the amount of each resource you use to solve this problem. If your solution involves:
    1. ...crunching a bunch of data in the registers of the CPU only, it will be very fast.
    2. ...moving blocks of MEMORY around and doing some CPU crunching, it will be fast.
    3. ...loading lots of data from the DISK to MEMORY repeatedly to be crunched by the CPU, it won't be fast.
    4. ...moving large blocks of data across the NETWORK so that it can be loaded to MEMORY (and perhaps temporarily stored on DISK) before CPU crunching, it will really suck.

    In your situation, where is the database relative to your perl code? If the database and your code are running on the same CPU and the entire database and data file can be loaded into MEMORY, then everything will run fast. If your datafile and database are large (2M to 10M rows in your worst case scenario) then a lot of DISK activity may be involved to get the data up to your CPU. Having your data file and perl code on a CPU which is different than your database CPU (separated by a network) will be problematic if you need to transfer data many times (within loops).

    Assuming that you have enough MEMORY where your perl code is running, you can load the list of database records (names) into a hash and use it to check the lines of the file. This would be the best approach when the database is small (less than 10K rows) and the data file is large (greater than 1M rows).

    If your database is large (greater than 1M rows) and your data file is REALLY small (10 rows), then your looping approach above will work fine. If your data file is slightly larger, but not too big (less than 50K rows), then you may want to simply load your data file into a temporary database table and use an SQL statement to join your records against the main lookup database. The reason this might be faster is that execution of SQL statements requires the database to parse the statement and then perhaps load statistics and indexes to find the data you want. You can save on SQL parsing time by doing prepare/bind/execute, but the database will still have to perform the actual lookup of indexes and data, which may require moving around a lot of MEMORY, or worst case loading a lot of blocks from DISK. When running 50K worth of SQL statements, the database engine has to perform complete lookup operations on each row. If you instead load 50K rows of data and ask the database (via one SQL statement) to match up the two name lists, the database engine can optimize how it performs lookups and streamline the entire process.

    All of this advice is fairly general, but which approach you use depends on which end of the spectrum your scenario falls into...and your starting criteria (10K to 10M vs 1K to 1M variability) are very broad.

      While I appreciate that you put a lot of energy into this response, I don't agree with a lot of the advice you give.

      For instance, if there is an index on the name, then the loop will work fine for much, much larger datasets than just 10 elements. I don't know where you get a 50K upper limit on a join from. I've joined million row tables in the database before and been satisfied with the result. You do a lot of talking about issues with disk without recognizing the important difference between seeking to disk and accessing disk sequentially, etc.

Re: PSQL and many queries
by roboticus (Chancellor) on Mar 24, 2009 at 00:01 UTC
    citycrew:

    I'd suggest using your database's bulk-loading tool to load the names from the file into a table, then use an SQL statement to do the heavy lifting. Bulk loading is typically the fastest way to get the data into the database, and you can then add a key to the table and join the tables to do your filtering for you. Doing a record-by-record sequence can be painfully slow because of all the communications overhead.
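    For illustration, that route might look something like the following with DBD::Pg's COPY support (pg_putcopydata/pg_putcopyend). The file_names and names_table names, the connection details, and the choice to return the non-matches are all assumptions, and it assumes the names contain no tabs or backslashes that COPY's text format would interpret:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
        { RaiseError => 1, AutoCommit => 1 } );

    # Temporary table lives only for this session; index it *after* the load.
    $dbh->do('CREATE TEMPORARY TABLE file_names (name text)');

    # Bulk-load the flat file with COPY ... FROM STDIN.
    $dbh->do('COPY file_names (name) FROM STDIN');
    open my $fh, '<', 'names_file.txt' or die "names_file.txt: $!";
    while ( my $line = <$fh> ) {
        $dbh->pg_putcopydata($line);    # one name per line, newline kept
    }
    close $fh;
    $dbh->pg_putcopyend();

    # Index and analyze only after the bulk load, as suggested above.
    $dbh->do('CREATE INDEX file_names_idx ON file_names (name)');
    $dbh->do('ANALYZE file_names');

    # File names with no match in the main table (swap the test for matches).
    my $sth = $dbh->prepare(q{
        SELECT f.name
          FROM file_names f
          LEFT JOIN names_table t ON t.name_field = f.name
         WHERE t.name_field IS NULL
    });
    $sth->execute;
    while ( my ($name) = $sth->fetchrow_array ) {
        print "$name\n";
    }
    $dbh->disconnect;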

    ...roboticus
Re: PSQL and many queries
by sundialsvc4 (Abbot) on Mar 24, 2009 at 02:59 UTC

    Do it the COBOL way:

    1. Sort the file in ascending order by name, using a disk-based sort. (Don't worry... the operation is uncommonly fast.)
    2. Query the database, ORDER BY name. If you like, go ahead and extract this into another flat-file.
    3. Now you are presented with two identically sorted data-streams. The operation that you are now looking to perform is called a merge.

    It is possible to perform this operation (of course, without writing any new code to do so yourself...) by processing the two files sequentially. No searching is required. “10 million records?” No problem!
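    A minimal sketch of that merge step in Perl, assuming both inputs have already been sorted identically (the flat file with an external sort such as the Unix sort utility, the DB extract with a matching ORDER BY) and written one name per line to the placeholder files below; it prints the file names that have no match in the DB stream:

    use strict;
    use warnings;

    # Both inputs must be sorted in the same order, ideally plain byte order
    # (e.g. LC_ALL=C sort), so that Perl's string comparison agrees with it.
    open my $file, '<', 'file.sorted' or die "file.sorted: $!";
    open my $db,   '<', 'db.sorted'   or die "db.sorted: $!";

    my $d = <$db>;
    chomp $d if defined $d;

    while ( defined( my $f = <$file> ) ) {
        chomp $f;

        # Advance the DB stream until it catches up with the current file name.
        while ( defined $d && $d lt $f ) {
            $d = <$db>;
            chomp $d if defined $d;
        }

        # No match in the DB stream: keep it (invert the test for the matches).
        print "$f\n" unless defined $d && $d eq $f;
    }
    close $file;
    close $db;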

    (If you ever watched old sci-fi movies and wondered “what all those spinning tapes were doing” ... now you know.)

Re: PSQL and many queries
by citycrew (Acolyte) on Mar 24, 2009 at 01:12 UTC

    Thanks for the feedback guys!

    fzellinger:
    The database and code will be on the same machine. Sorry for the broad criteria; I should have just put in the ceiling values. They could get very high, which is where I needed the feedback.

    My first thoughts were to use the system memory to load data file lines into a hash and load the DB records into an array (or vice versa) and check against each other. My hesitation on that was the memory load if the lists are large and more than one user runs this script.

    I like the idea of dumping the data file's names into a table and querying against that!

    I'll look into doing both solutions. If the data file and database aren't that large, I'll load them into memory with hashes; otherwise I'll dump the data file into a table and query against it.

    tilly, your approaches will take me a few days to process; maybe I could write a script to help me process your solutions quicker.. =o) Great food for thought.

    Thanks again for the feedback!

      My first thoughts were to use the system memory to load data file lines into a hash and load the DB records into an array (or vice versa) and check against each other. My hesitation on that was the memory load if the lists are large and more than one user runs this script.

      There's no need to load all of both datasets into memory.

      If you load the names from the DB into a hash (which DBI will do for you in one hit--select names from table; fetchall_hashref), then you can process your file line by line.
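      For reference, the one-hit call wants a key column name; something along these lines (selectall_hashref just wraps prepare/execute/fetchall_hashref, and names_table/names_file.txt are the same placeholders as in the sketches above):

      # $dbh connected as in the earlier sketches.
      my $in_db = $dbh->selectall_hashref(
          'SELECT name_field FROM names_table', 'name_field' );

      open my $fh, '<', 'names_file.txt' or die "names_file.txt: $!";
      while ( my $name = <$fh> ) {
          chomp $name;
          print "$name\n" unless exists $in_db->{$name};
      }
      close $fh;

      Each value in that hash is itself a row hashref, so if only the keys matter, a selectcol_arrayref plus a hash slice (as in the earlier sketch) is a little lighter on memory.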


        There's no need to load all of both datasets into memory. If you load the names from the DB into a hash (which DBI will do for you in one hit--select names from table; fetchall_hashref), then you can process your file line by line.

        Of course! Thanks, BrowserUk.