citycrew has asked for the wisdom of the Perl Monks concerning the following question:
SITUATION:
I have to check many lines of a file against records in a database. As an example, I have a list of names in a plain text file, one per line. I also have names in a PostgreSQL database field. I need to filter the names from the file against the database. There may be anywhere from 10,000 to 10,000,000 names in the file, and anywhere from 1,000 to 2,000,000 names in the database.
CURRENT SETUP:
I am currently doing the following SQL call inside the file loop:
my $statement = "select name_field from table where name_field = '$NAME' limit 1";
my $sth = $dbh->prepare($statement);
$sth->execute or die( $sth->errstr );
my @row_ary = $sth->fetchrow_array;
if( @row_ary ){
    $sth->finish;
    return 1;
}
else {
    $sth->finish;
    return 0;
}
PROBLEM:
It is very slow and uses a lot of CPU. I have read through the DBD::Pg documentation looking for a better way to use prepare/bind/execute statements, but couldn't fully understand it. I know I can clean up the current code (and will), but would like some direction before doing so.
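For reference, a minimal sketch of the prepare-once pattern: the statement is prepared a single time with a placeholder and re-executed per name, instead of being re-prepared inside the loop. The table/column names come from the snippet above; the connected $dbh handle is assumed to already exist.

```perl
use strict;
use warnings;

# Sketch only: prepare once, with a placeholder, then re-execute for
# each name.  Table/column names follow the snippet above; a connected
# DBI handle ($dbh) is assumed and passed in.
sub make_name_checker {
    my ($dbh) = @_;
    my $sth = $dbh->prepare(
        "select name_field from table where name_field = ? limit 1");
    return sub {
        my ($name) = @_;
        $sth->execute($name) or die $sth->errstr;
        my @row_ary = $sth->fetchrow_array;
        $sth->finish;
        return @row_ary ? 1 : 0;
    };
}

# Usage: my $in_db = make_name_checker($dbh);
#        ... then inside the file loop:  $in_db->($name)
```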
Any help is much appreciated.
Re: PSQL and many queries
by tilly (Archbishop) on Mar 24, 2009 at 00:58 UTC
Anyone who wants a site to scale should learn this trick and use it, because preparing queries more often than you need to is one of the top causes of scalability problems in databases. After you've made this change, your CPU usage should be more reasonable, but the job will probably still be slow. There are several causes.

First, there is network latency. For each row you issue a request and wait for an answer, then the database waits for you to send another request. All of that waiting wastes time.

Second, if you ask for a set of records at once, the database can optimize that request far better than if you ask for one row at a time.

Third, there is disk latency. A disk spinning at 6,000 rpm turns 100 times per second. Every time you want a piece of data off disk, you wait on average half a revolution before you can read it, so you get about 200 disk seeks per second. If you need to seek to disk a million times, that takes 5,000 seconds minimum. (Databases know about these things, which is part of why bulk operations are faster.)

If the first change makes your program fast enough, then great. If not, then you've got more work to do. I have used all of the following approaches, and would do so again in the right circumstances. I've listed them in terms of how much work it will be to get from where you are to what they can do:
Now, I said pessimistic. That is because disk drives are somewhat faster than that, multi-way sorts need fewer passes than that, and properly written sorts replace multiple passes over disk with subsorts in RAM. That sort could well take 5-15 minutes (less if you have enough RAM).
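One hedged way to act on the latency point above is to batch lookups so that each network round trip answers many names at once, via an IN (...) list. The table/column names below are assumptions carried over from the question, and $dbh is an assumed connected DBI handle; only the placeholder-building helper runs without a database.

```perl
use strict;
use warnings;

# Build the "?,?,...,?" placeholder list for an IN clause of $n names.
sub in_clause {
    my ($n) = @_;
    return join ',', ('?') x $n;
}

# Sketch: look up a whole batch of names in one round trip.
# $dbh is an assumed connected DBI handle; table/column names are
# assumptions from the question.
sub lookup_batch {
    my ($dbh, @names) = @_;
    my $sql = 'select name_field from table where name_field in ('
            . in_clause(scalar @names) . ')';
    # selectcol_arrayref returns an arrayref of the first column.
    my $found = $dbh->selectcol_arrayref($sql, undef, @names);
    return map { $_ => 1 } @$found;   # names that matched
}
```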
Re: PSQL and many queries
by BrowserUk (Patriarch) on Mar 24, 2009 at 00:53 UTC
It would almost certainly be considerably quicker to query the names from the DB in bulk and build a hash in Perl, and then just process your file against the hash. A 10 million key hash should take around 1.5 GB, and looking up 2 million keys against a 10 million key hash takes around 5 seconds.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
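A minimal sketch of that hash-filter approach, with small literal arrays standing in for the bulk DB fetch and the file (in practice the file would be read line by line):

```perl
use strict;
use warnings;

# Stand-ins: @db_names would come from one bulk query, and the file
# would normally be streamed line by line rather than held in memory.
my @db_names   = qw(alice bob carol);
my @file_names = qw(bob dave alice);

# Build the lookup hash once...
my %in_db = map { $_ => 1 } @db_names;

# ...then each membership test is a single hash probe.
my @matched = grep { exists $in_db{$_} } @file_names;
print "@matched\n";   # the names present in both lists, in file order
```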
by jfroebe (Parson) on Mar 24, 2009 at 02:40 UTC
That's assuming the key/value pairs are relatively small and unique. Unless I misread the OP, we don't know how much data is really involved, or how complex it is. (Remember that names can contain 'special characters' or even a different character set.) I agree with the other posters about performing a bulk load and using appropriate indexes. The optimizer in PostgreSQL (or any modern DBMS) is better suited to these kinds of data matches, IMHO.
by BrowserUk (Patriarch) on Mar 24, 2009 at 04:55 UTC
"That's assuming the key/value pairs are relatively small and unique."

No values are required for lookup purposes, and the length of the keys makes surprisingly little difference to the size of the hash. For example, for a hash with 11,881,376 keys:

Given the OP's description of the application -- "I have a list of names in a plain text file, one per line. I also have names in a PSQL database field. I need to filter out name from the file against the database." -- I'm not sure where you think "uniqueness" comes in. And given the OP's select statement, if duplicate names exist in the DB they will be ignored, as there would be no way to determine which one was matched.

"(Remember that names can have 'special characters' or even a different character set.)"

As far as I am aware, Perl's hashes handle Unicode keys with aplomb.

"I agree with the other posters about performing a bulk load and using appropriate indexes. The optimizer in PostgreSQL (or any modern DBMS) is better suited for these types of data matches IMHO."

In the time it will take you to create the temporary table and bulk load the data, Perl will have finished. And that's long before you will have:

Don't believe me? I'll do it my way if you'll do it yours and we can compare. I'll let you write the test-set generator.

It might be a different story if, for example, the requirement was for the union of the two datasets to end up in the DB. But in that case, the simplest thing would be to just insert all the records from the file into the DB; there would be little point in individually testing whether they already existed before adding them. Modern RDBMSs are very good at the things they are designed for, but this kind of simple lookup isn't one of them.
by jfroebe (Parson) on Mar 24, 2009 at 13:25 UTC
by BrowserUk (Patriarch) on Mar 24, 2009 at 14:00 UTC
by tilly (Archbishop) on Mar 24, 2009 at 14:36 UTC
by BrowserUk (Patriarch) on Mar 24, 2009 at 15:12 UTC
When (if?) that happens, then a slower and more complex solution would be required. Even then, the simple expedient of splitting the task into two passes would double the life of the solution, and would still work out faster and simpler than moving the processing into an RDBMS.

But given the spread of the OP's stated range -- 10k to 10m -- it seems likely that 10 million is already an extreme, future-proofing upper bound. The OP (presumably) knows his problem space. For us to assume otherwise and offer more complex solutions on the basis that they might allow for some unpredicted future growth would be somewhat patronising. And given the simplicity of the solution, it's not as if he would have to throw away some huge amount of development effort if he did reach its limits at some point in the future.
Re: PSQL and many queries
by fzellinger (Acolyte) on Mar 23, 2009 at 23:40 UTC
We have to look at the resources being used when running your code against the database: CPU, MEMORY, DISK, and NETWORK. A rough model to keep in your head is that NETWORK is 1000x slower than DISK, which is 1000x slower than MEMORY, which is 10x slower than CPU. Your performance will be dictated by how much of each resource you use to solve this problem. If your solution involves:
In your situation, where is the database relative to your Perl code? If the database and your code run on the same machine, and the entire database and data file fit into MEMORY, then everything will run fast. If your data file and database are large (2M to 10M rows in your worst-case scenario), then a lot of DISK activity may be involved in getting the data up to your CPU. Having your data file and Perl code on a machine different from your database server (separated by a network) will be problematic if you need to transfer data many times (within loops).
Assuming that you have enough MEMORY where your Perl code is running, you can load the list of database records (names) into a hash and use it to check the lines of the file. This would be the best approach when the database is small (less than 10K rows) and the data file is large (greater than 1M rows).
If your database is large (greater than 1M rows) and your data file is REALLY small (10 rows), then your looping approach above will work fine. If your data file is slightly larger, but not too big (less than 50K rows), then you may want to simply load your data file into a temporary database table and use a single SQL statement to join your records against the main lookup table.

The reason this might be faster is that executing an SQL statement requires the database to parse the statement and then perhaps load statistics and indexes to find the data you want. You can save on SQL parsing time by doing prepare/bind/execute, but the database still has to perform the actual lookup of indexes and data, which may mean moving around a lot of MEMORY or, worst case, loading a lot of blocks from DISK. When running 50K worth of SQL statements, the database engine has to perform a complete lookup operation for each row. If you instead load 50K rows of data and ask the database (via one SQL statement) to match up the two name lists, the engine can optimize how it performs the lookups and streamline the entire process.

All of this advice is fairly general; which approach you use depends on which end of the spectrum your scenario falls into, and your starting criteria (10K to 10M vs 1K to 2M variability) are very broad.
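A hedged sketch of that temp-table variant using DBD::Pg's COPY support (pg_putcopydata/pg_putcopyend). The table name "names_tmp", the main table/column names, and the connected $dbh handle are all assumptions for illustration:

```perl
use strict;
use warnings;

# Sketch: bulk-load the file into a temporary table, then let one
# join do all the matching.  pg_putcopydata/pg_putcopyend are the
# DBD::Pg COPY interface; $dbh is an assumed connected handle, and
# "names_tmp"/"main_table" are made-up names.
sub match_via_temp_table {
    my ($dbh, $file) = @_;
    $dbh->do('create temp table names_tmp (name text)');
    $dbh->do('copy names_tmp (name) from stdin');
    open my $fh, '<', $file or die "open $file: $!";
    while (my $line = <$fh>) {
        $dbh->pg_putcopydata($line);   # one name per line, newline kept
    }
    $dbh->pg_putcopyend;
    # One statement instead of one statement per name.
    return $dbh->selectcol_arrayref(q{
        select t.name
        from   names_tmp t
        join   main_table m on m.name_field = t.name
    });
}
```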
by tilly (Archbishop) on Mar 24, 2009 at 01:12 UTC
For instance, if there is an index on the name, then the loop will work fine for much, much larger datasets than just 10 elements. I don't know where you get a 50K upper limit on a join from; I've joined million-row tables in the database before and been satisfied with the result. You do a lot of talking about issues with disk without recognizing the important difference between seeking to disk and accessing disk sequentially. Etc.
Re: PSQL and many queries
by roboticus (Chancellor) on Mar 24, 2009 at 00:01 UTC
I'd suggest using your database's bulk-loading tool to load the names from the file into a table, then use an SQL statement to do the heavy lifting. Bulk loading is typically the fastest way to get the data into the database; you can then add a key to the table and join the tables to do your filtering for you. Doing a record-by-record sequence can be painfully slow because of all the communications overhead.

...roboticus
Re: PSQL and many queries
by sundialsvc4 (Abbot) on Mar 24, 2009 at 02:59 UTC
Do it the COBOL way: it is possible to perform this operation (of course, without writing any new code to do the matching yourself...) by processing the two files sequentially. No searching is required. "10 million records?" No problem! (If you ever watched old sci-fi movies and wondered "what all those spinning tapes were doing"... now you know.)
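That merge pass can be sketched in Perl as follows, assuming both lists have already been sorted (e.g. with the system sort(1)); with sorted filehandles instead of arrays, memory use stays constant regardless of list size:

```perl
use strict;
use warnings;

# Walk two pre-sorted lists in lock-step: advance whichever side
# compares lower, and record a match when the heads are equal.
sub merge_matches {
    my ($left, $right) = @_;   # two sorted arrayrefs
    my @matches;
    my ($i, $j) = (0, 0);
    while ($i < @$left && $j < @$right) {
        my $cmp = $left->[$i] cmp $right->[$j];
        if    ($cmp < 0) { $i++ }
        elsif ($cmp > 0) { $j++ }
        else             { push @matches, $left->[$i]; $i++; $j++ }
    }
    return \@matches;
}

# Usage: merge_matches([qw(adams baker cook)], [qw(baker cook dunn)])
```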
Re: PSQL and many queries
by citycrew (Acolyte) on Mar 24, 2009 at 01:12 UTC
Thanks for the feedback, guys!

fzellinger: My first thoughts were to use system memory to load the data-file lines into a hash and load the DB records into an array (or vice versa) and check them against each other. My hesitation was the memory load if the lists are large and more than one user runs this script. I like the idea of dumping the data file's names into a table and querying against that! I'll look into doing both solutions: if the data file and database aren't that large, I'll load them into memory with hashes; otherwise I'll dump the file into a table and query against it.

tilly, your approaches will take me a few days to process; maybe I could write a script to help me process your solutions quicker... =o) Great food for thought.

Thanks again for the feedback!
by BrowserUk (Patriarch) on Mar 24, 2009 at 01:27 UTC
"My first thoughts were to use the system memory to load data file lines into a hash and load the DB records into an array (or visa versa) and check against each other. My hesitation on that was the memory load if the lists are large and more than one user runs this script"

There's no need to load both datasets into memory. If you load the names from the DB into a hash (which DBI will do for you in one hit: select names from table; fetchall_hashref), then you can process your file line by line.
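A sketch of that one-hit load; selectcol_arrayref is the simplest DBI call when only one column is wanted. The $dbh handle, the table/column names, and the file name are assumptions:

```perl
use strict;
use warnings;

# Sketch: fetch the whole name column once, hash it, then stream the
# file line by line.  $dbh is an assumed connected DBI handle;
# "table"/"name_field" are assumptions carried over from the thread.
sub filter_file_against_db {
    my ($dbh, $file) = @_;
    my %in_db = map { $_ => 1 }
        @{ $dbh->selectcol_arrayref('select name_field from table') };
    open my $fh, '<', $file or die "open $file: $!";
    while (my $name = <$fh>) {
        chomp $name;
        print "$name\n" if $in_db{$name};   # name exists in the DB
    }
}
```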
by citycrew (Acolyte) on Mar 24, 2009 at 02:15 UTC
"There's no need to load all of both datasets into memory. If you load the names from the DB into a hash (which DBI will do for you in one hit--select names from table; fetchall_hashref), then you can process your file line by line."

Of course!.. Thanks, BrowserUk