Comparing an array to a DBI table

by oakbox (Chaplain)
on Jul 18, 2001 at 17:02 UTC ( [id://97631] )

oakbox has asked for the wisdom of the Perl Monks concerning the following question:

I have a list of userid's in an array. I have a DBI table. I want to find out which ID's in my array exist in my table.

I have about three work-arounds for this problem, but all of them are time consuming and a big drain on resources. I was wondering if any of you have come up with an easy way to compare the contents of a list (array) against a table (Perl:DBI:MySQL).

At the risk of looking like an idiot, here is one of my previous work-arounds:

I have a list of userid's (@compare_array) that I want to compare to my subscription table.

$primer = $dbh->prepare("SELECT count(*) FROM subscription WHERE userid = ?");
foreach $userid (@compare_array) {
    $primer->execute($userid);
    if ( $primer->err() ) { $message .= $primer->errstr(); &shellout; exit; }
    ($count) = $primer->fetchrow_array();
    if ( $count > 0 ) {
        # there is a match
    }
    else {
        # there was no match
    }
}

The above system WORKS, and works pretty well for tables of <1000 records. But when I'm dealing with 10,000 or 50,000 records, this search becomes an incredible drain.
Is there a way to perform this operation without running 25,000 SELECTs when @compare_array has that many ID's?

-oakbox.com

"If what I'm saying doesn't make sense, that's because sense cannot be made, it's something that must be sensed"
-J.S. Hall

Replies are listed 'Best First'.
Re: Comparing an array to a DBI table
by Masem (Monsignor) on Jul 18, 2001 at 17:11 UTC
    It's MUCH easier to grab the IDs from the DBI first, and whittle down from there. Note that to save time, everything's going into a hash instead of an array...

    my $sth = $dbh->prepare("SELECT userid FROM subscription") or die DBI: +:errstr; $sth->execute() or die DBI::errstr; my %db; while( my ( $id ) = $sth->fetchrow_array() ) { $db{ $id } = 1; } my @in_db = grep { defined( $db{ $_ } ) } @compare_array;

    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Comparing an array to a DBI table
by TheoPetersen (Priest) on Jul 18, 2001 at 17:16 UTC
    Matching 50,000 anything against a database is going to be a pain. I would do one select though, draw the whole table into a hash of ids, then compare your array to it. This trades off memory for disk and CPU time; for this much data it is probably worth it, but not a good idea when the table contains millions of records.

    Any faster scheme would depend on properties of the table; for instance, if the user ids are integers and have longish sequences of values with no gaps, then you could store the start and end of sequences and so on.
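    Roughly how that range idea could look (a sketch only; in_table is a made-up helper, and it assumes the ids are integers fetched in ascending order):

    my @ranges;    # each element is [start, end] of a gap-free run of ids
    my $sth = $dbh->prepare('SELECT userid FROM subscription ORDER BY userid');
    $sth->execute();
    while ( my ($id) = $sth->fetchrow_array() ) {
        if ( @ranges && $id == $ranges[-1][1] + 1 ) {
            $ranges[-1][1] = $id;           # extend the current run
        }
        else {
            push @ranges, [ $id, $id ];     # start a new run
        }
    }

    sub in_table {                          # hypothetical helper
        my ($id) = @_;
        for my $r (@ranges) {
            return 1 if $id >= $r->[0] && $id <= $r->[1];
        }
        return 0;
    }

    my @in_db = grep { in_table($_) } @compare_array;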

Re: Comparing an array to a DBI table
by chipmunk (Parson) on Jul 18, 2001 at 17:39 UTC
    A similar question recently came up on the dbi-users mailing list. Check out the discussion in the archives, starting at Checking for the existence of a certain row.

    Quick summary: EXISTS is faster than COUNT(*) if the value exists in the column, especially if the column is indexed, but if the value doesn't exist in the column the query's going to be slow no matter what.
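    A minimal sketch of the EXISTS form (assuming your database accepts a scalar EXISTS subquery; very old MySQL versions without subquery support won't):

    my $sth = $dbh->prepare(
        'SELECT EXISTS (SELECT 1 FROM subscription WHERE userid = ?)'
    );
    foreach my $userid (@compare_array) {
        $sth->execute($userid);
        my ($found) = $sth->fetchrow_array;
        if ($found) {
            # there is a match
        }
        else {
            # there was no match
        }
    }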

Re: Comparing an array to a DBI table
by merlyn (Sage) on Jul 18, 2001 at 19:39 UTC
    My principle is to always let the database do what it does well.
    my $count_sth = $dbh->prepare(
        'SELECT count(*) FROM subscription WHERE userid IN ('
        . join( ',', ('?') x @compare_array ) . ')'
    );
    $count_sth->execute(@compare_array);
    my $count = $count_sth->fetchrow_array;

    -- Randal L. Schwartz, Perl hacker

      Erk.
      The thought of joining 50,000 elements into an IN clause makes me shudder, possibly wrongly.
      • I was always under the impression there was a limit to the number of items IN would allow; I'm pretty sure there was in DB2 V5, and though I haven't yet found the Oracle man page to back this up, I've avoided large IN's ever since.
      • Not to mention the efficiency and the possible size of the SQL STATEMENT ......

      NB I'm going away to do my homework on IN now...and validate / invalidate my long held assumptions...

        Sure, there's a limit--for Oracle, I think it's 254. The value is system-dependent. (Shocked! I'm shocked!)

        So make a compromise method: Take the number of items IN will allow on your RDBMS, throw those into the statement, grab back the data, and loop till you've exhausted the items.

        I'd expect this to be more efficient than doing a select on every item--however, I'd study some explains and do some benchmarking to get a better answer. Still, it's a valid approach to the problem.
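        A sketch of that compromise (the 250 is a placeholder figure; use whatever your RDBMS actually allows in an IN list):

        my $chunk_size = 250;
        my %in_db;
        my @work = @compare_array;
        while (@work) {
            my @chunk = splice( @work, 0, $chunk_size );
            my $sql = 'SELECT userid FROM subscription WHERE userid IN ('
                    . join( ',', ('?') x @chunk ) . ')';
            my $sth = $dbh->prepare($sql);
            $sth->execute(@chunk);
            while ( my ($id) = $sth->fetchrow_array ) {
                $in_db{$id} = 1;
            }
        }
        my @matches = grep { $in_db{$_} } @compare_array;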

        adamsj

        They laughed at Joan of Arc, but she went right ahead and built it. --Gracie Allen

Re: Comparing an array to a DBI table
by mattr (Curate) on Jul 18, 2001 at 18:00 UTC
    You might be able to minimize memory requirements for the hash strategy by sorting your file in advance and dividing it into chunks; your select can then read only the necessary range. If you really have memory problems perhaps an iterative method might be able to use LIMIT to pull a given number of records and find the range from what you pulled out of the DB.
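    One rough way that could look (the LIMIT offset,count syntax here is MySQL's, and the chunk size of 10_000 is arbitrary):

    my @sorted = sort { $a <=> $b } @compare_array;
    my %in_db;
    my $chunk  = 10_000;
    my $offset = 0;
    while (1) {
        my $sth = $dbh->prepare(
            "SELECT userid FROM subscription ORDER BY userid LIMIT $offset, $chunk"
        );
        $sth->execute();
        my %seen;    # only the current chunk's ids are held in memory
        my $rows = 0;
        while ( my ($id) = $sth->fetchrow_array ) {
            $seen{$id} = 1;
            $rows++;
        }
        last unless $rows;
        $in_db{$_} = 1 for grep { $seen{$_} } @sorted;
        $offset += $chunk;
    }
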
Re: Comparing an array to a DBI table
by oakbox (Chaplain) on Jul 18, 2001 at 18:17 UTC
    Thank you for the suggestions.

    I don't know why it didn't occur to me to approach this from the other side and read the list of userid's into one big hash and compare THAT to my array. (thank you Masem for using the satori stick) I can see that this would be a big improvement.

    Luckily, most of my tables are in the 20k to 50k records range. If I ever need to deal with larger tables (500k records), I can see that dealing with it in chunks would be worthwhile.

    I appreciate everyone's help and wisdom
    - oakbox

      If there are more records than memory will allow, then you might consider reading all the user_ids into a DB_File tied hash. If your list of user_ids to check is not large, then use the 'IN' clause as suggested by merlyn. If your table is very large (or you don't want to use DB_File) AND your list is large, then don't use count(*), but something like this (BTW, I hope you have an index on user_id!):
      my $sql = 'select 1 from subscription where user_id = ?';
      my $sth = $dbh->prepare($sql);
      foreach my $userid ... {
          $sth->execute($userid);
          my ($exists) = $sth->fetchrow_array;
          $sth->finish;
          if ($exists) {
              ...
          }
      }
      I also notice you're handling your own errors and probably not using RaiseError on the connect. That's ok, but then you should also check for errors on the prepare and the execute.
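      And a sketch of the DB_File tied hash idea mentioned above (the filename is made up):

      use DB_File;
      use Fcntl;

      my %db;
      tie %db, 'DB_File', '/tmp/userids.db', O_RDWR | O_CREAT, 0644, $DB_HASH
          or die "Cannot tie: $!";

      my $sth = $dbh->prepare('SELECT user_id FROM subscription');
      $sth->execute();
      while ( my ($id) = $sth->fetchrow_array ) {
          $db{$id} = 1;    # stored on disk, not in RAM
      }

      my @in_db = grep { $db{$_} } @compare_array;
      untie %db;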
