PerlMonks
•Re: Efficient search through a huge dataset

by merlyn (Sage)
on Oct 19, 2004 at 23:58 UTC ( [id://400710] )


in reply to Efficient search through a huge dataset

Sounds like a great job for DBD::SQLite.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
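A minimal sketch of what that suggestion might look like in practice (table, column, and key names here are invented, and an in-memory database stands in for a real file on disk): load the first file's keys into an indexed SQLite table once, then turn each record of the second file into a cheap indexed lookup.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical illustration: ":memory:" keeps the example self-contained;
# a real run would use dbi:SQLite:dbname=somefile instead.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE keys1 (k TEXT PRIMARY KEY)');

# Bulk-load the first file's keys inside one transaction.
my $ins = $dbh->prepare('INSERT INTO keys1 (k) VALUES (?)');
$dbh->begin_work;
$ins->execute($_) for qw(alpha beta gamma);
$dbh->commit;

# Each record of the second file becomes one indexed lookup.
my $sth = $dbh->prepare('SELECT 1 FROM keys1 WHERE k = ?');
my %seen;
for my $probe (qw(beta delta)) {
    $sth->execute($probe);
    $seen{$probe} = $sth->fetchrow_arrayref ? 1 : 0;
    $sth->finish;
}
print "$_: ", ($seen{$_} ? 'found' : 'missing'), "\n" for sort keys %seen;
$dbh->disconnect;
```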


Replies are listed 'Best First'.
Re^2: Efficient search through a huge dataset
by johnnywang (Priest) on Oct 20, 2004 at 01:03 UTC
    I guess you meant to store the first file as database records in a table, and then, for each record in the second file, do a select against that table? That would mean millions of selects; could a Perl batch approach be faster? Thanks.

      Millions of selects? Not at all!

      You can do it with a single query. Say you want to find all records that are in table1 but not in table2; then you can use either:

      select blah from table1 t1 where not exists (select 1 from table2 t2 where t1.blah = t2.blah)

      Or (if MINUS is supported):

      select blah from table1 minus select blah from table2

      It should be easy for you to modify the query a bit to fit your needs.
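The two forms can be tried against a toy pair of tables. One caveat worth flagging: SQLite itself spells the set-difference operator EXCEPT rather than MINUS (the MINUS keyword is an Oracle-ism). Table and column names below are invented for illustration:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE table1 (blah INTEGER)');
$dbh->do('CREATE TABLE table2 (blah INTEGER)');
$dbh->do("INSERT INTO table1 (blah) VALUES ($_)") for 1 .. 5;
$dbh->do("INSERT INTO table2 (blah) VALUES ($_)") for 2, 4;

# Correlated NOT EXISTS form:
my $not_exists = $dbh->selectcol_arrayref(
    'SELECT blah FROM table1 t1
     WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t1.blah = t2.blah)');

# Set-difference form (EXCEPT is SQLite's spelling of MINUS):
my $except = $dbh->selectcol_arrayref(
    'SELECT blah FROM table1 EXCEPT SELECT blah FROM table2');

print "@$not_exists\n";   # rows in table1 but not table2
print "@$except\n";       # same rows
$dbh->disconnect;
```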

        Faster and better would be an outer join:
        SELECT t1.blah FROM t1 LEFT OUTER JOIN t2 ON (t1.blah = t2.blah) WHERE t1.blah IS NOT NULL AND t2.blah IS NULL

        It's the correct way to rewrite the NOT EXISTS (about 100x faster in my tests) and is faster than the MINUS because it doesn't have to pull both result sets into memory and then do an expensive ARRAYCOMPARE on them.

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
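The outer-join ("anti-join") form can be run against the same kind of toy data. This is only a sketch with invented names; note that the t1.blah IS NOT NULL test only changes the result if table1 itself can contain NULLs:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE t1 (blah INTEGER)');
$dbh->do('CREATE TABLE t2 (blah INTEGER)');
$dbh->do("INSERT INTO t1 (blah) VALUES ($_)") for 1 .. 5;
$dbh->do("INSERT INTO t2 (blah) VALUES ($_)") for 2, 4;

# Anti-join: keep the t1 rows whose outer-joined t2 side came back NULL,
# i.e. the rows with no match in t2.
my $rows = $dbh->selectcol_arrayref(
    'SELECT t1.blah FROM t1 LEFT OUTER JOIN t2 ON (t1.blah = t2.blah)
     WHERE t1.blah IS NOT NULL AND t2.blah IS NULL');

print "@$rows\n";   # the t1-only rows
$dbh->disconnect;
```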

Re^2: Efficient search through a huge dataset
by pg (Canon) on Oct 20, 2004 at 04:44 UTC

    DBD::SQLite might be the answer in this particular case, but it is slow when you insert into a table with an index. (Not sure what happens without the index; generally speaking, insert is slower with an index, while select is faster. But that's OT.)

    I compared DBD::SQLite with ODBC, using the same table structure and index: insert 1000 rows, then select. It took ODBC 2 seconds to insert 1000 rows and 0 seconds (meaning less than 1) to select; it took SQLite 310 seconds to insert (way too big) and 0 seconds to select (virtually the same):

use DBI;
use Data::Dumper;
use strict;
use warnings;

my $dbh = DBI->connect("dbi:SQLite:dbname=dbfile","","");
#my $dbh = DBI->connect("dbi:ODBC:everything","","");

=document
#$dbh->do('create table table1(col1 number(10), col2 number(10))');
$dbh->do('create table table1(col1 int, col2 int)');
$dbh->do('create index index1 on table1(col1)');
#$dbh->do('create table table2(col1 number(10), col2 number(10))');
$dbh->do('create table table2(col1 int, col2 int)');
$dbh->do('create index index2 on table2(col1)');
=cut

$dbh->do('delete from table1');
$dbh->do('delete from table2');

my $st1 = $dbh->prepare('insert into table1(col1, col2) values(?, ?)');
my $st2 = $dbh->prepare('insert into table2(col1, col2) values(?, ?)');

print time, "\n";
for my $i (1..1000) {
    $st1->execute($i, $i * 2);
    if ($i % 2) {
        $st2->execute($i, $i * 3);
    }
}
print time, "\n";

{
    my $st3 = $dbh->prepare('SELECT t1.col1 FROM table1 t1 LEFT OUTER JOIN table2 t2 ON (t1.col1 = t2.col1) WHERE t1.col1 IS NOT NULL AND t2.col1 IS NULL');
    $st3->execute();
    my $ref3 = $st3->fetchall_arrayref;
    #print Dumper($ref3);
}
print time, "\n";

$dbh->disconnect();
      it took SQLite 310 seconds to insert

      The SQLite docs say what you should do to improve speed: group statements into transactions and, if necessary, adjust the "synchronous" pragma.

      If you add $dbh->do('begin'); before your loop and $dbh->do('commit'); at the end, the insertion will take less than one second, as it should.

use Time::HiRes qw(gettimeofday tv_interval);

my $start_time = [gettimeofday];
$dbh->do('begin');
for my $i (1..1000) {
    $st1->execute($i, $i * 2);
    if ($i % 2) {
        $st2->execute($i, $i * 3);
    }
}
$dbh->do('commit');
my $elapsed = tv_interval($start_time, [gettimeofday]);
print "insertion time: $elapsed\n";

__END__
insertion time: 0.263462
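For completeness, here is a self-contained sketch combining the two knobs mentioned above, transaction batching and the synchronous pragma. Database and table names are invented; begin_work/commit is DBI's portable spelling of the explicit begin/commit pair:

```perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE table1 (col1 INTEGER, col2 INTEGER)');
$dbh->do('CREATE INDEX index1 ON table1 (col1)');

# Optional: trade durability for speed by fsyncing less aggressively.
$dbh->do('PRAGMA synchronous = OFF');

my $st = $dbh->prepare('INSERT INTO table1 (col1, col2) VALUES (?, ?)');

my $t0 = [gettimeofday];
$dbh->begin_work;                       # one transaction for the batch...
$st->execute($_, $_ * 2) for 1 .. 1000;
$dbh->commit;                           # ...so one sync instead of 1000
printf "inserted 1000 rows in %.3fs\n", tv_interval($t0);

my ($n) = $dbh->selectrow_array('SELECT count(*) FROM table1');
print "rows: $n\n";
$dbh->disconnect;
```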
