PerlMonks
•Re: Efficient search through a huge dataset

by merlyn (Sage)
on Oct 19, 2004 at 23:58 UTC ( [id://400710] )


in reply to Efficient search through a huge dataset

Sounds like a great job for DBD::SQLite.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
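A minimal sketch of what that suggestion might look like in practice (table, column, and key names here are invented, and an in-memory database stands in for a real file on disk): load the first file's keys into an indexed SQLite table once, then turn each record of the second file into a cheap indexed lookup.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical illustration: ":memory:" keeps the example self-contained;
# a real run would use dbi:SQLite:dbname=somefile instead.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE keys1 (k TEXT PRIMARY KEY)');

# Bulk-load the first file's keys inside one transaction.
my $ins = $dbh->prepare('INSERT INTO keys1 (k) VALUES (?)');
$dbh->begin_work;
$ins->execute($_) for qw(alpha beta gamma);
$dbh->commit;

# Each record of the second file becomes one indexed lookup.
my $sth = $dbh->prepare('SELECT 1 FROM keys1 WHERE k = ?');
my %seen;
for my $probe (qw(beta delta)) {
    $sth->execute($probe);
    $seen{$probe} = $sth->fetchrow_arrayref ? 1 : 0;
    $sth->finish;
}
print "$_: ", ($seen{$_} ? 'found' : 'missing'), "\n" for sort keys %seen;
$dbh->disconnect;
```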


Replies are listed 'Best First'.
Re^2: Efficient search through a huge dataset
by johnnywang (Priest) on Oct 20, 2004 at 01:03 UTC
    I guess you meant to store the first file as database records in a table, and then, for each record in the second file, do a select against that table? That would mean millions of selects; could a Perl batch approach be faster? Thanks.

      Millions of selects? Not at all!

      You can do it with a single query. Say you want to find all records that are in table1 but not in table2; then you can use either:

      select blah from table1 t1 where not exists (select 1 from table2 t2 where t1.blah = t2.blah)

      Or (if MINUS is supported):

      select blah from table1 minus select blah from table2

      It should be easy for you to modify the query a bit to fit your needs.
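The two forms can be tried against a toy pair of tables. One caveat worth flagging: SQLite itself spells the set-difference operator EXCEPT rather than MINUS (the MINUS keyword is an Oracle-ism). Table and column names below are invented for illustration:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE table1 (blah INTEGER)');
$dbh->do('CREATE TABLE table2 (blah INTEGER)');
$dbh->do("INSERT INTO table1 (blah) VALUES ($_)") for 1 .. 5;
$dbh->do("INSERT INTO table2 (blah) VALUES ($_)") for 2, 4;

# Correlated NOT EXISTS form:
my $not_exists = $dbh->selectcol_arrayref(
    'SELECT blah FROM table1 t1
     WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t1.blah = t2.blah)');

# Set-difference form (EXCEPT is SQLite's spelling of MINUS):
my $except = $dbh->selectcol_arrayref(
    'SELECT blah FROM table1 EXCEPT SELECT blah FROM table2');

print "@$not_exists\n";   # rows in table1 but not table2
print "@$except\n";       # same rows
$dbh->disconnect;
```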

        Faster and better would be an outer join:
        SELECT t1.blah FROM t1 LEFT OUTER JOIN t2 ON (t1.blah = t2.blah) WHERE t1.blah IS NOT NULL AND t2.blah IS NULL

        It's the correct way to rewrite the NOT EXISTS (about 100x faster in my tests) and is faster than the MINUS because it doesn't have to pull both result sets into memory and then do an expensive ARRAYCOMPARE on them.

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
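The outer-join ("anti-join") form can be run against the same kind of toy data. This is only a sketch with invented names; note that the t1.blah IS NOT NULL test only changes the result if table1 itself can contain NULLs:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE t1 (blah INTEGER)');
$dbh->do('CREATE TABLE t2 (blah INTEGER)');
$dbh->do("INSERT INTO t1 (blah) VALUES ($_)") for 1 .. 5;
$dbh->do("INSERT INTO t2 (blah) VALUES ($_)") for 2, 4;

# Anti-join: keep the t1 rows whose outer-joined t2 side came back NULL,
# i.e. the rows with no match in t2.
my $rows = $dbh->selectcol_arrayref(
    'SELECT t1.blah FROM t1 LEFT OUTER JOIN t2 ON (t1.blah = t2.blah)
     WHERE t1.blah IS NOT NULL AND t2.blah IS NULL');

print "@$rows\n";   # the t1-only rows
$dbh->disconnect;
```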

Re^2: Efficient search through a huge dataset
by pg (Canon) on Oct 20, 2004 at 04:44 UTC

    DBD::SQLite might be the answer in this particular case, but it is slow when you insert into a table with an index. (Not sure what happens without the index; generally speaking, insert is slower with an index, while select is faster. But that's OT.)

    I compared DBD::SQLite with ODBC, using the same table structure and index: insert 1000 rows, then select. It took ODBC 2 seconds to insert 1000 rows and 0 seconds (meaning less than 1) to select; it took SQLite 310 seconds to insert (way too big) and 0 seconds to select (virtually the same):

use DBI;
use Data::Dumper;
use strict;
use warnings;

my $dbh = DBI->connect("dbi:SQLite:dbname=dbfile","","");
#my $dbh = DBI->connect("dbi:ODBC:everything","","");

=document
#$dbh->do('create table table1(col1 number(10), col2 number(10))');
$dbh->do('create table table1(col1 int, col2 int)');
$dbh->do('create index index1 on table1(col1)');
#$dbh->do('create table table2(col1 number(10), col2 number(10))');
$dbh->do('create table table2(col1 int, col2 int)');
$dbh->do('create index index2 on table2(col1)');
=cut

$dbh->do('delete from table1');
$dbh->do('delete from table2');

my $st1 = $dbh->prepare('insert into table1(col1, col2) values(?, ?)');
my $st2 = $dbh->prepare('insert into table2(col1, col2) values(?, ?)');

print time, "\n";
for my $i (1..1000) {
    $st1->execute($i, $i * 2);
    if ($i % 2) {
        $st2->execute($i, $i * 3);
    }
}
print time, "\n";

{
    my $st3 = $dbh->prepare('SELECT t1.col1 FROM table1 t1 LEFT OUTER JOIN table2 t2 ON (t1.col1 = t2.col1) WHERE t1.col1 IS NOT NULL AND t2.col1 IS NULL');
    $st3->execute();
    my $ref3 = $st3->fetchall_arrayref;
    #print Dumper($ref3);
}
print time, "\n";

$dbh->disconnect();
      it took SQLite 310 seconds to insert

      The SQLite docs say what you should do to improve speed: group statements into transactions and, if necessary, adjust the "synchronous" pragma.

      If you add $dbh->do('begin'); before your loop and $dbh->do('commit'); at the end, the insertion will take less than one second, as it should.

use Time::HiRes qw(gettimeofday tv_interval);

my $start_time = [gettimeofday];
$dbh->do('begin');
for my $i (1..1000) {
    $st1->execute($i, $i * 2);
    if ($i % 2) {
        $st2->execute($i, $i * 3);
    }
}
$dbh->do('commit');
my $elapsed = tv_interval($start_time, [gettimeofday]);
print "insertion time: $elapsed\n";

__END__
insertion time: 0.263462
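For completeness, here is a self-contained sketch combining the two knobs mentioned above, transaction batching and the synchronous pragma. Database and table names are invented; begin_work/commit is DBI's portable spelling of the explicit begin/commit pair:

```perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE table1 (col1 INTEGER, col2 INTEGER)');
$dbh->do('CREATE INDEX index1 ON table1 (col1)');

# Optional: trade durability for speed by fsyncing less aggressively.
$dbh->do('PRAGMA synchronous = OFF');

my $st = $dbh->prepare('INSERT INTO table1 (col1, col2) VALUES (?, ?)');

my $t0 = [gettimeofday];
$dbh->begin_work;                       # one transaction for the batch...
$st->execute($_, $_ * 2) for 1 .. 1000;
$dbh->commit;                           # ...so one sync instead of 1000
printf "inserted 1000 rows in %.3fs\n", tv_interval($t0);

my ($n) = $dbh->selectrow_array('SELECT count(*) FROM table1');
print "rows: $n\n";
$dbh->disconnect;
```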
