That's actually the same behaviour as other DBs have.
But only now do I see the initial thread/problem (Scaling Hash Limits). (It's useful to link to the original thread in follow-up posts, you know.) With the relatively small sizes involved, a database doesn't seem necessary.
If the problem is that simple, can't you just run
sort -u dupslist > no_dupslist
on your id list? Perhaps not very interesting or fast (it took about 7 minutes in a 100M-row test run here), but about as simple as it gets.
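If plain sort turns out too slow on larger inputs, GNU sort can be given more memory and more cores; the flags below are standard GNU coreutils options, but the buffer size and core count are just guesses for a typical desktop:

sort -u --parallel=4 -S 4G dupslist > no_dupslist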
(BTW, just another data point (as I did the test already): PostgreSQL (9.4devel) loads about 9,000 rows/s on a slowish, low-end desktop with the laborious INSERT method that your script uses; bulk loading with COPY manages close to 2 million rows/second (excluding any de-duplication):
# generate 50 million zero-padded ids:
perl -e 'for (1..50_000_000){
    printf "%012d\n", $_;
}' > /tmp/t_data.txt

echo "
drop table if exists t;
create unlogged table t (klout integer);
" | psql

# server-side COPY (the file is read by the server process, so it needs a path the server can see):
# echo "copy t from '/tmp/t_data.txt';" | psql

# client-side COPY via stdin, timed:
time psql -c 'copy t from stdin' < /tmp/t_data.txt
real 0m25.661s
That's 50 million rows in ~25.7 seconds, i.e. a rate of just under 2 million rows per second
)
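If the de-duplication itself should also happen inside the database, one possible follow-up (just a sketch, not part of the timing above, reusing the table and column names from the test) is a DISTINCT into a new table:

echo "create unlogged table t_unique as select distinct klout from t;" | psql

That pushes the whole sort/hash-aggregate into PostgreSQL, so its runtime will again depend heavily on work_mem and the hardware.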
UPDATE: added 'unlogged' to the table definition and adjusted the timings; it makes the load about twice as fast.