in reply to DBI::SQLite slowness
Assumption: the data you are de-duping is downloaded fresh daily from TwitFace.
Loading 180 million records into an on-disk database just to de-dup them is ridiculous if you are in any way concerned with speed.
The following shows a 10-line Perl script de-duping a 200-million-line, 2.8 GB file of 12-digit numbers in a little over 3 1/2 minutes, using less than 30 MB of RAM to do so:
C:\test>dir 1054929.dat
20/09/2013  04:22     2,800,000,000 1054929.dat

C:\test>wc -l 1054929.dat
200000000 1054929.dat

C:\test>head 1054929.dat
100112321443
100135127486
100110839892
100098464584
100098900542
100048844759
100090430059
100018238859
100132791659
100027638968

C:\test>1054929 1054929.dat | wc -l
1379647642.87527
1379647855.6311
113526721
That's processing the 12-digit numbers at a rate of just under 1 million per second.
You could not even load the data into the DB at 1/100th of that rate, never mind get the de-duped data back out.
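How can 113+ million unique values be tracked in under 30 MB? One plausible approach (a sketch, not necessarily the original 10-liner) exploits the fact that the sample values all sit in a narrow band starting around 100,000,000,000: a bit vector indexed by the value's offset from the bottom of the band. The function name `dedup_in_band` and the `$min` bound are assumptions for illustration; a band ~150 million wide needs only ~19 MB of bits.

```perl
use strict;
use warnings;

# dedup_in_band: emit each number the first time it is seen, using a Perl
# string as a bit vector via vec(). Assumes every value lies at a small,
# non-negative offset from $min, so the bit vector stays compact.
sub dedup_in_band {
    my ( $min, @nums ) = @_;
    my $bits = '';                      # grows on demand as bits are set
    my @out;
    for my $n (@nums) {
        my $off = $n - $min;            # offset into the bit vector
        next if vec( $bits, $off, 1 );  # bit already set: duplicate, skip
        vec( $bits, $off, 1 ) = 1;      # mark as seen
        push @out, $n;                  # keep first occurrence
    }
    return @out;
}
```

Streamed line-by-line from a file (`while (<>) { ... }`), memory stays pinned at the size of the bit vector regardless of input length, which is consistent with the sub-30 MB figure above.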
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Replies are listed 'Best First'.
Re^2: DBI::SQLite slowness by McA (Priest) on Sep 20, 2013 at 04:01 UTC
by Anonymous Monk on Sep 20, 2013 at 04:59 UTC
by Anonymous Monk on Sep 20, 2013 at 13:10 UTC
by BrowserUk (Patriarch) on Sep 20, 2013 at 05:17 UTC
In Section: Seekers of Perl Wisdom