Re: DBI::SQLite slowness

by BrowserUk (Pope)
on Sep 20, 2013 at 03:43 UTC (#1054946)


in reply to DBI::SQLite slowness

Assumption: the data you are de-duping is downloaded fresh each day from TwitFace.

The idea of loading 180 million records into a db on disk in order to de-dup it is ridiculous if you are in any way concerned with speed.

The following shows a 10-line Perl script de-duping a 200-million-line, 2.8 GB file of 12-digit numbers in a little over 3 1/2 minutes (the two timestamps in the output are about 213 seconds apart), using less than 30 MB of RAM to do so:

C:\test>dir 1054929.dat
20/09/2013  04:22     2,800,000,000 1054929.dat

C:\test>wc -l 1054929.dat
200000000 1054929.dat

C:\test>head 1054929.dat
100112321443
100135127486
100110839892
100098464584
100098900542
100048844759
100090430059
100018238859
100132791659
100027638968

C:\test>1054929 1054929.dat | wc -l
1379647642.87527
1379647855.6311
113526721

That's processing the 12-digit numbers at a rate of just under 1 million per second.

You cannot even load the data into the DB at 1/100th of that rate; never mind get the de-duped data back out.
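The script itself is not shown in this post. As a rough sketch only, and not the script that produced the numbers above, one way to de-dup numeric IDs in very little memory is a bit string managed with Perl's built-in vec: one bit per possible ID, set the first time the ID is seen. The base value $BASE and the helper dedup_ids are hypothetical names chosen for illustration; the approach assumes every ID lies in a fairly narrow band above the base, because memory use is one bit per value in that band (about 240 million values fit in 30 MB).

```perl
use strict;
use warnings;

# Assumed lower bound for the IDs (hypothetical; pick one that fits your data).
my $BASE = 100_000_000_000;

# Return the IDs in first-seen order with duplicates dropped.
sub dedup_ids {
    my ($base, @ids) = @_;
    my $seen = '';                        # bit string, grows as needed
    my @unique;
    for my $id (@ids) {
        my $bit = $id - $base;            # one bit per possible ID
        die "ID $id is below base $base\n" if $bit < 0;
        next if vec($seen, $bit, 1);      # bit already set: duplicate
        vec($seen, $bit, 1) = 1;          # mark the ID as seen
        push @unique, $id;
    }
    return @unique;
}

# Small in-memory demo using values shaped like the sample data.
my @ids    = qw(100112321443 100135127486 100112321443 100110839892 100135127486);
my @unique = dedup_ids($BASE, @ids);
print "$_\n" for @unique;
```

For a real file the same test-and-set would sit inside a line-by-line read loop, printing each line whose bit was not yet set, so that only the bit string is ever held in memory.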


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: DBI::SQLite slowness
by McA (Deacon) on Sep 20, 2013 at 04:01 UTC

    Hi BrowserUk,

    Probably it's too early in the morning, but I can't see the script doing the dedup. There seems to be a magic program/script called 1054929. What am I missing?

    Best regards
    McA

        Following up on Re: vec overflow? an adjustable example (mini vec tutorial)

        The @ in the output it produces stands for "at"; it is not an actual array :) For example, 0@[ 0][ 0] means the number zero is stored in the first (zero-th) seen_vecs string, at the first (0-th) bit of that string. Neat how that works: ID zero lands on the zero-th bit, so the ID itself is the bit offset :)

        This can help with the vec syntax :) Bit::Vector::Minimal - Object-oriented vec wrapper
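To make the id@[index][offset] notation concrete, here is a minimal sketch of how an ID could be mapped onto an array of vec bit strings. It is not the code from the node mentioned above: the chunk size $BITS_PER_VEC and the helpers locate and seen_before are hypothetical, and only the name seen_vecs is taken from the description.

```perl
use strict;
use warnings;

# Hypothetical chunk size: how many bits each string in @seen_vecs holds.
my $BITS_PER_VEC = 8 * 1024 * 1024;

my @seen_vecs;    # one bit string per chunk, created on demand

# Map an ID to the string that holds it and the bit offset inside that string.
sub locate {
    my ($id) = @_;
    my $index  = int($id / $BITS_PER_VEC);
    my $offset = $id % $BITS_PER_VEC;
    return ($index, $offset);
}

# Mark an ID as seen; return 1 if it had been seen before, 0 otherwise.
sub seen_before {
    my ($id) = @_;
    my ($index, $offset) = locate($id);
    $seen_vecs[$index] = '' unless defined $seen_vecs[$index];
    my $was_set = vec($seen_vecs[$index], $offset, 1);
    vec($seen_vecs[$index], $offset, 1) = 1;
    return $was_set;
}

for my $id (0, 1, $BITS_PER_VEC, 0) {
    my ($index, $offset) = locate($id);
    my $status = seen_before($id) ? 'duplicate' : 'new';
    # Single-quoted format so the @ is not treated as array interpolation.
    printf '%d@[%d][%d] %s' . "\n", $id, $index, $offset, $status;
}
# Prints:
# 0@[0][0] new
# 1@[0][1] new
# 8388608@[1][0] new
# 0@[0][0] duplicate
```

Splitting the bits across several strings keeps every individual offset small, and strings for ranges that never occur are never allocated.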

      What am I missing?

      Nothing. I didn't post the actual script; just demonstrated that this wasn't an idle boast.


