http://www.perlmonks.org?node_id=1054946


in reply to DBI::SQLite slowness

Assumption: this data you are de-duping is downloaded fresh, daily from TwitFace.

The idea of loading 180 million records into a db on disk in order to de-dup it is ridiculous if you are in any way concerned with speed.

The following shows a 10-line Perl script de-duping a 200-million-line, 2.8 GB file of 12-digit numbers in a little over 3 1/2 minutes (the two timestamps in the output are about 213 seconds apart), using less than 30 MB of RAM to do so:

C:\test>dir 1054929.dat
20/09/2013 04:22 2,800,000,000 1054929.dat

C:\test>wc -l 1054929.dat
200000000 1054929.dat

C:\test>head 1054929.dat
100112321443
100135127486
100110839892
100098464584
100098900542
100048844759
100090430059
100018238859
100132791659
100027638968

C:\test>1054929 1054929.dat | wc -l
1379647642.87527
1379647855.6311
113526721

That's processing the 12-digit numbers at a rate of just under 1 million per second.

You cannot even load the data into the DB at 1/100th of that rate; never mind get the de-duped data back out.
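
For readers wondering how a script that small could manage this: the following is only a guess at the general approach, not the script that produced the numbers above. It assumes the values are non-negative integers clustered just above a known lower bound, and keeps one bit per possible value in a single string via vec. The sub name dedup_numbers and the default base of 100_000_000_000 are illustrative assumptions, not anything taken from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): copy each distinct number
# from $in to $out, keeping only its first occurrence. $base is an assumed
# lower bound on the values; one bit per value above it is kept in $seen.
sub dedup_numbers {
    my ($in, $out, $base) = @_;
    my $seen = '';    # string used as a bit vector; vec() extends it on demand
    my $kept = 0;
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;              # strip LF or CRLF line endings
        next unless $line =~ /^\d+$/;      # skip blank or malformed lines
        my $offset = $line - $base;        # bit position for this value
        die "value $line is below the assumed base $base\n" if $offset < 0;
        next if vec($seen, $offset, 1);    # bit already set: duplicate
        vec($seen, $offset, 1) = 1;        # mark as seen
        print {$out} "$line\n";
        $kept++;
    }
    return $kept;
}

# Only runs when a file name is given: script.pl FILE [BASE]
if (@ARGV) {
    my ($file, $base) = @ARGV;
    $base = 100_000_000_000 unless defined $base;   # assumed default
    open my $in, '<', $file or die "Cannot open $file: $!\n";
    my $kept = dedup_numbers($in, \*STDOUT, $base);
    close $in;
    warn "$kept unique values\n";
}
```

Memory use with this approach depends on the spread of the values rather than on the number of lines: a span of around 200 million distinct possible values would need roughly 25 MB of bits. Whether the real data fits that span is an assumption; widely scattered values would need a different structure.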


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: DBI::SQLite slowness
by McA (Priest) on Sep 20, 2013 at 04:01 UTC

    Hi BrowserUk,

    Probably it's too early in the morning, but I can't see the script doing the dedup. There seems to be a magic program/script called 1054929. What am I missing?

    Best regards
    McA

        Following up on Re: vec overflow? an adjustable example (mini vec tutorial)

        The @ in the output it produces means "at"; it's not an actual array :) For example, 0@[ 0][ 0] means the number zero is stored in the first (zero-th) seen_vecs string, at the first (0-th) bit of that string. Neat how that works: id zero lands on the zero-th bit, so the id itself is the bit offset :)

        This can help with the vec syntax :) Bit::Vector::Minimal - Object-oriented vec wrapper
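
To make the "which string, which bit" idea concrete, here is a minimal sketch. It is not the code from the linked tutorial. The seen_vecs name comes from the description above; the chunk size of 8 bits per string and the helper names locate and test_and_set are assumptions chosen to keep the example small.

```perl
use strict;
use warnings;

# Assumed chunk size: how many ids each bit string covers (illustrative only).
my $BITS_PER_VEC = 8;
my @seen_vecs;    # array of strings, each used as a small bit vector

# Hypothetical helper: return (index of the string, bit offset within it)
# for a given non-negative integer id.
sub locate {
    my ($id) = @_;
    return (int($id / $BITS_PER_VEC), $id % $BITS_PER_VEC);
}

# Hypothetical helper: mark $id as seen; returns true if it was already set.
sub test_and_set {
    my ($id) = @_;
    my ($index, $bit) = locate($id);
    $seen_vecs[$index] = '' unless defined $seen_vecs[$index];
    my $was_seen = vec($seen_vecs[$index], $bit, 1);   # read the single bit
    vec($seen_vecs[$index], $bit, 1) = 1;              # vec() is an lvalue
    return $was_seen;
}
```

With these assumptions, locate(0) gives string 0, bit 0, and locate(10) gives string 1, bit 2. Splitting the bits across several strings is one way to keep any single vec offset small; a single long string also works when the offsets stay within what the local perl build supports.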

      What am I missing?

      Nothing. I didn't post the actual script; just demonstrated that this wasn't an idle boast.

