Re^2: Efficient way to handle huge number of records?

by erix (Prior)
on Dec 11, 2011 at 14:11 UTC ( [id://942949] )


in reply to Re: Efficient way to handle huge number of records?
in thread Efficient way to handle huge number of records?

Any DB that couldn't handle that few records would not be worthy of the name. Even MySQL or SQLite should easily handle low billions of records without trouble.

I would be quite interested to see SQLite do this. (may even try it myself...)

In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow (even on fast hardware), that I never bothered with further use.

I'd love to hear that this has improved - SQLite is nice, when it works. Does anyone have recent datapoints?

(As far as I am concerned, MySQL and BerkeleyDB, as Oracle products, are no longer a serious option (I am convinced Oracle will keep making things worse for non-paying users), but I am interested to know how their performance (or Oracle's itself, for that matter) compares to PostgreSQL.)


Replies are listed 'Best First'.
Re^3: Efficient way to handle huge number of records?
by baxy77bax (Deacon) on Dec 11, 2011 at 17:01 UTC
    That is not true. SQLite is the fastest DB engine I have ever come across. You just need to increase the read-buffer size to, let's say, 4 MB and do the import within a transaction, and you will see that it can import the above values with no problem in under a minute, where MySQL will take much longer. And since it keeps everything in RAM, query time is again much, much faster. So if you or anyone else is looking for a fast DB engine without the fancy-schmancy features that I rarely use anyway, SQLite is the way to go.
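    For example, something along these lines with DBI and DBD::SQLite (just a minimal sketch of the idea; the database, table and column names are made up, and the data is assumed to arrive as tab-separated id/sequence pairs on STDIN):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=seqs.db', '', '',
                                { RaiseError => 1, AutoCommit => 1 } );

        # A negative cache_size is interpreted by SQLite as KiB, so this is ~4 MB.
        $dbh->do('PRAGMA cache_size = -4000');
        $dbh->do('CREATE TABLE IF NOT EXISTS seqs ( id TEXT PRIMARY KEY, seq TEXT )');

        my $sth = $dbh->prepare('INSERT INTO seqs ( id, seq ) VALUES ( ?, ? )');

        $dbh->begin_work;                 # one big transaction instead of one per row
        while ( my $line = <STDIN> ) {
            chomp $line;
            my ( $id, $seq ) = split /\t/, $line, 2;
            $sth->execute( $id, $seq );
        }
        $dbh->commit;
        $dbh->disconnect;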

    So if you need a DB engine that is fast and reliable and can deal with lots of data, you will want SQLite.

    Now, as far as the initial question goes, you can do something similar to what MySQL does. You could split the file into chunks and index the chunks by line number, so that you know on which line the header of your sequence appears. Once you have done that, you only need to hash those indexes. This will reduce the number of searches in proportion to the number of fragments you have after chopping up your initial file.
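    Something like this, for instance (a rough sketch only; the file name, the '>'-header format and the 'some_id' lookup are assumptions about the original data):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $file = 'sequences.fa';        # hypothetical input file
        my %index;                        # header id => [ line number, byte offset ]

        open my $fh, '<', $file or die "open $file: $!";
        my $line_no = 0;
        while (1) {
            my $pos  = tell $fh;          # offset of the line we are about to read
            my $line = <$fh>;
            last unless defined $line;
            $line_no++;
            $index{$1} = [ $line_no, $pos ] if $line =~ /^>(\S+)/;
        }

        # Later: jump straight to a record instead of rescanning the whole file.
        # 'some_id' is just a placeholder header id.
        if ( my $entry = $index{'some_id'} ) {
            seek $fh, $entry->[1], 0;     # 0 == SEEK_SET
            my $header = <$fh>;
            print "found at line $entry->[0]: $header";
        }
        close $fh;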

Re^3: Efficient way to handle huge number of records?
by BrowserUk (Patriarch) on Dec 11, 2011 at 16:37 UTC
    In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow

    I said "handle" not "handle well" :)

    That said, I had SQLite on my old machine and found that .import file table via sqlite3.exe was substantially faster than doing inserts via SQL, whether from the command-line utility or via Perl & DBI.
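    Roughly this sort of thing, for anyone who wants to try it (a sketch only, on *nix at least; it assumes sqlite3 is on the PATH and the data is already in a tab-separated file, and the file and table names are invented):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $db  = 'records.db';
        my $tsv = 'records.tsv';          # tab-separated: id <TAB> sequence

        # Pipe commands into the sqlite3 command-line shell and let its
        # .import do the bulk load instead of issuing INSERTs through DBI.
        open my $sqlite, '|-', 'sqlite3', $db or die "cannot run sqlite3: $!";
        print {$sqlite} "CREATE TABLE IF NOT EXISTS recs ( id TEXT PRIMARY KEY, seq TEXT );\n";
        print {$sqlite} ".mode tabs\n";
        print {$sqlite} ".import $tsv recs\n";
        close $sqlite or die "sqlite3 exited with an error: $?";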

    I wish I could get a 64-bit build for my system.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re^3: Efficient way to handle huge number of records?
by flexvault (Monsignor) on Dec 16, 2011 at 14:25 UTC

    erix,

    Sorry to take so long, real work got in the way.

    Here's the updated info. I used BrowserUk's sub to generate the data. The key part was 40 bytes long, but the data part came out as 320 bytes, so I 'substr'ed it to 80 (if 320 is correct, I can run the tests again). I ran the test for 1_000_000 records, since for your purposes it didn't matter. Also, I multiplied the times by 1000 to get the results in milliseconds. I generated the random keys at the beginning, so that they wouldn't be in cache.

    Here it is:

    # Build the records, remembering ~10 keys to look up later.
    while ( $cnt < $howmany ) {
        $key  = rndStr( 40, 'a' .. 'z' );
        $data = substr( rndStr( 80, qw[a c g t] ), 0, 80 );
        if ( ( ( $cnt % 113 ) == 0 ) && ( scalar keys %khash < 10 ) ) {
            $khash{$key} = 0;
        }
        .
        .
        .
    # Four passes of timed random lookups on the remembered keys.
    for ( 1 .. 4 ) {
        foreach $key ( keys %khash ) {
            $stime = gettimeofday;
            $ret   = $cursor->c_get( $key, $data, DB_SET );
            $etime = sprintf( "%.6f", ( gettimeofday - $stime ) * 1_000 );
            print " $key Time: $etime ms\t$hkey\n";
        }
    }

    Running it gives this output:

    # time perl Show11M_mod.plx
    cds_enabled
    ## Start: VSZ-10292_KB RSS-4828_KB
    BLOCK: 512  ( 1000000 )
    Write: 1049.66578292847  952/sec  1000000
    ReadNext: 28.9542100429535  34537/sec  Total: 1000000
    ## End: VSZ-10292_KB RSS-6284_KB Diff:0|1456_KB
    BLOCK: 512
    rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.164032 ms
    evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.089884 ms
    qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.064135 ms
    pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.066996 ms
    wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.050068 ms
    ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.043869 ms
    dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.052929 ms
    xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.050068 ms
    qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.053167 ms
    rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.057936 ms
    rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.012875 ms
    evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.011921 ms
    qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.010967 ms
    pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.010967 ms
    wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.010967 ms
    ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.011206 ms
    dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.010967 ms
    xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.010967 ms
    qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.012159 ms
    rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.010967 ms
    rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.011921 ms
    evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.012159 ms
    qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.012159 ms
    pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.010967 ms
    wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.010014 ms
    ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.010967 ms
    dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.010014 ms
    xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.010967 ms
    qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.010967 ms
    rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.010014 ms
    rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.011921 ms
    evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.011921 ms
    qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.010967 ms
    pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.010967 ms
    wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.010967 ms
    ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.010967 ms
    dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.010967 ms
    xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.010967 ms
    qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.010967 ms
    rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.010967 ms

    real    18m17.387s
    user    1m52.459s
    sys     0m34.850s

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      I used BrowserUk's sub to generate the data. The key part was 40 bytes long, but the data part came out as 320 bytes, so I 'substr'ed it to 80 ...
      $data = substr( rndStr( 80, qw[a c g t] ), 0, 80 );

      Sorry, but you must have typo'd or c&p'd my code incorrectly, because there should be no need to substr the output of rndStr():

      sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] };;

      $x = rndStr( 80, qw[a c g t] );;
      print length $x, ':', $x;;
      80 : actaatcttgcgccgcggcttcatacgagatgaatagtacgaaaacttggatacacctgtatcatagaagggccgctgcg


        BrowserUk,

        I downloaded the code sample you provided, and it worked like a charm (after I converted it from DOS format to Unix format).

        Originally I did a cut and paste and must have caused the problem...Sorry

        Thank you

        "Well done is better than well said." - Benjamin Franklin

Re^3: Efficient way to handle huge number of records?
by Marshall (Canon) on Dec 19, 2011 at 11:15 UTC
    In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow (even on fast hardware), that I never bothered with further use.

    I showed the framework of some code at Re^3: Efficient way to handle huge number of records? which is an abbreviated version of some code that I'm currently working on. One table has a million records of 50 fields. So I ran a couple of tests.

    First test was with all the speed-up stuff turned off:

    Starting: Fri Dec 16 13:56:56 2011
    Creating new Database!! - HD table
    records inserted: 1,117,526
    Ending: Sun Dec 18 02:15:30 2011
    Now I figure that qualifies as "ridiculously slow!". I actually had to run it twice because I got one of those "Windows Automatic Reboot was required" things! Bummer when that happens after one day of processing!

    Using the optimizations (and by FAR and away the biggest effect is doing a single transaction!) results in:

    Starting: Sun Dec 18 15:26:53 2011
    Creating new Database!! - HD table
    records inserted: 1,117,526
    Ending: Sun Dec 18 15:29:44 2011
    Or about 3 minutes instead of 2 days! A lot better! This is fast enough for my needs. Using the bulk import utility would probably be faster, but I haven't tested that because 3 minutes doesn't bother my application.
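    To make explicit what changed between the two runs: it is just where the transaction boundary sits. A bare-bones illustration (assuming a DBD::SQLite handle in $dbh and the rows already in memory as array refs in @rows; the 50-column table below is only a stand-in, not my real schema):

        # With AutoCommit on, every execute() is its own transaction and
        # SQLite waits for the disk each time - that is the "2 days" case.
        my $sql = 'INSERT INTO hd VALUES (' . join( ',', ('?') x 50 ) . ')';
        my $sth = $dbh->prepare($sql);
        $sth->execute(@$_) for @rows;

        # One transaction around the whole load - the "3 minutes" case.
        $dbh->begin_work;
        $sth->execute(@$_) for @rows;
        $dbh->commit;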

    I have another app that builds a 500K-record table, and it builds it from 1,000 input files. That takes about half the time, or about 90 seconds. It's not worth my programming effort to emit an intermediate file in whatever format the bulk import utility needs - I just put the data into the DB right away. A reasonable programming tradeoff. Mileage varies.

    It should be noted that my machine is an older one: a hyper-threaded one (from before the multi-core days), the Prescott stepping - the last one with a PGA (pin grid array) - and my disks are only 5K rpm (not 7K+). A more modern machine can run a single process at least 4x this fast, or about 45 seconds instead of 3 minutes (I've benchmarked my machine vs a friend's on similar tasks before).

    The time scales linearly, so 11M records would take 10x as long. Is that "ridiculously slow"? I don't know. I guess that depends upon the application.

    I do have a MySQL server running on my machine and in the past I've done some benchmarking vs SQLite. For a complicated query, MySQL is faster, but for my current projects, SQLite "wins" due to admin simplicity (none required!).
