
Re: multi threading DBI

by sundialsvc4 (Abbot)
on Oct 29, 2013 at 18:42 UTC ( #1060201 )


in reply to multi threading DBI

Nope ... in this case ... wrong answer. (And this is meant in the most impersonal manner possible, so nobody gets their feelings hurt, please.) If “SQLite is too slow,” then the problem is going to turn out to be that “you must use transactions.” Until you do that, SQLite will be dog-slow; after you do it, it’s extremely fast. Blisteringly fast.

Outside of a transaction, SQLite physically verifies every disk write by flushing to disk, re-reading the block, and comparing it. And nothing gets held in memory: in-memory copies of a block are not re-used, in case the underlying data might have changed. This is a fundamental design aspect of the system. When a transaction is under way, however, SQLite does as much as it can in in-memory buffers, with “lazy writes” and all of the things you would ordinarily expect. Whether you are reading or writing, always do anything-and-everything within the scope of a transaction, and your performance problems will be solved.

Obviously, I am being a little bit brash here, but in this case I speak from very first-hand experience. I had concluded “okay, she’s a dog,” then discovered just how dramatically(!) wrong I was. (“Even for reading?!” Yes, even for reading. “Will it really make that much difference?” Yes, that much.)
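A minimal sketch of the transaction pattern being described here, assuming DBI with DBD::SQLite, a throwaway example.db3 file, and a purely illustrative items table (none of these names come from the thread):

    use strict;
    use warnings;
    use DBI;

    # With AutoCommit on, every statement is its own transaction,
    # and therefore its own flush to disk.
    my $dbh = DBI->connect( "dbi:SQLite:dbname=example.db3", "", "",
        { RaiseError => 1, AutoCommit => 1 } ) or die $DBI::errstr;

    $dbh->do("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, txt TEXT)");

    # Wrapping the whole batch in one explicit transaction lets SQLite
    # buffer the work in memory and flush once, at commit time.
    $dbh->begin_work;

    my $sth = $dbh->prepare("INSERT OR IGNORE INTO items (id, txt) VALUES (?, ?)");
    $sth->execute( $_, "row $_" ) for 1 .. 10_000;

    $dbh->commit;       # one flush for all 10,000 rows
    $dbh->disconnect;

The same effect can be had by connecting with AutoCommit => 0 and calling $dbh->commit at suitable intervals.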


Re^2: multi threading DBI
by BrowserUk (Pope) on Oct 29, 2013 at 19:13 UTC

    All of this is completely orthogonal to the OP's question.

    Thus -- even if it is essentially correct, which is debatable -- it is totally irrelevant drivel in the context of the question asked.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      As the OP didn't specify, we are left to presume which part of the process is deemed to be too slow. Experience strongly suggests that -- given the scraper sits in the same linear thread as the DBI operation, on which it must wait before proceeding to the next scrape -- what is troubling the OP is almost certainly how slowly it moves from one scrape operation to the next.

      Thus BrowserUK is probably correct that relief is to be found in having a multi-threaded scraper stack its updates onto a queue drained by a single-threaded DBI writer -- it moves the most likely point of contention and delay outside the loop which constrains the operation that is, in the OP's words, "too slow".

      However, it is only prudent to point out that if this assumption proves incorrect, and the OP is hoping to speed up the SQLite updates themselves rather than merely reduce the scraper's waiting time, then the scraper::queue::DBI model (a sketch of it follows at the end of this post) might not produce the gains the OP was hoping for. It's a corner case, almost certainly, but the corner is not imaginary.

      I agree that, absent information to the contrary, it is a side note and not the meat of the response. But I don't think sundialsvc4's response is necessarily out-of-band.
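
      A minimal sketch of that scraper::queue::DBI arrangement, assuming Perl ithreads with Thread::Queue, a throwaway scrape.db3 database, and hypothetical get_urls() / scrape_one() stand-ins for the fetching and parsing:

      use strict;
      use warnings;
      use threads;
      use Thread::Queue;
      use DBI;

      my $N_SCRAPERS = 4;
      my $work       = Thread::Queue->new;   # URLs waiting to be scraped
      my $results    = Thread::Queue->new;   # parsed rows waiting to be written

      # Several scraper threads fetch and parse in parallel and push their
      # results onto the queue; none of them touches the database.
      my @scrapers = map {
          threads->create( sub {
              while ( defined( my $url = $work->dequeue ) ) {
                  my $row = scrape_one($url);      # hypothetical fetch-and-parse
                  $results->enqueue($row) if $row;
              }
              $results->enqueue(undef);            # tell the writer this scraper is done
          } );
      } 1 .. $N_SCRAPERS;

      # A single writer thread owns the only database handle.
      my $writer = threads->create( sub {
          my $dbh = DBI->connect( "dbi:SQLite:dbname=scrape.db3", "", "",
              { RaiseError => 1, AutoCommit => 0 } ) or die $DBI::errstr;
          $dbh->do("CREATE TABLE IF NOT EXISTS scraped_items (url TEXT, txt TEXT)");
          my $sth = $dbh->prepare("INSERT INTO scraped_items (url, txt) VALUES (?, ?)");

          my $finished = 0;
          while ( $finished < $N_SCRAPERS ) {
              my $row = $results->dequeue;
              if ( !defined $row ) { ++$finished; next }
              $sth->execute( $row->{url}, $row->{txt} );
          }
          $dbh->commit;                            # one transaction for the whole batch
          $dbh->disconnect;
      } );

      $work->enqueue($_) for get_urls();           # hypothetical list of pages to fetch
      $work->enqueue(undef) for 1 .. $N_SCRAPERS;  # one terminator per scraper

      $_->join for @scrapers;
      $writer->join;

      # Hypothetical stand-ins for the fetching/parsing that is not shown here:
      sub get_urls   { return () }
      sub scrape_one { my ($url) = @_; return { url => $url, txt => '' } }

      Because every row is funnelled through one handle, the SQLite file sees a single writer, and the whole batch can be committed in one transaction.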

        Let's see what the OP says for sure, but given "scrape a sports news site and then enter details into sqlite", the work breaks down into:

        1. fetching pages across the internet;
        2. parsing html to extract data;
        3. loading the extracts into a DB;

        Which bit(s) are likely to be the bottleneck? Which bit(s) are likely to be helped by multi-threading?



        I definitely agree with what you said, marinersk. From the context, I would think that what is slow is collecting the data over the Internet; therefore BrowserUK's suggestion to multithread the web-scraping part is the first solution that came to my mind, and I believe most probably the right one. Having said that, sundialsvc4's response also makes an interesting point which is worth taking into consideration (and can very easily be tested with minimal code changes).

        The point is that we would really need profiling of the program to figure out what is really slow.
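
        For instance, one common way to do that, assuming Devel::NYTProf is installed and the script is called scraper.pl (a hypothetical name):

        perl -d:NYTProf scraper.pl    # run once under the profiler; writes ./nytprof.out
        nytprofhtml                   # turn nytprof.out into an HTML report under ./nytprof/

        The per-subroutine timings make it obvious whether the time is going into the HTTP fetches, the parsing, or the DBI calls.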

Re^2: multi threading DBI
by dasgar (Deacon) on Oct 29, 2013 at 19:21 UTC

    Ummmmmmm...perhaps I'm misreading, but my understanding of the OP was that the single-threaded script was "slow". However, I didn't see any statements/claims/speculations about which part of the script was slow. Did I miss something?

Re^2: multi threading DBI
by Anonymous Monk on Oct 30, 2013 at 06:54 UTC
    I'm the OP. Thanks everyone, and just to clarify a bit more: the script is slow because of the single-threaded internet connection. Yes, writing to sqlite is not fast (compared to most other real database servers), but it is insignificant compared to grabbing stuff from the internet. Thanks BrowserUK for the list of stuff, I'm still going thru that, haven't found some source code that I can use yet. Meanwhile, I am actually opening a new dbh in each thread of my scraper (stupid and not efficient.....)
      Thanks BrowserUK for the list of stuff, I'm still going thru that, haven't found some source code that I can use yet

      If you post your single threaded code -- or a working, cut-down version of it -- I'll multi-thread it for you in the way I outlined; but I'm not going to try and re-create your code based upon your brief description.


        Thanks for offering, I have cut out the main part here:
        my $dbh = DBI->connect( "dbi:SQLite:dbname=event.db3", "", "",
            { RaiseError => 1, AutoCommit => 0 },
        ) or die $DBI::errstr;

        my @arr_uname = @{ $dbh->selectcol_arrayref("select uname from event") };

        foreach my $uname (@arr_uname) {
            ###
            ### all the LWP::Simple and the parsing happens here
            ###
            ### [deleted]

            ### output result
            if ($result ne '') {
                $dbh->do("INSERT OR IGNORE INTO detail_info (id,uname,url,thumb,txt) VALUES (?,?,?,?,?)",
                    undef, $$result{'id'}, $uname, $$result{'url'}, $$result{'thumb'}, $$result{'txt'});
                $dbh->do("INSERT OR IGNORE INTO score_info (id,uname,tid1,tid2,team1,team2) VALUES (?,?,?,?,?,?)",
                    undef, $$result{'id'}, $uname, $$result{'tid1'}, $$result{'tid2'}, $$result{'team1'}, $$result{'team2'});
                $dbh->commit;
            }
        }

        $dbh->disconnect();    # disconnect only after all the INSERTs are done
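
        As a follow-up to the transaction discussion above: this loop already connects with AutoCommit => 0, but it commits after every row, so each scraped page still pays for a flush. A small, hypothetical variation of the same loop (reusing the $dbh, @arr_uname and $result from the code above) that commits every 100 rows instead:

        my $pending = 0;
        foreach my $uname (@arr_uname) {
            ### the same LWP::Simple fetching and parsing as above
            next if $result eq '';
            $dbh->do("INSERT OR IGNORE INTO detail_info (id,uname,url,thumb,txt) VALUES (?,?,?,?,?)",
                undef, $$result{'id'}, $uname, $$result{'url'}, $$result{'thumb'}, $$result{'txt'});
            $dbh->commit unless ++$pending % 100;   # flush every 100 rows rather than every row
        }
        $dbh->commit;        # flush whatever is left over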
