Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

multi threading DBI

by Anonymous Monk
on Oct 29, 2013 at 17:31 UTC ( [id://1060189]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

i am starting a web scraper project, it scrape a sports news site and then enter details into sqlite. it works perfectly in single thread mode, but its just too slow. i want to upgrade the script to a multithreaded one, but DBI does not support multithreading... what should i do?

Replies are listed 'Best First'.
Re: multi threading DBI
by BrowserUk (Patriarch) on Oct 29, 2013 at 17:36 UTC
    i want to upgrade the script to a multithreaded one, but DBI does not support multithreading... what should i do?

    Do your scraping from multiple threads and use a single separate thread to do the DBI stuff.

    Have the scraping threads post the data to be entered into the DB, to the DBI thread -- as plain text -- via a queue.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Can u kindly point me to any example code for this thread::queue? i am very new to threads...

        Take your pick


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: multi threading DBI
by sundialsvc4 (Abbot) on Oct 29, 2013 at 20:35 UTC

    Ladies and gentlemen, please forgive my “brashness” in advancing what may at first well appear to be “an irrelevant, sideways, Sundialsvc4-typically-off-the-topic guess ... and ... well ... hear me out on this one.

    SQLite is not slow. ... ... except ... ... except ... ... in one very counter-intuitive case that just happens to be tremendously well-known to me, because for a time it absolutely show-stopped an equally important project of mine that also used this database.   I was shocked at how hideous SQLite’s performance was!   Then, one day, as I was amusing myself with the “activity monitor” while waiting for my script to grind on ... I noticed something.   My script was doing a truly-extraordinary number of disk operations.   The entirety of what was bringing this script to its most-improbable knees was ... disk I/O.

    It was only-then that I looked for, and found (somewhere ...) a discussion of what SQLite actually does and why it does it.   This now-forgotten page discussed how SQLite re-reads what is in memory and how it re-reads after it writes, and how “transactions” are a Magickal Cure for all of that.   I had never encountered a file-based application before that did that.   I tried it for myself and ... well ... “holy sh*t!”   All of the sudden, this thing took-off like the proverbial bat.

    Hence ... my guess.   My guess that “too slow” is in fact due to a cause that I had not suspected, either, when I encountered it and was show-stopped by it for the first time.   My guess that “threading, or the lack/presence thereof,” most likely has nothing to do with it.   I therefore injected this admittedly out-of-band comment, to suggest a completely-different stone that should be turned-over first.

    Before getting started with multi-threading this thing ... try this, first.   Wrap every activity that writes or that reads to this database in a transaction.   (If you are issuing a large number of statements, periodically commit.)   But, do not do anything meaningful outside of one.   Reserve judgment about “what to do with this program” until you have tried this first.   (And if you already have done this, just send me to bed once again without any supper.)

      Frankly, absent profiling tools, I think testing both solutions has merit.

      With profiling tools, the answer will likely become obviously with little to no core code change.

      And were I a betting man, I'd say both solutions are likely to produce improvements, with a smaller bet that multithreading would likely yield a greater gain in the scrapes/minute category. Improving the back end (DB updates/minute) can only be a good thing, so I am inclined to favor doing both in any regard.

      But the multi-threading scraper into a queue with a single-threaded DBI operation sounds like the more fun engineering exercise (and I admit this should be an irrelevant attribute in the decision making process but I had to throw it out there).

      Encapsulating DB updates in transactions is almost no development effort at all (and highlights that it should probably be done anyway).

      How's that for sitting on the fence?  :-)

      Wrap every activity that writes or that reads to this database in a transaction.

      Reading from the database does not require disk-sync calls and wrapping them into a transaction won't help. Besides, transactions themselves don't help one bit unless you wrap multiple writes into a single transaction. (I guess you understand that but failed to say so.)

Re: multi threading DBI
by sundialsvc4 (Abbot) on Oct 29, 2013 at 18:42 UTC

    Nope ... in this case ... wrong answer.   (And this is meant in the most impersonal manner possible, so nobody get their feelings hurt please.)   If “SQLite is too slow,” then the problem is going to turn out to be that:   “you must use transactions.”   Until you do that, SQLite will be dog-slow; after you do that, it’s extremely fast.   Blisteringly fast.

    Outside of the auspices of a transaction, SQLite physically verifies every disk-write, by flushing to disk, re-reading the block and comparing it.   And, nothing gets held in memory.   In-memory copies of a block are not re-used in case the underlying data might have changed.   This is a fundamental design-aspect of this system.   When a transaction is under way, however, SQLite does as much as it can in in-memory buffers, with “lazy writes” and all of the things that you would ordinarily expect.   Whether you are reading or writing, always do anything-and-everything within the scope of a transaction, and your performance problems will be solved.

    Obviously, I am being a little-bit brash here, but I do in this case speak from very first-hand experience.   I had concluded okay, she’s a dog, then discovered just how dramatically(!) wrong I was.   (“Even for reading?!”   Yes, even for reading.   “Will it really make that much difference?”   Yes, that much.)

      All of this is completely orthogonal to the OP's question.

      Thus -- even if it is essentially correct, which is debatable -- makes it totally irrelevant drivel in the context of the question asked.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        As the OP didn't specify, we are left to presume what it is about the process that is deemed to be too slow. Experience does strongly suggest that -- given the scraper is in the same linear thread as the DBI operation upon which it must wait before proceeding to the next scrape -- the part which is troubling the OP is almost certainly the speed with which it proceeds from one scrape operation to the next.

        Thus BrowserUK is likely correct that relief is likely to be found in stacking up the updates guided by a multi-threaded scraper to a single-threaded DBI queue -- it moves the most likely point of contention and delay outside the loop which constrains the operation which is deemed to be, in the OP's words, "too slow".

        However, it is only prudent to point out that if this assumption were to prove to be incorrect, and the user is hoping to improve the speed of the SQLite updates and not merely improve the scraper hang time, that the scraper::queue::DBI model might not produce the gains the OP was hoping for. It's a corner case, almost certainly, but the corner is not imaginary.

        I agree that, absent information to the contrary, it is a side note and not the meat of the response. But I don't think sundialsvc4's response is necessarily out-of-band.

      Ummmmmmm...perhaps I'm misreading, but my understanding of the OP was that the single-threaded script was "slow". However, I didn't see any statements/claims/speculations about which part of the script was slow. Did I miss something?

      I'm the OP. Thanks everyone and just to clarify a bit more: the script is slow because of the single thread internet connection. Yes, writing to sqlite is not fast (compared to most other real database server), but it is insignificant compared to grabbing stuff from the internet. Thanks BrowserUK for the list of stuff, I'm still going thru that, haven't found some source code that I can use yet. Meanwhile, I am actually opening a new dbh in each thread of my scrapper (stupid and not efficient.....)
        Thanks BrowserUK for the list of stuff, I'm still going thru that, haven't found some source code that I can use yet

        If you post your single threaded code -- or a working, cut-down version of it -- I'll multi-thread it for you in the way I outlined; but I'm not going to try and re-create your code based upon your brief description.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1060189]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others about the Monastery: (4)
As of 2024-03-19 02:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found