
Re^3: multi threading DBI

by marinersk (Priest)
on Oct 29, 2013 at 19:28 UTC (#1060206)

in reply to Re^2: multi threading DBI
in thread multi threading DBI

As the OP didn't specify, we are left to presume what it is about the process that is deemed to be too slow. Experience does strongly suggest that -- given the scraper is in the same linear thread as the DBI operation upon which it must wait before proceeding to the next scrape -- the part which is troubling the OP is almost certainly the speed with which it proceeds from one scrape operation to the next.

Thus BrowserUk is probably correct that relief is to be found in having a multi-threaded scraper feed its updates to a single-threaded DBI queue -- this moves the most likely point of contention and delay outside the loop that constrains the operation the OP calls "too slow".
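
For anyone following along, here is a minimal sketch of that scraper-to-queue-to-writer shape. It assumes a perl built with ithreads and uses only the core threads and Thread::Queue modules. The names scrape_page, store_row and run_pipeline, and the example.com URLs, are placeholders of mine and not anything from the OP's code. The database write is stubbed out; in a real program the DBI handle would be opened inside the writer thread, since DBI handles should not be shared between threads.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for fetching and parsing one page.
sub scrape_page {
    my ($url) = @_;
    return { url => $url, title => "headline for $url" };
}

# Hypothetical stand-in for the database write. In a real program this is
# where a prepared DBI statement would be executed.
sub store_row {
    my ($row) = @_;
    return 1;
}

sub run_pipeline {
    my ($n_workers, @urls) = @_;

    my $work    = Thread::Queue->new(@urls);   # pages still to scrape
    my $results = Thread::Queue->new;          # rows waiting to be stored

    # The single writer thread: the only place that would touch the database.
    my $writer = threads->create(sub {
        my $stored = 0;
        while (defined(my $row = $results->dequeue)) {
            $stored += store_row($row);
        }
        return $stored;
    });

    # Scraper threads: take a URL, scrape it, hand the row to the writer.
    my @scrapers;
    for (1 .. $n_workers) {
        push @scrapers, threads->create(sub {
            while (defined(my $url = $work->dequeue_nb)) {
                $results->enqueue(scrape_page($url));
            }
        });
    }

    $_->join for @scrapers;
    $results->enqueue(undef);   # tell the writer nothing more is coming
    return $writer->join;       # number of rows the writer stored
}

my @urls   = map { "http://example.com/story/$_" } 1 .. 8;   # placeholder URLs
my $stored = run_pipeline(4, @urls);
print "stored $stored rows\n";
```

The scrapers never wait on the database; they only wait on the queue, which is the point of the design. If the writes themselves are the slow part, this arrangement just moves the backlog into the results queue.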

However, it is only prudent to point out that if this assumption proves incorrect, and the OP is hoping to speed up the SQLite updates rather than merely reduce the time the scraper spends waiting, the scraper::queue::DBI model might not produce the gains the OP was hoping for. It's a corner case, almost certainly, but the corner is not imaginary.

I agree that, absent information to the contrary, it is a side note and not the meat of the response. But I don't think sundialsvc4's response is necessarily out-of-band.

Re^4: multi threading DBI
by Laurent_R (Canon) on Oct 29, 2013 at 22:43 UTC

    I definitely agree with what you said, marinersk. From the context, I would think that what is slow is collecting the data from the Internet. BrowserUk's solution of multithreading the web scraping part is therefore the first one that came to my mind, and most probably the right one. Having said that, sundialsvc4's response also makes an interesting point which is worth taking into consideration (and can very easily be tested with minimal code changes).

    The point is that we would really need profiling of the program to figure out what is really slow.
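
Short of a full profiler run (Devel::NYTProf is the usual choice), a rough first look can be had by timing each stage with the core Time::HiRes module. The sketch below is only an illustration under assumptions: fetch_page, parse_page and store_row are hypothetical stand-ins for the OP's real fetch, parse and DB code, and timed is a helper invented here.

```perl
use strict;
use warnings;
use Time::HiRes ();

my %elapsed;   # stage name => total wall-clock seconds

# Run a code block and add its wall-clock time to the named stage.
sub timed {
    my ($stage, $code) = @_;
    my $start  = Time::HiRes::time();
    my @result = $code->();
    $elapsed{$stage} += Time::HiRes::time() - $start;
    return wantarray ? @result : $result[0];
}

# Hypothetical stand-ins for the three stages of the scraper.
sub fetch_page { my ($url) = @_; return "<html><title>$url</title></html>" }
sub parse_page { my ($html) = @_; my ($title) = $html =~ m{<title>(.*?)</title>}; return $title }
sub store_row  { my ($title) = @_; return 1 }

for my $url (map { "http://example.com/story/$_" } 1 .. 5) {   # placeholder URLs
    my $html  = timed(fetch => sub { fetch_page($url) });
    my $title = timed(parse => sub { parse_page($html) });
    timed(store => sub { store_row($title) });
}

# Slowest stage first.
for my $stage (sort { $elapsed{$b} <=> $elapsed{$a} } keys %elapsed) {
    printf "%-6s %.6f s\n", $stage, $elapsed{$stage};
}
```

With the real fetch, parse and store code dropped in, whichever stage dominates the totals is the one worth attacking first.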

Re^4: multi threading DBI
by BrowserUk (Pope) on Oct 29, 2013 at 19:51 UTC

    Let's see what the OP says for sure, but given "scrape a sports news site and then enter details into sqlite", consider the three steps involved:

    1. fetching pages across the internet;
    2. parsing html to extract data;
    3. loading the extracts into a DB;

    Which bit(s) are likely to be the bottleneck? Which bit(s) are likely to be helped by multi-threading?

      You summarized my diatribe nicely.  :-)
