Are there any memory-efficient web scrapers?

by Anonymous Monk
on Aug 13, 2011 at 15:57 UTC (#920177)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I usually use WWW::Mechanize for my scraping needs, but for my latest project the memory usage is not acceptable. Yes, I do set stack_depth to 0. The problem is that Mech stores the responses in memory; on top of that there's the much larger decoded_content if the response was compressed, and then another copy of the HTML as it's passed to HTML::Form. Right now my single process is at 200MB, and I was planning on running multiple simultaneous processes.

I know I can save the response content to a file with :content_file, but then much of Mech's other functionality disappears, such as link and form parsing. Does a memory-efficient alternative exist that writes the content to files and incrementally parses the content?
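
For concreteness, a stripped-down sketch of the setup described above (the URL and filename are placeholders):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->stack_depth(0);    # no page history kept

    # Normal usage: the response, its decoded_content and the HTML::Form
    # copies all live in memory at once.
    $mech->get('http://example.com/listing.html');
    for my $form ( $mech->forms ) {
        print $form->action, "\n";
    }

    # Saving straight to disk avoids the in-memory copies, but then
    # $mech->forms() / $mech->links() have nothing to work with:
    $mech->get('http://example.com/listing.html', ':content_file' => '/tmp/listing.html');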

Re: Are there any memory-efficient web scrapers?
by Anonymous Monk on Aug 13, 2011 at 17:04 UTC

    Right now, my single process is at 200MB

    What size file/webpage are you processing?

      I'm only requesting HTML documents, so I added a handler to stop downloading the response content when the content type isn't text/*. I didn't think to monitor the size, though, so I'll set max_size now. But I still think I need to move to something that scales better. I was hoping something already exists, but I'm up for hacking on an AnyEvent or POE solution that incrementally parses the HTML, as it comes in or from file, with HTML::Parser or XML::LibXML.
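
      For the record, a sketch of that kind of guard (the size cap here is purely illustrative):

          use strict;
          use warnings;
          use WWW::Mechanize;

          my $mech = WWW::Mechanize->new;
          $mech->stack_depth(0);
          $mech->max_size(250_000);    # abort transfers beyond this many bytes

          # Once the headers arrive, refuse to accumulate the body of
          # anything that isn't text/*:
          $mech->add_handler(
              response_header => sub {
                  my ( $response, $ua, $handler ) = @_;
                  $response->{default_add_content} = 0
                      unless $response->content_type =~ m{^text/};
                  return;
              },
          );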

        solution that incrementally parses the HTML

        How do you know this is the bottleneck?

Re: Are there any memory-efficient web scrapers?
by BrowserUk (Pope) on Aug 13, 2011 at 21:24 UTC

    Reading between the lines of your post and making a few assumptions, I think I would go for a different architecture than the one you describe for this.

    Rather than having multiple, all-in-one fetch-parse-store processes, I'd split the concerns into three processes.

    1. A fetch-and-store process that writes files into a known directory.

      Unless you need the extras that Mechanize gives you, I'd use the (much) lighter LWP::Simple::getstore() for this. One instance per thread, with two or three threads per core, all feeding off a common Thread::Queue, can easily saturate the bandwidth of most (say, 100 Mbit) connections. (A sketch of such a fetcher pool appears just after this list.)

    2. A single script that monitors the inbound file directory

      would spawn as many concurrent copies of ...

    3. A simple, standalone parse-a-single-HTML-file-and-store-the-results process.

      ... as are either a) required to keep up with the inbound data rate; or b) the box can handle memory and/or processor wise.

      The monitor script could be based on threads & system, or on a module like Parallel::ForkManager, depending upon your OS and preferences. (A sketch of that loop follows the summary below.)
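
    A bare-bones sketch of the fetcher pool from step 1 (the directory, thread count and the way the URL list arrives are all illustrative):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use LWP::Simple qw( getstore );
        use HTTP::Status qw( is_success );
        use Digest::MD5 qw( md5_hex );

        my $INBOUND = '/tmp/inbound';            # the "known directory"
        my $THREADS = 8;                         # 2-3 per core
        my @urls    = @ARGV;                     # or however the URL list is produced

        my $q = Thread::Queue->new;

        my @workers = map {
            threads->create( sub {
                while ( defined( my $url = $q->dequeue ) ) {
                    my $file   = "$INBOUND/" . md5_hex($url) . '.html';
                    my $status = getstore( $url, $file );
                    warn "$url -> $status\n" unless is_success($status);
                }
            } );
        } 1 .. $THREADS;

        $q->enqueue(@urls);
        $q->enqueue( (undef) x $THREADS );       # one terminator per worker
        $_->join for @workers;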

    This separation of concerns would allow you to easily vary both the number of fetcher threads and the number of HTML parsers, matching them to the bandwidth available and to the processing power and memory limits of the box(es) this will run on, whilst keeping each of the three components very linear and easy to program.
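
    And a sketch of the monitor loop from steps 2 and 3, here using Parallel::ForkManager; the directory, process count and the parse_one.pl script name are all placeholders:

        use strict;
        use warnings;
        use Parallel::ForkManager;

        my $INBOUND = '/tmp/inbound';
        my $pm      = Parallel::ForkManager->new(4);   # b) whatever the box can handle

        while (1) {
            for my $file ( glob "$INBOUND/*.html" ) {
                my $claimed = "$file.working";
                next unless rename $file, $claimed;    # claim it; skip if another copy got there first

                $pm->start and next;                   # parent moves on to the next file
                # child: the standalone parse-a-single-file-and-store process from step 3
                system( 'perl', 'parse_one.pl', $claimed ) == 0
                    or warn "parse_one.pl failed on $claimed\n";
                unlink $claimed;
                $pm->finish;
            }
            $pm->wait_all_children;
            sleep 1;                                   # a) poll rate vs. the inbound data rate
        }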


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      When I scrape certain URLs, I have to submit a form if one is found on the page. Splitting the work into separate processing steps would drastically complicate that, since not just the content but the entire response would have to be saved for later processing so I can reuse the headers. Even then, it might break if the web server uses sessions and the session expires before I get around to processing it.

        I'm not aware of any such scraper. I would first try to subclass WWW::Mechanize to use some event-based parser, or even regular expressions, to extract the forms from the response. To save more memory, either do the parsing directly in the :content_cb callback, or store each page to disk and then separately parse the content from there again, either for forms or for data.

        The current trend within WWW::Mechanize skews somewhat towards using HTML::TreeBuilder for building a DOM, but if you have proposals for how an API that sacrifices the content for lower memory usage would look, I'm certainly interested, and maybe other people are as well.

        One thing I could imagine would be some kind of event-based HTML::Form parser that sits in the content callback of LWP, so that WWW::Mechanize (or whatever subclass) can extract that data no matter what happens to the content afterwards. But I'm not sure how practical that is, as the response sizes I deal with are far smaller.
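
        Purely as a sketch of that idea (the URL and filename are placeholders, and this captures nothing like everything HTML::Form knows about): an HTML::Parser instance fed chunk by chunk from LWP's :content_cb, noting form actions and input names while the body itself goes straight to disk.

            use strict;
            use warnings;
            use LWP::UserAgent;
            use HTML::Parser;

            my @forms;    # [ { action => ..., inputs => [ names... ] }, ... ]
            my $parser = HTML::Parser->new(
                api_version => 3,
                start_h     => [
                    sub {
                        my ( $tag, $attr ) = @_;
                        if ( $tag eq 'form' ) {
                            push @forms, { action => $attr->{action}, inputs => [] };
                        }
                        elsif ( $tag eq 'input' && @forms ) {
                            push @{ $forms[-1]{inputs} }, $attr->{name};
                        }
                    },
                    'tagname, attr'
                ],
            );

            open my $fh, '>', '/tmp/page.html' or die $!;
            my $ua = LWP::UserAgent->new;
            $ua->get(
                'http://example.com/form.html',
                ':content_cb' => sub {
                    my ( $chunk, $response ) = @_;
                    print {$fh} $chunk;        # body streams to disk...
                    $parser->parse($chunk);    # ...and through the incremental parser
                },
            );
            $parser->eof;
            close $fh;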

        Fair enough. Though that sounds more like driving interactive sessions than "web scraping".


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Are there any memory-efficient web scrapers?
by jdrago999 (Pilgrim) on Aug 15, 2011 at 03:14 UTC

    Does a memory-efficient alternative exist that writes the content to files and incrementally parses the content?

    I ran into a similar problem a while back and wrote WWW::Crawler::Lite as a result. You supply the `on_response` handler, which you can use to do anything you want (write the response to disk, parse it later/incrementally, etc.). You'll be starting a bit closer to the ground on this one (compared to WWW::Mechanize), but its memory footprint is tiny compared to Mech's.

    You can see an example spider which crawls search.cpan.org in the t/ folder of the module http://cpansearch.perl.org/src/JOHND/WWW-Crawler-Lite-0.003/t/010-basic/
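
    For what it's worth, a rough sketch of the write-to-disk idea: the `on_response` hook is the one named above, but the other constructor arguments and the crawl() call are from memory of the synopsis, so check the module's docs and the linked t/ example for the exact names.

        use strict;
        use warnings;
        use WWW::Crawler::Lite;
        use Digest::MD5 qw( md5_hex );

        my $crawler = WWW::Crawler::Lite->new(
            agent       => 'MyScraper/0.01',
            url_pattern => qr{^http://example\.com/},    # stay on one site
            on_response => sub {
                my ( $url, $res ) = @_;
                open my $fh, '>', '/tmp/inbound/' . md5_hex($url) . '.html' or die $!;
                print {$fh} $res->content;               # straight to disk, parse later
                close $fh;
            },
        );
        $crawler->crawl( url => 'http://example.com/' );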

Re: Are there any memory-efficient web scrapers?
by tmaly (Monk) on Aug 15, 2011 at 13:51 UTC

    If you are scraping a site that supports gzip encoding, you could set your user agent to something like a Firefox/Mozilla value. I have done this before and got gzip-encoded responses, which would cut down on your memory footprint.
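
    A sketch of that with WWW::Mechanize: the agent_alias line covers the Firefox-style agent string mentioned above, and (as an addition not mentioned above) asking for compressed transfer explicitly via Accept-Encoding is the more direct route:

        use strict;
        use warnings;
        use WWW::Mechanize;
        use HTTP::Message;

        my $mech = WWW::Mechanize->new;
        $mech->agent_alias('Windows Mozilla');    # present a browser-like User-Agent
        $mech->default_header( 'Accept-Encoding' => HTTP::Message::decodable() );

        $mech->get('http://example.com/');
        # decoded_content() transparently gunzips the compressed body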
