PerlMonks  

Re: Are there any memory-efficient web scrapers?

by Anonymous Monk
on Aug 13, 2011 at 17:04 UTC ( [id://920182] )


in reply to Are there any memory-efficient web scrapers?

    Right now, my single process is at 200MB

What size file/webpage are you processing?


Replies are listed 'Best First'.
Re^2: Are there any memory-efficient web scrapers?
by Anonymous Monk on Aug 13, 2011 at 19:55 UTC
    I'm only requesting HTML documents, so I added a handler that aborts the download when the response's content type isn't text/*. But I didn't think to monitor the size, so I'll set max_size now. Still, I think I need to move to something that scales better. I was hoping something already existed, but I'm up for hacking on an AnyEvent or POE solution that incrementally parses the HTML, as it arrives or from a file, with HTML::Parser or XML::LibXML.
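    The handler-plus-max_size setup described above might look like this with LWP::UserAgent — a minimal sketch, where the 512 KB cap and the die-based abort are illustrative choices, not details from the thread:

    ```perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Stop reading a response body once it exceeds 512 KB (placeholder limit).
    my $ua = LWP::UserAgent->new( max_size => 512 * 1024 );

    # Abort as soon as the headers arrive if the body isn't text/*;
    # the die() turns the request into an error response instead of
    # downloading (and buffering) the content.
    $ua->add_handler(
        response_header => sub {
            my ( $response, $ua, $handler ) = @_;
            die "not a text/* response\n"
                unless $response->content_type =~ m{^text/};
        }
    );

    # $ua->get($url) will now skip non-text bodies and truncate oversized ones.
    ```

    With max_size set, LWP marks truncated responses with an X-Died / Client-Aborted header rather than growing the buffer indefinitely, which addresses the "didn't think to monitor the size" point directly.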

      solution that incrementally parses the HTML

      How do you know this is the bottleneck?

        Bottleneck? By that I assume you are referring to processing speed. That is not my primary concern, and I made no mention of that in my question. I am concerned about memory usage when the scraped pages are parsed for forms and links.
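        For the memory concern specifically, an event-driven HTML::Parser pass can collect links and form actions without ever building a DOM, so memory stays flat regardless of page size. A sketch — the sample markup and chunk boundaries are made up for illustration:

        ```perl
        use strict;
        use warnings;
        use HTML::Parser;

        my ( @links, @forms );

        # Only the attributes we care about are retained; the rest of the
        # document is discarded as it streams through the parser.
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [
                sub {
                    my ( $tag, $attr ) = @_;
                    push @links, $attr->{href}
                        if $tag eq 'a' && defined $attr->{href};
                    push @forms, $attr->{action}
                        if $tag eq 'form' && defined $attr->{action};
                },
                'tagname,attr',
            ],
        );

        # Feed the document incrementally, e.g. from a socket or file read
        # loop; here two arbitrary chunks stand in for that stream.
        for my $chunk ( '<a href="/one">link', '</a><form action="/post"></form>' ) {
            $p->parse($chunk);
        }
        $p->eof;
        ```

        The same start_h callback works unchanged whether the chunks come from a file read loop or from an AnyEvent/POE on-body callback, which is what makes this shape a fit for the incremental approach proposed above.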
