PerlMonks  

Re: fix the problem of the web crawler

by bitingduck (Friar)
on Nov 08, 2012 at 16:29 UTC (#1002943)


in reply to fix the problem of the web crawler

In my limited experience with screen scrapers, that failure mode is usually caused by someone at the other end changing the page's formatting. You're matching the content you want with a regex rather than an HTML parser, so even very minor changes in the HTML can break it. Your best bet is probably to spend about 10 minutes looking at the page source and then revise the regex accordingly. Switching to HTML::TreeBuilder and taking advantage of predictable page structure and tag attributes might make your script a little more robust (or it might not, depending on who is messing with it at the other end...). Since I switched to HTML::TreeBuilder, I have had a scraper running reliably for several years through a number of changes in the target page's display format.
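
To give a rough idea of what I mean by leaning on structure and attributes, here is a minimal sketch. I haven't seen your target page, so the HTML fragment, the `results` table, the `title` class and the `extract_items` helper are all made up for illustration; you would substitute whatever tags and attributes your page actually uses.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page fragment standing in for the real target page.
my $html = <<'HTML';
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<table class="results">
  <tr><td class="title"><a href="/item/1">First item</a></td></tr>
  <tr><td class="title"><a href="/item/2">Second item</a></td></tr>
</table>
</body></html>
HTML

# Hypothetical helper: returns a list of { text, href } hashes,
# one per link found inside a <td class="title"> cell.
sub extract_items {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @items;

    # Select cells by tag name and class attribute, not by exact markup text,
    # so whitespace or attribute-order changes in the page do not matter.
    for my $cell ($tree->look_down(_tag => 'td', class => 'title')) {
        my $link = $cell->look_down(_tag => 'a') or next;
        push @items, {
            text => $link->as_trimmed_text,
            href => $link->attr('href'),
        };
    }

    $tree->delete;    # release the parse tree explicitly
    return @items;
}

my @items = extract_items($html);
print "$_->{text} -> $_->{href}\n" for @items;
```

The navigation link is skipped because it isn't inside a matching cell. This only helps as long as the class names and nesting stay put; if the other end renames them, you are back to reading the page source.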

