PerlMonks  

Re: fix the problem of the web crawler

by bitingduck (Chaplain)
on Nov 08, 2012 at 16:29 UTC (#1002943)


in reply to fix the problem of the web crawler

In my limited experience with screen scrapers, that failure mode is usually caused by someone at the other end changing the page's formatting. You're pulling out the data you want with a regex rather than an HTML parser, so even very minor changes in the HTML can break the match. Your best bet is probably to spend about 10 minutes looking at the page source and then revise the regex accordingly. Switching to HTML::TreeBuilder and taking advantage of predictable page structure and tag attributes might make your script a little more robust (or it might not, depending on who is messing with the page at the other end...). Since I switched to HTML::TreeBuilder, I have had a scraper running reliably for several years through a number of changes in the target page's display format.
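
For what it's worth, here is a rough sketch of the HTML::TreeBuilder approach. I don't know what your target page looks like, so the sample markup, the `item`, `name` and `price` class names, and the `extract_items` helper are all made up for illustration; you would swap in whatever tags and attributes the real page source shows.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page fragment; the real target page's markup is unknown.
my $html = <<'HTML';
<html><body>
<table id="results">
  <tr class="item"><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr class="item"><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table>
</body></html>
HTML

# Return a list of [name, price] pairs, located by tag name and class
# attribute rather than by the exact text layout of the markup.
sub extract_items {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @items;
    for my $row ($tree->look_down(_tag => 'tr', class => 'item')) {
        # In scalar context look_down returns the first matching element.
        my $name  = $row->look_down(_tag => 'td', class => 'name');
        my $price = $row->look_down(_tag => 'td', class => 'price');
        next unless $name && $price;    # skip rows missing either cell
        push @items, [ $name->as_text, $price->as_text ];
    }
    $tree->delete;                      # release the parse tree
    return @items;
}

print join("\t", @$_), "\n" for extract_items($html);
```

Because the lookup keys on tags and attributes, things like extra whitespace, reordered cells, or added inline markup inside a cell shouldn't break it the way they would break a tightly written regex. If the other end renames the classes or restructures the table, though, you're back to reading the page source.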
