Re: fix the problem of the web crawler

by bitingduck (Chaplain)
on Nov 08, 2012 at 16:29 UTC (#1002943=note)

in reply to fix the problem of the web crawler

In my limited experience with screen scrapers, that failure mode is usually caused by someone at the other end changing the page's formatting. You're matching the content you want with a regex rather than an HTML parser, so even very minor changes in the HTML can break it. Your best bet is probably to spend about 10 minutes looking at the page source and then revise the regex accordingly. Switching to HTML::TreeBuilder and taking advantage of predictable page structure and tag attributes might make your script a little more robust (or it might not, depending on who is messing with it at the other end...). Since I switched to HTML::TreeBuilder, I have had a scraper running reliably for several years through a number of changes in the target page's display format.
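To illustrate the difference, here is a rough sketch rather than a fix for your actual script. The two HTML fragments, the `price` class, and the sub names `price_by_regex` and `price_by_tree` are all made up for the example; your page will have its own structure and attributes. It assumes HTML::TreeBuilder is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragments: the same value before and after a site redesign.
my $old_html = '<table><tr><td class="price">12.50</td></tr></table>';
my $new_html = '<div><span class="price" style="color:red"> 12.50 </span></div>';

# Regex approach: tied to the exact tag, attribute order, and whitespace.
sub price_by_regex {
    my ($html) = @_;
    my ($price) = $html =~ m{<td class="price">([\d.]+)</td>};
    return $price;    # undef when the markup no longer matches
}

# Parser approach: find the first element whose class attribute is
# exactly "price", whatever tag it is and whatever else surrounds it.
sub price_by_tree {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $node  = $tree->look_down( class => 'price' );
    my $price = $node ? $node->as_trimmed_text : undef;
    $tree->delete;    # free the parse tree explicitly
    return $price;
}

for my $html ( $old_html, $new_html ) {
    my $by_regex = price_by_regex($html);
    my $by_tree  = price_by_tree($html);
    printf "regex: %s  tree: %s\n",
        defined $by_regex ? $by_regex : '(no match)',
        defined $by_tree  ? $by_tree  : '(no match)';
}
```

With these made-up fragments, the regex stops matching once the `td` becomes a `span`, while the `look_down` lookup still finds the value because it keys on the attribute rather than the surrounding markup. If the other end renames or drops the class, both approaches break, which is the "or it might not" case.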
