in reply to Re^2: extracting data from HTML
in thread extracting data from HTML
but of course my test website had to come back with an error
One tip for developing scrapers: save a local copy of the pages and develop against that. It's convenient for you and polite to the site you're scraping, since you can hammer the copy all you want without bothering their server. If you're scraping a lot of pages and doing a lot of tweaking on your code, you could otherwise put real load on someone's server. Once your extractor works, you can put back the Mechanize calls to the site, which are probably not the hard part.
In the example I gave upthread, it would have been OK for me to hammer the site, but I ended up cloning it with wget and running my code against the local copy.
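If cloning the whole site with wget is more than you need, a small on-disk cache gets you most of the same benefit. What follows is only a rough sketch, written in Python with the standard library rather than with Mechanize; the directory name `scrape_cache` and the helper names `cache_path`, `download`, and `fetch_cached` are made up for illustration. The idea is that each URL is downloaded at most once and every later run reads the saved file.

```python
import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path("scrape_cache")  # hypothetical directory name


def cache_path(url, cache_dir=CACHE_DIR):
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / (digest + ".html")


def download(url):
    """Fetch the page over the network (one request per call)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_cached(url, cache_dir=CACHE_DIR, fetch=download):
    """Return the page body, touching the network only on a cache miss."""
    path = cache_path(url, cache_dir)
    if path.exists():
        # Already saved locally: no request reaches the remote server.
        return path.read_bytes()
    body = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return body
```

While you are tweaking the extractor, call `fetch_cached(url)` wherever you would otherwise fetch the page. Deleting the cache directory forces a fresh download when you want current data.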
Update: You might also want to see if the site you're scraping has an API that hands you structured data. I recently had to pull down the links for about 140 books from the Apple site, and they have a nice API that lets you search by ISBN. Amazon also tends to have an API for a lot of things. Other sites often do as well if you dig through the fine print at the bottom of the page.
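To give a feel for how much simpler the API route can be, here is a hedged sketch in Python. The endpoint `https://api.example.com/lookup`, the `isbn` query parameter, and the `results`/`url` fields in the response are all placeholders I am assuming for illustration, not any particular vendor's real API; check the documentation of the site you are using for the actual names.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.example.com/lookup"  # hypothetical endpoint


def build_lookup_url(isbn, base_url=API_URL):
    """Build the query URL for one ISBN (parameter name is an assumption)."""
    query = urllib.parse.urlencode({"isbn": isbn})
    return base_url + "?" + query


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def links_for_isbn(isbn, get=http_get):
    """Return the links the API reports for an ISBN.

    Assumes a JSON body shaped like {"results": [{"url": "..."}]}.
    """
    payload = json.loads(get(build_lookup_url(isbn)))
    return [item["url"] for item in payload.get("results", []) if "url" in item]
```

Compared with scraping, there is no HTML parsing at all: the structured response already carries the field you want, so a loop over a list of ISBNs is only a few lines.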
Re^4: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 04, 2012 at 18:01 UTC