
web scraper testing

by alienhuman (Pilgrim)
on Apr 11, 2006 at 20:52 UTC (#542671=perlquestion)
alienhuman has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Monks,

I've got a web scraping script. It works fine, except it's a PITA to test the program logic, because the conditions under which it scrapes an external web site only happen for about an hour a day (the information being scraped is time sensitive).

In order for me to test the script's logic outside of that one hour a day, I have to fake:

  • successful login to the site
  • successful scrape (with usable data)
  • query to my DB (assembled programmatically based on scrape)
  • successful POST to site

I currently accomplish this by setting a "TEST" flag in my code and, at certain junctures, checking it and running different code if I'm testing. There are also some bits of code that I just comment/uncomment during tests. I'd like to rewrite the package that contains my object and its methods so that when the object is created as a "test", the usual methods will not actually scrape the external site, query the DB, etc. Instead they'll run under simulated conditions, so that I can test program logic at other times of the day.

Any thoughts on how, generally, to think about writing code to handle this kind of thing?

Thanks in advance,


Using perl 5.8.6 unless otherwise noted. Apache/2.0.54 unless otherwise noted. Fedora Core 4 (2.6.11-1.1369_FC4) unless otherwise noted.

Replies are listed 'Best First'.
Re: web scraper testing
by roboticus (Chancellor) on Apr 12, 2006 at 00:10 UTC
    I'd suggest that you break down the task into a couple of different functions: One would do the login junk and grab the HTML blob. Another would parse the HTML blob and return a field list or SQL statement string or some such.

    Armed thusly, you can then write a simple test module that calls your HTML blob handler with different HTML blobs and verifies that the correct junk is returned. You can also use the first function in a small capture script that grabs a set of screens and writes their HTML out to test files (suitable for feeding to that test module!).

    Divide et impera!
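
A minimal sketch of the split roboticus describes, assuming a made-up table layout: parse_blob, build_query, the markup and the column names are all hypothetical, not taken from the original script. The point is only that the parsing and query-building steps take plain strings, so a saved HTML blob can stand in for the live site at any time of day.

```perl
use strict;
use warnings;

# Hypothetical parser: pulls (name, price) pairs out of table rows.
# The markup and field names are invented for illustration.
sub parse_blob {
    my ($html) = @_;
    my @rows;
    while ($html =~ m{<tr>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*</tr>}g) {
        push @rows, { name => $1, price => $2 };
    }
    return \@rows;
}

# Hypothetical query builder: returns SQL with placeholders plus the bind
# values, so the logic can be checked without touching a database.
sub build_query {
    my ($row) = @_;
    return ('SELECT id FROM items WHERE name = ? AND price <= ?',
            $row->{name}, $row->{price});
}

# A saved blob stands in for the live site.
my $saved_html = <<'HTML';
<table>
<tr><td>widget</td><td>9.99</td></tr>
<tr><td>gadget</td><td>19.50</td></tr>
</table>
HTML

my $rows = parse_blob($saved_html);
my ($sql, @bind) = build_query($rows->[0]);
print scalar(@$rows), " rows; first query binds: @bind\n";
```

The login-and-fetch function stays separate and is the only piece that needs the real site; the blobs it captures can be saved to files and fed to parse_blob later.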


Re: web scraper testing
by eXile (Priest) on Apr 12, 2006 at 03:27 UTC
    You could create some web pages somewhere that mimic the various situations you want to scrape, and use these as your test cases.
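
One way to plug such mimic pages into the object from the original question, as a rough sketch only: the Scraper package, its fetch argument, the status method and the page contents below are all assumptions made for illustration. The object is handed a code ref that does the fetching, so a test can pass one that returns canned pages instead of hitting the site, and no "TEST" flag is needed inside the methods.

```perl
use strict;
use warnings;

package Scraper;

# 'fetch' is a code ref taking a page name and returning HTML. Production
# code would pass one that talks to the real site; tests pass a fake.
sub new {
    my ($class, %args) = @_;
    die "fetch code ref required" unless ref($args{fetch}) eq 'CODE';
    return bless { fetch => $args{fetch} }, $class;
}

# Program logic under test: decide what to do based on the page content.
sub status {
    my ($self, $page) = @_;
    my $html = $self->{fetch}->($page);
    return 'login_failed' if $html =~ /Please log in/;
    return 'no_data'      unless $html =~ /<td>/;
    return 'ok';
}

package main;

# Canned pages mimicking the situations to be exercised (invented content).
my %mimic = (
    open_hour  => '<table><tr><td>widget</td></tr></table>',
    closed     => '<p>Nothing available right now.</p>',
    logged_out => '<p>Please log in</p>',
);

my $scraper = Scraper->new(fetch => sub { $mimic{ $_[0] } });
print "$_ => ", $scraper->status($_), "\n" for sort keys %mimic;
```

The same idea extends to the DB query and the POST: each external step becomes something passed in at construction time, and the test supplies a stand-in.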
Re: web scraper testing
by planetscape (Chancellor) on Apr 13, 2006 at 04:37 UTC

      Thanks, just what I was looking for.


