To mechanize WWW::Mechanize: a scraping language?
by johnnywang (Priest) on Aug 25, 2004 at 19:33 UTC
Before I heard about WWW::Mechanize, LWP was my favorite module set. I did lots of website scraping with it, mostly for fun (e.g., reading Yahoo Finance stock message boards in the bubble years, getting stats on eBay, etc.). Now I use WWW::Mechanize, which, although a subclass of LWP::UserAgent, is much easier to use. I use it mainly for testing web applications with Test::More and Test::DatabaseRow, and it works great.
In my LWP days, I always wished for a way to describe a scraping job in a file and have a general Perl script execute that description, rather than writing new code for each case. I never pursued it. Recently I started thinking about it again, now armed with WWW::Mechanize.
What I want is to be able to describe a sequence of scraping steps as, for example:
A driver program then parses this and takes the appropriate actions. The advantage is, at the least, to avoid coding each case, and also to let a non-Perl or non-programmer user do scraping. The following is a very preliminary start (e.g., many commands are hardcoded). I'm posting it here first to see whether something like this already exists, and to seek your advice and comments. For example, XML doesn't seem to be the right language here, since scraping is not usually hierarchical; I'm using XML only to avoid writing my own parser.
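To make the "parse, then take the appropriate actions" idea concrete, here is a minimal sketch of the dispatch step only. It is not the driver or the CmdHandler.pm from this post. It assumes the description has already been parsed into a list of steps, and the command names (get, follow, submit), the argument keys, the %HANDLERS table, run_steps and the RecordingAgent stand-in are all hypothetical names chosen for illustration. The three methods it calls (get, follow_link, submit_form) do exist on WWW::Mechanize, so a real WWW::Mechanize object could presumably be passed in place of the stand-in.

```perl
use strict;
use warnings;

# Hypothetical command table: maps a command name from the description
# to a code ref that calls the matching method on the agent.
my %HANDLERS = (
    get    => sub { my ($agent, $args) = @_; $agent->get($args->{url}) },
    follow => sub { my ($agent, $args) = @_; $agent->follow_link(text => $args->{text}) },
    submit => sub {
        my ($agent, $args) = @_;
        $agent->submit_form(form_name => $args->{form}, fields => $args->{fields});
    },
);

# Run each step in order; die on a command the table does not know.
# Returns the number of steps executed.
sub run_steps {
    my ($agent, $steps) = @_;
    for my $step (@$steps) {
        my $handler = $HANDLERS{ $step->{cmd} }
            or die "Unknown command: $step->{cmd}\n";
        $handler->($agent, $step->{args} || {});
    }
    return scalar @$steps;
}

# Offline stand-in for the agent (an assumption for this sketch):
# it records each call instead of touching the network.
package RecordingAgent;
sub new         { return bless { log => [] }, shift }
sub get         { my ($self, $url) = @_; push @{ $self->{log} }, "get $url"; return }
sub follow_link { my ($self, %o) = @_; push @{ $self->{log} }, "follow_link $o{text}"; return }
sub submit_form { my ($self, %o) = @_; push @{ $self->{log} }, "submit_form $o{form_name}"; return }

package main;

# Steps as a parser might hand them over; URL and form names are made up.
my @steps = (
    { cmd => 'get',    args => { url  => 'http://example.com/' } },
    { cmd => 'follow', args => { text => 'Login' } },
    { cmd => 'submit', args => { form => 'login', fields => { user => 'me' } } },
);

my $agent = RecordingAgent->new;
run_steps($agent, \@steps);
print "$_\n" for @{ $agent->{log} };
```

Keeping the commands in a table rather than an if/elsif chain means a new command is one more entry, which may help with the "many commands hardcoded" concern. It also keeps the step list flat, so the description format need not be hierarchical.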
My simple driver program is as follows:
Where the CmdHandler.pm is as follows: