Parsing webpages
by tel2 (Monk)
on Jan 28, 2013 at 03:14 UTC

tel2 has asked for the wisdom of the Perl Monks concerning the following question:
I want to parse feedback webpages using Perl, extracting some key fields, from an auction site which is a bit like eBay. Here's some sample input data:
As you can see from the right-hand side of this webpage, this member has 4 pages of feedback, and I plan to iterate over each of them. To do that, I intend to change the "&page=1" part of the URL from 1 to 4, and I can handle that part.
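The page-iteration part might look like the sketch below. The URL here is entirely made up (the real member ID and query string will differ), and the LWP::UserAgent fetch is shown in a comment rather than run:

```perl
use strict;
use warnings;

# Hypothetical feedback URL with a %d placeholder for the page number;
# substitute the real site's URL and member ID here.
my $base = 'http://www.example-auctions.com/feedback.php?member=12345&page=%d';

# Build the URL for a given page number.
sub page_url {
    my ($page) = @_;
    return sprintf $base, $page;
}

# In the real script you would fetch each page, e.g. with LWP::UserAgent:
#   use LWP::UserAgent;
#   my $ua = LWP::UserAgent->new;
#   for my $page (1 .. 4) {
#       my $resp = $ua->get( page_url($page) );
#       die $resp->status_line unless $resp->is_success;
#       my $html = $resp->decoded_content;
#       # ... parse $html here ...
#   }
print page_url($_), "\n" for 1 .. 4;
```

The number of pages (4) is hard-coded above for illustration; in practice you would read it from the pager links on the first page.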
The fields I want to extract are:
I'll be combining the above fields into records, which might initially look something like this:
What I want to know is:
1. What is a good way to parse the HTML pages? I know there are modules like HTML::Parser, HTML::PullParser, HTML::TreeBuilder, etc, but I haven't really used them, and don't want to waste time working out which is best for this kind of task, so if you have experience with them, I'd appreciate your advice. (I know this could be done without such a module, too, and in the past that's how I've usually done it, but this webpage layout is quite complex, so maybe a module will make life much easier.)
2. I don't need a full script, but I'd like to see the part which finds/extracts the above-mentioned key fields, please. Ideally this should include extracting the list of page numbers between the ">>" & "<<" signs, which I could fairly easily do without a module, but I'm interested to see how to do it with one.
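For question 2, here is a rough sketch of the extraction step using HTML::TreeBuilder (a CPAN module, not core Perl). Since the real sample data isn't shown here, the HTML fragment, tag names, and class names (`feedback`, `user`, `comment`, `pager`) are all invented stand-ins; the `look_down` calls would need to match the site's actual markup:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up fragment standing in for the real feedback page.
my $html = <<'HTML';
<table>
  <tr class="feedback"><td class="user">alice</td><td class="comment">Great trader</td></tr>
  <tr class="feedback"><td class="user">bob</td><td class="comment">Fast payment</td></tr>
</table>
<div class="pager">&lt;&lt; <a href="?page=1">1</a> <a href="?page=2">2</a> &gt;&gt;</div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# One record per feedback row: find each matching <tr>, then pull
# out the cells we care about by class name.
my @records;
for my $row ( $tree->look_down( _tag => 'tr', class => 'feedback' ) ) {
    my $user    = $row->look_down( _tag => 'td', class => 'user' );
    my $comment = $row->look_down( _tag => 'td', class => 'comment' );
    push @records, { user => $user->as_text, comment => $comment->as_text };
}

# The page numbers sit inside the pager block, between the "<<" and
# ">>" markers, as the text of the <a> links.
my @pages = map { $_->as_text }
            $tree->look_down( _tag => 'div', class => 'pager' )
                 ->look_down( _tag => 'a' );

$tree->delete;    # HTML::TreeBuilder trees must be freed explicitly

print "$_->{user}: $_->{comment}\n" for @records;
print "pages: @pages\n";
```

The same shape works for any fields: locate a container element with `look_down`, then call `look_down` again on it to narrow in, and `as_text` to get the visible text. Once `@pages` is populated, it replaces the hard-coded 1..4 loop for the pagination.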