http://www.perlmonks.org?node_id=1015621

tel2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I want to use Perl to parse feedback pages from an auction site that is a bit like eBay, extracting some key fields. Here's a sample input page:
  www.trademe.co.nz/Members/Feedback.aspx?member=100000&type=&page=1

As you can see on the right-hand side of this page, this member has 4 pages of feedback, and I plan to iterate over each of them. To do that, I intend to step the page number in the "&page=1" part of the URL from 1 through 4. I can handle that part.
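For what it's worth, here is a minimal sketch of the page loop I have in mind. The "http://" scheme is my assumption (the sample URL above doesn't show one), and the member ID and page count are hard-coded to this one example; in the real script the page count would come from the parsed page.

```perl
use strict;
use warnings;

# Assumption: plain http:// scheme; the sample URL above omits it.
my $base      = 'http://www.trademe.co.nz/Members/Feedback.aspx';
my $member    = 100000;   # example member from the sample URL
my $last_page = 4;        # hard-coded here; would normally be read from the page

# Build one URL per feedback page by varying only the "page" parameter.
my @urls = map { "$base?member=$member&type=&page=$_" } 1 .. $last_page;

print "$_\n" for @urls;
```

Each URL would then be fetched and handed to whatever parsing code is recommended.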

The fields I want to extract are:
- Type of feedback. I'll start by just taking the name of the gif file displayed, i.e. "happy_face1.gif", "neutral_face1.gif" or "sad_face1.gif".
- Name of member giving feedback (e.g. "david_parmenter").
- Date of feedback (e.g. "12 Nov 2012").
- Feedback comment (e.g. "good trade").
- Auction number (e.g. "523912464").

I'll be combining the above fields into records, which might initially look something like this:
sad_face1.gif|david_parmenter|12 Nov 2012|good trade|523912464
I'll collect the records from every feedback page for that member into a single file.
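Building a record from the extracted fields is the easy part. A rough sketch, using the example values from the field list above, and assuming no field ever contains a literal "|" (a comment could, in which case it would need escaping or stripping first):

```perl
use strict;
use warnings;

# Example values taken from the field list above, in record order:
# feedback type, member name, date, comment, auction number.
my @fields = (
    'sad_face1.gif',
    'david_parmenter',
    '12 Nov 2012',
    'good trade',
    '523912464',
);

# Assumption: none of the fields contains the "|" delimiter.
my $record = join '|', @fields;

print "$record\n";
```

Each record would be printed to the output file, one per line.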

What I want to know is:

1. What is a good way to parse the HTML pages? I know there are modules like HTML::Parser, HTML::PullParser, HTML::TreeBuilder, etc., but I haven't really used them and don't want to waste time working out which is best for this kind of task, so if you have experience with them, I'd appreciate your advice. (I know this could be done without a module too, and that's how I've usually done it in the past, but this page's layout is quite complex, so a module may make life much easier.)

2. I don't need a full script, but I'd like to see the part that finds and extracts the key fields listed above, please. Ideally this would include extracting the list of page numbers between the ">>" & "<<" signs, which I could pretty easily do without a module, but I'm interested to see how to do it with one.

Thanks. tel2