PerlMonks  

Parsing webpages

by tel2 (Scribe)
on Jan 28, 2013 at 03:14 UTC ( #1015621=perlquestion )
tel2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I want to parse feedback webpages from an auction site that's a bit like eBay, extracting some key fields using Perl. Here's some sample input data:
  www.trademe.co.nz/Members/Feedback.aspx?member=100000&type=&page=1

As you can see from the right-hand side of this webpage, this member has 4 pages of feedback, and I plan to iterate over each of them. For that, I intend to vary the "&page=1" part of the URL from 1 to 4, and I can handle that part.
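A minimal sketch of that page loop, with the member ID and page count hard-coded for illustration (in practice the page count would be read from the pagination links on the first page):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical values for illustration; the real page count would be
# scraped from the pagination links on the first feedback page.
my $member = 100000;
my $pages  = 4;

# Build one URL per feedback page by varying the page parameter.
my @urls = map {
    "http://www.trademe.co.nz/Members/Feedback.aspx?member=$member&type=&page=$_"
} 1 .. $pages;

print "$_\n" for @urls;
```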

The fields I want to extract are:
- Type of feedback. I'll start by just taking the name of the gif file displayed, i.e. "happy_face1.gif", "neutral_face1.gif" or "sad_face1.gif".
- Name of member giving feedback (e.g. "david_parmenter").
- Date of feedback (e.g. "12 Nov 2012").
- Feedback comment (e.g. "good trade").
- Auction number (e.g. "523912464").

I'll be combining the above fields into records, which might initially look something like this:
sad_face1.gif|david_parmenter|12 Nov 2012|good trade|523912464
And I'll get all the records on each feedback page for that member, to make a file.
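Assembling one such record is just a join on "|". One wrinkle worth handling: a feedback comment could itself contain a "|", which would break the record format, so this sketch blanks any delimiter characters inside the fields first:

```perl
use strict;
use warnings;

# Sample field values from the post above.
my @fields = ('sad_face1.gif', 'david_parmenter', '12 Nov 2012',
              'good trade', '523912464');

# Guard against '|' appearing inside a field (e.g. in a comment),
# since it is also the record delimiter.
tr/|/ / for @fields;

my $record = join '|', @fields;
print "$record\n";
```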

What I want to know is:

1. What is a good way to parse the HTML pages? I know there are modules like HTML::Parser, HTML::PullParser and HTML::TreeBuilder, but I haven't really used them, and I don't want to waste time working out which is best for this kind of task, so if you have experience with them I'd appreciate your advice. (I know this could be done without a module too, and in the past that's how I've usually done it, but this page layout is quite complex, so maybe a module would make life much easier.)

2. I don't need a full script, but I'd like to see the part which finds/extracts the key fields mentioned above, please. Ideally this would include extracting the list of page numbers between the "<<" and ">>" signs, which I could do easily enough without a module, but I'm interested to see how to do it with one.
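For the kind of extraction described above, HTML::TreeBuilder (a CPAN module) with its look_down method is a reasonable fit. Here is a hedged sketch against a made-up fragment of markup: the real Trade Me page structure will differ, so the tag names and class attributes below are assumptions to be adjusted after inspecting the actual page source.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, not in core

# Invented sample markup standing in for one feedback row; check the
# real page structure with view-source before relying on these
# tag/class names.
my $html = <<'HTML';
<div class="feedback-row">
  <img src="/images/sad_face1.gif">
  <a class="member">david_parmenter</a>
  <span class="date">12 Nov 2012</span>
  <span class="comment">good trade</span>
  <a href="/Browse/Listing.aspx?id=523912464">view listing</a>
</div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @records;
for my $row ($tree->look_down(_tag => 'div', class => 'feedback-row')) {
    my ($img)  = $row->look_down(_tag => 'img', src => qr/_face1\.gif\z/);
    my ($face) = $img->attr('src') =~ m{([^/]+)\z};   # basename of the gif
    my $member  = $row->look_down(_tag => 'a',    class => 'member')->as_text;
    my $date    = $row->look_down(_tag => 'span', class => 'date')->as_text;
    my $comment = $row->look_down(_tag => 'span', class => 'comment')->as_text;
    my ($auction) = $row->look_down(_tag => 'a', href => qr/id=\d+/)
                        ->attr('href') =~ /id=(\d+)/;
    push @records, join '|', $face, $member, $date, $comment, $auction;
}
$tree->delete;   # free the parse tree

print "$_\n" for @records;
```

The same look_down approach would pick out the pagination links: match the anchor tags whose text is a page number and collect them into a list.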

Thanks. tel2

Re: Parsing webpages
by Anonymous Monk on Jan 28, 2013 at 04:47 UTC
      Thank you Anonymous Monk.
Re: Parsing webpages
by CountZero (Bishop) on Jan 28, 2013 at 07:29 UTC
    "Trade Me" has a published API and it will be much easier to use this API rather than scrape the site.

    Actually, using the API is the only authorised way to automate access to the website:

    4.1.c You may not use a robot, spider, scraper or other unauthorised automated means to access the Website or information featured on it for any purpose.
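    As a rough idea of what an API call could look like, here is a minimal sketch using only core modules (HTTP::Tiny and JSON::PP, both in core since Perl 5.14). The endpoint path is hypothetical; the real resource name, parameters and OAuth requirements must come from the published Trade Me API documentation, so the request itself is left commented out.

```perl
use strict;
use warnings;
use HTTP::Tiny;               # core since Perl 5.14
use JSON::PP qw(decode_json); # core since Perl 5.14

# Hypothetical endpoint path -- consult the Trade Me API docs for the
# real feedback resource and its authentication scheme.
my $member = 100000;
my $url    = "https://api.trademe.co.nz/v1/Members/$member/Feedback.json";

my $http = HTTP::Tiny->new(timeout => 10);

# Uncomment to perform the request (needs network access and, for
# most resources, OAuth credentials in the headers):
# my $res = $http->get($url);
# die "GET failed: $res->{status}\n" unless $res->{success};
# my $data = decode_json($res->{content});

print "$url\n";
```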

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      Thanks CountZero,

      Good points.  Didn't realise that.

Node Type: perlquestion [id://1015621]
Approved by davido