When faced with this kind of task, a lot of Perl coders:
- see that HTML is not that hard, and figure on parsing it manually;
- find out that HTML is deceptive (or the person or process that writes the file writes lousy HTML) and figure on using a tool;
- discover HTML::Parser, read the doc and say "that's too hard!";
- go back to parsing it manually and come up with something that works
as long as nothing ever changes.
At least, that's how my co-workers and I did it once :)
So I'd suggest looking at HTML::Parser or one
of its relatives. I used HTML::TreeBuilder to parse
some quite large and unreliable HTML files and found that it
worked great. The tricky bit with HTML::Parser itself is learning
to code in the callback style it requires (HTML::TreeBuilder
builds a tree you can search instead), but you can get lots of
help on that here once you've started.
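
To give a feel for the two styles, here is a minimal sketch that pulls the same link out of some deliberately sloppy HTML, first with HTML::Parser callbacks and then with HTML::TreeBuilder. The sample markup and the variable names are made up for illustration, and it assumes both modules are installed from CPAN; treat it as a starting point rather than the one right way to do it.

```perl
use strict;
use warnings;
use HTML::Parser;
use HTML::TreeBuilder;

# Made-up sample input: unclosed <p> and <li> tags, no </body> or </html>.
my $html = <<'HTML';
<html><body>
<p>First paragraph
<p>Second paragraph with <a href="http://example.com/">a link</a>
<ul><li>one<li>two</ul>
HTML

# Callback style: HTML::Parser calls our handler for every start tag,
# and we decide what to keep as the events go by.
my @hrefs_cb;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            push @hrefs_cb, $attr->{href}
                if $tag eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);
$parser->parse($html);
$parser->eof;

# Tree style: HTML::TreeBuilder parses the whole document first,
# then we search the resulting tree for the elements we want.
my $tree       = HTML::TreeBuilder->new_from_content($html);
my @hrefs_tree = map { $_->attr('href') } $tree->look_down(_tag => 'a');
my @items      = map { $_->as_trimmed_text } $tree->look_down(_tag => 'li');

print "callback link: $_\n" for @hrefs_cb;
print "tree link:     $_\n" for @hrefs_tree;
print "list item:     $_\n" for @items;

# Free the tree explicitly; older versions need this to break
# the parent/child reference cycles.
$tree->delete;
```

The callback version keeps memory use low on big files because nothing is stored unless your handler stores it; the tree version is usually easier to write when you need to look at an element in the context of its parents or siblings.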