http://www.perlmonks.org?node_id=926718

pacohope has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I have 300 HTML pages in various states of HTML compliance. I'm basically trying to strip out all the header and footer junk and keep the middle of each document, crappy HTML and all.

Documents look something like this:

<html>
--stuff--
<head>
--more stuff--
</head>
<body>
--still more stuff--
<div class="myBody">
--all the stuff I want, which might include div tags, too--
</div>
--yet more stuff--
</body>
</html>

I've tried a few things. I know that XML::XPath and XML::XPath::XMLParser get me to the right place. I have an XPath expression that seems to work most of the time. The problem is that I want all the tags and everything--just as it currently is in the file. When I use methods like findvalue() or string_value(), I get just the text without the tags.

I tried HTML::TokeParser::Simple, but I wasn't sure how to do this. I'm hoping I don't have to write some loop that iterates over all the tags and text and prints them out bit by bit. I just want to say "keep everything from this point in the tree on down...".

Ideally, I want to do this without first fixing crappy, non-compliant HTML. I have lots of <p> tags that are used to separate paragraphs (instead of <p>foo</p>). I also have lots of <meta ... > tags instead of <meta... />. These unclosed tags tend to give XML parsers heartburn. I'll preprocess with tidy to make things tidy if I have to.

Update

I got a good enough result by using XML::XPath, XML::XPath::NodeSet, and XML::Parser. The trick seemed to be disentangling XML::Parser and XML::XPath. That is, I needed to create my own parser object and pass it to XML::XPath. The entire script is 200 lines because of the vagaries of my specific input. But here's what I think is the salient bit that worked:

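use XML::Parser;
use XML::XPath;

# XPath to the div that holds the content I want
$m::xpath = '/html/body/table/tr/td/div';
# My own parser object: no LWP for external entities, entities left unexpanded, no namespace processing
my $parser = XML::Parser->new(
  'NoLWP' => 1,
  'NoExpand' => 1,
  'Namespaces' => 0);
# $inputfile is set elsewhere in the full script
my $XP = XML::XPath->new( filename => $inputfile, parser => $parser );
# Returns the matched nodes serialised as a string, tags included
my $body = $XP->findnodes_as_string($m::xpath);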

I ended up cheating because I discovered that the XPath expression above gets me the right div. There was a bit more uniformity on the pages (at least the pages I cared about) than I realised.
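
For anyone comparing this with the text-only methods mentioned in the question, here is a minimal sketch of the difference on a small sample. The inline document, the class-based XPath expression, and the variable names are made up for illustration and are not from my real script. The sample is deliberately well-formed, so it sidesteps the unclosed-tag problem entirely.

```perl
use strict;
use warnings;
use XML::Parser;
use XML::XPath;

# Hypothetical, well-formed sample input (real pages may need tidying first).
my $xml = <<'END';
<html><body>
<div class="myBody"><p>Hello <b>world</b></p><div>nested</div></div>
</body></html>
END

# Same parser options as in the snippet above.
my $parser = XML::Parser->new(NoLWP => 1, NoExpand => 1, Namespaces => 0);
my $xp     = XML::XPath->new(xml => $xml, parser => $parser);

# Illustrative path; the real pages needed a different expression.
my $path = '/html/body/div[@class="myBody"]';

# findvalue() gives only the text content of the match, without tags.
my $text = $xp->findvalue($path);

# findnodes_as_string() serialises the matched nodes, tags included.
my $markup = $xp->findnodes_as_string($path);

print "text:   $text\n";
print "markup: $markup\n";
```

The second line printed should still contain the nested tags, which is the behaviour I was after.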

Thanks for all the suggestions.