Re: Recommendation on a module for HTML/XML extraction.

I just need to extract data within certain "class"es, regardless of the tag.

Have a look at HTML::TreeBuilder::XPath. Once you get to know XPath you'll never look back. This should work for your sample data (slightly modified):

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = q|<div class="message reply">
<span class="profile fn">Person Name</span>
<span class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
<abbr class="time published" title="2013-03-17T21:37:16+0000">March 17, 2013 at 3:37 pm</abbr>
<div class="msgbody">Message body here.</div>
</div>|;

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

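# '*' matches any tag; the predicate keeps elements whose class
# attribute is exactly the string "time published".
my @nodes = $tree->findnodes('//*[@class="time published"]');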

for my $node ( @nodes ) {
    print $node->attr('title'), "\n";
    print $node->as_text, "\n";
}

Output:

2012-03-14T21:37:16+0000
March 14, 2012 at 3:37 pm
2013-03-17T21:37:16+0000
March 17, 2013 at 3:37 pm
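
One caveat with the query above: @class="time published" compares the whole attribute string, so it would miss class="published time" or class="time published updated". If your real pages vary like that, a token-based test may be more robust. What follows is only a sketch using the usual XPath 1.0 idiom (pad the class list with spaces, then look for " token "); the sample markup, including the "unpublished" span, is made up for illustration and is not from your data.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup: the class tokens come in a different order and
# alongside an extra class, so an exact @class comparison would miss them.
my $html = q|<div class="message reply">
<span class="published time" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
<abbr class="time published updated" title="2013-03-17T21:37:16+0000">March 17, 2013 at 3:37 pm</abbr>
<span class="unpublished">draft</span>
</div>|;

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Pad the normalized class list with spaces so a single token can be
# matched as " token "; this avoids false hits such as "unpublished".
my $xpath = '//*[contains(concat(" ", normalize-space(@class), " "), " published ")]';

# Collect the title attribute of every matching element, whatever its tag.
my @titles = map { $_->attr('title') } $tree->findnodes($xpath);
print "$_\n" for @titles;

$tree->delete;    # free the parse tree explicitly
```

To require several classes at once, you can chain one such predicate per token, e.g. [...' time '...][...' published '...].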


