Re: Recommendation on a module for HTML/XML extraction.

by tangent (Parson)
on Aug 16, 2015 at 13:03 UTC ( [id://1138739]=note )


in reply to Recommendation on a module for HTML/XML extraction.

I just need to extract data within certain "class"es, regardless of the tag.
Have a look at HTML::TreeBuilder::XPath. Once you get to know XPath you'll never look back. This should work for your sample data (slightly modified):
use HTML::TreeBuilder::XPath;

my $html = q|<div class="message reply">
  <span class="profile fn">Person Name</span>
  <span class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
  <abbr class="time published" title="2013-03-17T21:37:16+0000">March 17, 2013 at 3:37 pm</abbr>
  <div class="msgbody">Message body here.</div>
</div>|;

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my @nodes = $tree->findnodes('//*[@class="time published"]');
for my $node ( @nodes ) {
    print $node->attr('title'), "\n";
    print $node->as_text, "\n";
}
Output:
2012-03-14T21:37:16+0000
March 14, 2012 at 3:37 pm
2013-03-17T21:37:16+0000
March 17, 2013 at 3:37 pm
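
One caveat: @class="time published" is an exact string comparison, so it would miss class="published time" or an element carrying an extra class. If you want to match on a single class token instead, the usual XPath 1.0 idiom is to pad the attribute with spaces and use contains(). A minimal sketch follows; extract_by_class is a hypothetical helper name of mine, not part of the module, and it assumes the class name you pass in contains no quote characters.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not part of HTML::TreeBuilder::XPath): returns a list
# of [title, text] pairs, one per element whose class attribute contains the
# given class token, regardless of tag name or class order.
# Assumption: $class contains no quote characters.
sub extract_by_class {
    my ($html, $class) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

    # Padding with spaces stops "time" from matching "timeline".
    my $xpath = sprintf(
        q{//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]},
        $class,
    );

    my @rows = map { [ $_->attr('title'), $_->as_text ] }
               $tree->findnodes($xpath);

    $tree->delete;    # release the parse tree once we have plain data
    return @rows;
}
```

Usage would be something like: for my $row ( extract_by_class($html, 'published') ) { print "$row->[0]\n$row->[1]\n" }. Note that attr('title') is undef for elements that have no title attribute, so check for that if you match broader classes.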
