PerlMonks  

Locate large HTML paragraphs with XML::LibXML

by merlyn (Sage)
on Sep 11, 2005 at 23:11 UTC ( [id://491115]=CUFP )

Inspired by Re: Extracting paragraphs from html, here's a bit of XML::LibXML code to fetch a web page and dump out all the large paragraphs.
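use XML::LibXML;

my $p = XML::LibXML->new;
$p->recover(1);    # keep parsing when the markup is malformed

my $d = do {
    # silence anything written to STDOUT/STDERR while fetching and parsing
    local *STDOUT;
    local *STDERR;
    open STDOUT, ">/dev/null";
    open STDERR, ">/dev/null";
    $p->parse_html_file("http://www.example.com/some/url");
};

# print every text node longer than 100 characters
for my $p ($d->findnodes(q{//text()[string-length() > 100]})) {
    print $p->toString;
}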
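To try the XPath filter without fetching anything over the network, the same idea can be sketched against an in-memory string. This is only an illustration: the sample markup, the `$long` filler text, and the `large_text_nodes` helper are made up for this example and are not part of the original post. It uses `parse_html_string` in place of `parse_html_file`.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data: one short paragraph and one long one (143 characters).
my $long = join ' ', ('lorem ipsum') x 12;
my $html = "<html><body><p>Short one.</p><p>$long</p></body></html>";

my $parser = XML::LibXML->new;
$parser->recover(1);    # same setting as the original snippet

my $doc = $parser->parse_html_string($html);

# Hypothetical helper: return the text nodes whose length exceeds $min characters.
sub large_text_nodes {
    my ($doc, $min) = @_;
    return $doc->findnodes(sprintf '//text()[string-length() > %d]', $min);
}

# Print the raw text of each matching node, one per line.
for my $node (large_text_nodes($doc, 100)) {
    print $node->data, "\n";
}
```

Note that the filter matches text nodes, not `<p>` elements, so a paragraph broken up by inline markup such as `<b>` or `<a>` is measured piece by piece. If whole paragraphs are what you are after, an expression along the lines of `//p[string-length(normalize-space()) > 100]` may be closer, though that is an assumption about the intent rather than something the original code does.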

Replies are listed 'Best First'.
Re: Locate large HTML paragraphs with XML::LibXML
by Errto (Vicar) on Sep 12, 2005 at 14:58 UTC
    I've become a pretty big fan of XML::LibXML, but I haven't tried it on not-well-formed-XML content before. With the recover option, how much of the funniness in typical HTML can it handle (missing end tags, unquoted attribute values, singleton tags like <br>)?
