
Locate large HTML paragraphs with XML::LibXML

by merlyn (Sage)
on Sep 11, 2005 at 23:11 UTC (#491115, snippet)

Description: Inspired by "Re: Extracting paragraphs from html", here's a bit of XML::LibXML code to fetch a web page and dump out all the large paragraphs.
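use XML::LibXML;

# The URL in the original listing did not survive; taking it from the command line is an assumption.
my $url = shift or die "usage: $0 URL\n";

my $p = XML::LibXML->new;
$p->recover(1);  # recover mode: keep parsing past malformed HTML
my $d = do {
  # Silence parser chatter for the duration of the parse only.
  local *STDOUT;
  local *STDERR;
  open STDOUT, ">/dev/null";
  open STDERR, ">/dev/null";
  $p->parse_html_file($url);
};
# Any text node longer than 100 characters counts as a "large paragraph".
for my $node ($d->findnodes(q{//text()[string-length() > 100]})) {
  print $node->toString;
}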
Re: Locate large HTML paragraphs with XML::LibXML
by Errto (Vicar) on Sep 12, 2005 at 14:58 UTC
    I've become a pretty big fan of XML::LibXML, but I haven't tried it on not-well-formed-XML content before. With the recover option, how much of the funniness in typical HTML can it handle (missing end tags, unquoted attribute values, singleton tags like <br>)?
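One way to probe that question is a quick experiment rather than a guess. The sketch below is not from the original snippet: `paragraph_texts` is a hypothetical helper, the sample markup is made up, and it assumes an XML::LibXML recent enough (1.70 or later) to accept parser options as the second argument to `parse_html_string`. It feeds the HTML parser a string with an unclosed `<p>`, an unquoted attribute value, and a bare `<br>`, then prints what ends up in each paragraph.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse an HTML string in recover mode and
# return the text content of every <p> element that was found.
sub paragraph_texts {
  my ($html) = @_;
  my $parser = XML::LibXML->new;
  # recover => 2 requests silent recovery, so the /dev/null
  # redirection used in the snippet is not needed here.
  my $doc = $parser->parse_html_string($html, { recover => 2 });
  return map { $_->textContent } $doc->findnodes('//p');
}

# Deliberately sloppy input: unclosed <p>, unquoted href, bare <br>.
my $messy = '<html><body><p>first para'
          . '<p>second <a href=x.html>link</a><br>tail'
          . '</body></html>';

print "[$_]\n" for paragraph_texts($messy);
```

Swapping in other kinds of breakage (stray close tags, misnested inline elements) and rerunning is a cheap way to see where recovery stops giving a sensible tree for a particular libxml2 build.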
