Locate large HTML paragraphs with XML::LibXML


by merlyn (Sage)

on Sep 11, 2005 at 23:11 UTC ( [id://491115]=CUFP )


Inspired by Re: Extracting paragraphs from html, here's a bit of XML::LibXML code to fetch a web page and dump out all the large paragraphs.

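use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
$parser->recover(1);    # keep parsing past malformed markup instead of dying

# Recovery mode is noisy, so silence both standard handles while parsing;
# "local" restores them as soon as the do-block ends.
my $doc = do {
  local *STDOUT;
  local *STDERR;
  open STDOUT, '>', '/dev/null' or die "Cannot redirect STDOUT: $!";
  open STDERR, '>', '/dev/null' or die "Cannot redirect STDERR: $!";
  $parser->parse_html_file("http://www.example.com/some/url");
};

# Print every text node longer than 100 characters.
for my $text ($doc->findnodes(q{//text()[string-length() > 100]})) {
  print $text->toString;
}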
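
Note that the XPath above selects individual text nodes, so a paragraph broken up by inline markup such as `<b>` is measured piece by piece. Here is a minimal sketch of a paragraph-level variant, not the code from the post: it assumes the content of interest sits in `<p>` elements, measures each element's whitespace-normalized text, and parses a made-up inline sample (`$html`, `$long`) so it runs without network access. `recover_silently` is used in place of redirecting the standard handles.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, inlined so the sketch needs no network access.
my $long = join ' ', ('lorem ipsum dolor sit amet') x 6;
my $html = <<"HTML";
<html><body>
<p>Short one.</p>
<p>$long with <b>inline markup</b> in the middle.</p>
</body></html>
HTML

my $parser = XML::LibXML->new;
$parser->recover_silently(1);    # recover from bad markup without warnings

my $doc = $parser->parse_html_string($html);

# Select <p> elements by the length of their whole text content,
# so inline children such as <b> do not split the measurement.
my @big = $doc->findnodes('//p[string-length(normalize-space(.)) > 100]');

for my $para (@big) {
    print $para->textContent, "\n";
}
```

Swap `parse_html_string($html)` for `parse_html_file($url)` to run the same query against a fetched page, and adjust the threshold of 100 to taste.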

Re: Locate large HTML paragraphs with XML::LibXML by Errto (Vicar) on Sep 12, 2005 at 14:58 UTC
I've become a pretty big fan of XML::LibXML, but I haven't tried it on not-well-formed-XML content before. With the `recover` option, how much of the funniness in typical HTML can it handle (missing end tags, unquoted attribute values, singleton tags like `<br>`)?
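
One way to probe this is to feed the parser a small deliberately sloppy sample and inspect the resulting tree. The sketch below is only an experiment under stated assumptions: the `$messy` string is invented for illustration, and it exercises just the three cases named in the question. With libxml2's HTML parser these cases should come through as two `<p>` elements, one `<br>`, and an `href` of `foo.html`, but results on real-world pages may vary.

```perl
use strict;
use warnings;
use XML::LibXML;

# Invented sample: unclosed <p> tags, an unquoted attribute value, a bare <br>.
my $messy = <<'HTML';
<html><body>
<p>first paragraph, never closed
<p>second paragraph with a <a href=foo.html>link</a><br>and a bare line break
</body></html>
HTML

my $parser = XML::LibXML->new;
$parser->recover_silently(1);    # recover from bad markup without warnings

my $doc = $parser->parse_html_string($messy);

# Count what the parser reconstructed from the sloppy input.
my $p_count  = () = $doc->findnodes('//p');
my $br_count = () = $doc->findnodes('//br');
my $href     = $doc->findvalue('//a/@href');

print "paragraphs: $p_count\n";
print "line breaks: $br_count\n";
print "href: $href\n";

# Dump the repaired markup to see how the tree was closed up.
print $doc->toStringHTML;
```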
