Re: HTML::Parser example wanted...

by andreychek (Parson)
on Jun 26, 2001 at 19:19 UTC

in reply to HTML::Parser example wanted...

Actually, there are a bunch of examples that come with the HTML::Parser module, found in the "eg" directory. Taking the code from there, here is an example of how to extract all the plain text from an HTML document:
    #!/usr/bin/perl -w

    # Extract all plain text from an HTML file

    use strict;
    use HTML::Parser 3.00 ();

    my %inside;

    sub tag
    {
        my($tag, $num) = @_;
        $inside{$tag} += $num;
        print " ";  # not for all tags
    }

    sub text
    {
        return if $inside{script} || $inside{style};
        print $_[0];
    }

    HTML::Parser->new(api_version => 3,
                      handlers    => [start => [\&tag, "tagname, '+1'"],
                                      end   => [\&tag, "tagname, '-1'"],
                                      text  => [\&text, "dtext"],
                                     ],
                      marked_sections => 1,
        )->parse_file(shift) || die "Can't open file: $!\n";
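That code is located in eg/htext. After taking a look, you can see that it is event driven. The HTML::Parser->new call has an option in it called "handlers", which tells HTML::Parser which function to call for each kind of event (start tag, end tag, or text). In this case, every start tag calls the function "tag" with two arguments: "tagname", which is the actual tag name, and +1, which identifies it as a start tag. Every end tag calls the same function with -1 instead. The running count kept in %inside is what lets the "text" function skip anything that sits inside a script or style element.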
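To watch the handlers fire without needing a file on disk, here is a minimal sketch of the same idea that parses a string and collects the text in a variable instead of printing it. The helper name extract_text and the sample HTML are my own, not part of eg/htext, and I have used anonymous subs rather than the named ones above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser 3.00 ();

# Hypothetical helper (not from eg/htext): return the visible text
# of an HTML string, skipping script and style content.
sub extract_text {
    my ($html) = @_;
    my %inside;       # nesting depth per tag name
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            # 'tagname' asks the parser to pass the tag name as $_[0]
            start => [ sub { $inside{ $_[0] }++ }, 'tagname' ],
            end   => [ sub { $inside{ $_[0] }-- }, 'tagname' ],
            # 'dtext' asks for the text with entities decoded
            text  => [ sub {
                           return if $inside{script} || $inside{style};
                           $out .= $_[0];
                       }, 'dtext' ],
        ],
    );

    $p->parse($html);   # feed the document (can be called repeatedly)
    $p->eof;            # signal end of input so pending text is flushed
    return $out;
}

print extract_text('<p>Hello <b>world</b><script>var x = 1;</script></p>'), "\n";
```

The string after each code reference is the argspec: it decides what the handler receives in @_, which is why the original can pass the literal '+1' or '-1' alongside the tag name.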

Personally, I have had more luck with HTML::TokeParser, but I'm sure that isn't the case for everyone. I find that HTML::TokeParser is a bit more intuitive for this sort of job, but that is perhaps just the way I think... or maybe I just wasn't using HTML::Parser right ;-) In any case, good luck.
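For comparison, here is a rough sketch of the same job with HTML::TokeParser, where you pull tokens in a loop instead of registering callbacks. The helper name extract_text_tokens and the sample input are made up for illustration, and this is only one way to write it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser ();
use HTML::Entities qw(decode_entities);

# Hypothetical helper: same result as the handler version, but
# written as a plain loop over tokens.
sub extract_text_tokens {
    my ($html) = @_;

    # A reference to a scalar is treated as the literal document.
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't create parser: $!\n";

    my $out  = '';
    my $skip = 0;    # greater than 0 while inside script or style

    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' || $type eq 'E') {
            # Start tokens are ["S", $tag, ...], end tokens ["E", $tag, ...]
            my $tag = $token->[1];
            next unless $tag eq 'script' || $tag eq 'style';
            $skip += ($type eq 'S') ? 1 : -1;
        }
        elsif ($type eq 'T' && !$skip) {
            # Text tokens are ["T", $text, $is_data]; the text is raw,
            # so decode entities unless it is literal data.
            my ($text, $is_data) = @{$token}[1, 2];
            $out .= $is_data ? $text : decode_entities($text);
        }
    }
    return $out;
}

print extract_text_tokens('<p>Tom &amp; Jerry<style>p { color: red }</style></p>'), "\n";
```

The state that the handler version keeps in %inside becomes an ordinary local variable here, which is most of why I find this style easier to follow.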
