How to strip HTML using latest module

by f0dder (Novice)
on Aug 23, 2001 at 02:37 UTC (#107175)
f0dder has asked for the wisdom of the Perl Monks concerning the following question:

I've been using HTML::FormatText to strip HTML tags. The web pages themselves are relatively simple, but if possible I would like to use the latest modules in case 'simple' becomes 'complex'. Going through the documentation, it appears HTML::Parser may be what I want.

But most of the HTML::Parser documentation and examples make no sense to me. Can someone post a small example that starts by fetching a $webpage (i.e. using LWP::Simple) and finishes by printing the stripped $webpage?

Replies are listed 'Best First'.
Re: How to strip HTML using latest module
by OeufMayo (Curate) on Aug 23, 2001 at 10:06 UTC

    Here's a version using the HTML::Parser v.2 interface:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple qw(get);
    use HTML::Parser;

    my $parser = Example->new();
    my $html = get("") or die "Cannot fetch the HTML\n";
    $parser->parse($html);
    $parser->eof;    # signal end of document so any buffered text is flushed

    package Example;
    use base qw(HTML::Parser);

    # Called for each run of text between tags
    sub text {
        my ($self, $text) = @_;
        print $text;
    }

    And here's the same script, but using the HTML::Parser version 3 interface. This one is easier to use because you generally don't have to make a new package to parse the html (though you can, if you really want to!).

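    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple qw(get);
    use HTML::Parser;

    my $html = get("") or die "Cannot fetch the HTML\n";

    # 'dtext' hands the callback the text with HTML entities decoded
    my $parser = HTML::Parser->new(
        text_h => [ sub { print shift }, 'dtext' ]
    );
    $parser->parse($html);
    $parser->eof;    # signal end of document so any buffered text is flushed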
    my $OeufMayo = new PerlMonger::Paris({http => ''});
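If the stripped text is wanted in a variable rather than printed as it is parsed, the version 3 interface can collect it in a closure. The following is only a minimal sketch: `strip_html` is a hypothetical helper name, not something from HTML::Parser, and the sample HTML string stands in for whatever `get` would return. It also assumes you want the contents of `<script>` and `<style>` dropped, which the scripts above do not do.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser): return the text of
# $html with tags removed and entities decoded.
sub strip_html {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the text with entities such as &amp; decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # Skip the contents of <script> and <style>, not just their tags.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;    # signal end of document so buffered text is flushed

    return $text;
}

# Sample input standing in for a page fetched with LWP::Simple's get().
print strip_html('<p>Fish &amp; <b>chips</b></p><script>var x = 1;</script>'), "\n";
```

Returning a string keeps the fetching, stripping, and printing steps separate, so the same helper can be reused on HTML from a file or from the network.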
      Sweet!!! Thank you. I tried both examples and they work. I now feel so giddy. I also just learned how to turn on autocomplete in the NT cmd shell. This allows bash-like autocomplete in both NT and W2k.

      In HKEY_CURRENT_USER\Software\Microsoft\Command Processor, change CompletionChar to 9.
Re: How to strip HTML using latest module
by f0dder (Novice) on Aug 23, 2001 at 03:30 UTC
    I found an example on the net that works.
    However, can someone post an example where you don't have to go through the intermediate step of reading from a file? That is, instead of $parser->parse_file($file), something like $parser->parse_text($html), where $html comes from an LWP::Simple get call. (parse_text is a name I made up.)
    Are there any alternatives to using parse_text?

    #!/usr/bin/perl -w
    package Example;
    use strict;

    require HTML::Parser;

    @Example::ISA = qw(HTML::Parser);
    my $parser = Example->new;
    $parser->parse_file('index.html');

    print $parser->{TEXT};

    # Called for each run of text; accumulate it on the parser object
    sub text
    {
        my ($self,$text) = @_;
        $self->{TEXT} .= $text;
    }
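For the string case asked about here, HTML::Parser's `parse` method takes the document (or a chunk of it) as a string, and `eof` marks the end of input. The following is a hedged sketch of the same subclass approach with `parse_file` swapped for `parse`; the literal `$html` is an assumption standing in for the result of an LWP::Simple `get` call, so the script runs without network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Example;
use base qw(HTML::Parser);

# Called by the parser for each run of text between tags.
sub text {
    my ($self, $text) = @_;
    $self->{TEXT} .= $text;
}

package main;

# Stand-in for a page fetched with LWP::Simple's get() (assumption).
my $html = '<html><body><h1>Title</h1><p>Some <i>text</i>.</p></body></html>';

my $parser = Example->new;
$parser->parse($html);    # parse() accepts the document as a string
$parser->eof;             # signal end of input so trailing text is flushed

print $parser->{TEXT}, "\n";
```

Because `parse` can be called repeatedly with successive chunks, the call to `eof` is what tells the parser that no more input is coming.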
