http://www.perlmonks.org?node_id=56678


in reply to How to get HTML::Parser to return a line of parsed text

Personally, I like japhy's answer, but since you are wanting to know how to do it with HTML::Parser, here is yet another way.
use strict;
use LWP::Simple;
use HTML::Parser;

# get the content of the web page
my $content = get("http://www.google.com/");
defined $content or die "could not fetch the page\n";

# instantiate a new parser and let it crunch our data
my $parser = MyParser->new;
$parser->parse($content);
$parser->eof;    # tell the parser there is no more input

{
    package MyParser;
    use base qw(HTML::Parser);

    # this method supplies the text, no tags :)
    sub text {
        my ($self, $origtext) = @_;
        print $origtext, "\n";
    }
}
Unfortunately, this is the OLD way to use HTML::Parser; I haven't learned the new way yet (bad jeffa!). But this should get you going.

UPDATE: If you want to store the contents in a variable, just add this near the top of the script, where the text subroutine can see it:

my $stripped_html; # or whatever you wanna call it
Then, inside the text subroutine replace the print line with:
$stripped_html .= $origtext;
I would recommend using an array instead, however:
my @stripped_html;
# and inside &text
push(@stripped_html, $origtext);
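Put together, the array version might look something like the sketch below. It is only an illustration: the inline $content string is a made-up stand-in for the fetched page so that it runs without a network connection, and the join at the end is just one way to inspect the result.

```perl
use strict;
use warnings;

# Hypothetical inline page, so the sketch runs without a network fetch.
my $content = '<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>';

my @stripped_html;    # filled in by MyParser::text below

{
    package MyParser;
    use base qw(HTML::Parser);

    # Called for each run of text between tags; collect it instead of printing.
    sub text {
        my ($self, $origtext) = @_;
        push @stripped_html, $origtext;
    }
}

my $parser = MyParser->new;
$parser->parse($content);
$parser->eof;    # tell the parser there is no more input

# Show the collected pieces, separated so the boundaries are visible.
print join('|', @stripped_html), "\n";
```

Keeping the pieces in an array means you can still join them into one string later, whereas a single concatenated string cannot easily be split back apart.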
UPDATE: UPDATE: just do what merlyn says :)

Jeff

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
F--F--F--F--F--F--F--F--
(the triplet paradiddle)

Re: Re: How to get HTML::Parser to return a line of parsed text
by merlyn (Sage) on Feb 06, 2001 at 20:20 UTC
    The new way would simply be:
    use HTML::Parser;
    use LWP::Simple;
    my $html = get "http://perltraining.stonehenge.com";
    HTML::Parser->new(text_h => [\my @accum, "text"])->parse($html);
    print map $_->[0], @accum;

    -- Randal L. Schwartz, Perl hacker


      These are probably extremely stupid questions, but I couldn't find answers to them - I only have merlyn's llama book on my desk; no camel yet.

      First of all, what does the backslash do in \my @accum?

      Secondly, what is the 'map' keyword, and what does the $_->[0] refer to?

      Apologies if these are too trivial...

        The backslash creates a reference to whatever follows it, and here you need to pass a reference to an array. I'd probably declare @array first and then use \@array, but merlyn has done both in one step (I didn't know you could do that - never too old to learn new stuff I suppose!)

        map is an operator that takes a block of code (or, as in merlyn's example, a single expression) and a list. It runs (and returns the results of running) that code once for each element in the list. Each element is aliased to $_ while the code runs.

        When you pass an array ref, instead of a sub ref, to HTML::Parser->new it will store all of the values that would have been parameters to the handler, in the array. In our case it's just one value ('text') but it could well be more. Because of that, the array is in fact an array of arrays (or, more accurately, an array of array references).

        Therefore each time the map block is called, $_ contains a reference to a one-element array and $_->[0] contains the value of the first (and only) element in that array. The whole map call returns the complete list of these elements, effectively flattening the array to one dimension.
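        A small stand-alone sketch of the same idea, with no parser involved: the sample strings are invented, and the push only imitates what the parser does with the array reference it is given.

```perl
use strict;
use warnings;

# \my @accum declares the array and takes a reference to it in one step.
my $ref = \my @accum;

# Imitate the parser: push one array reference per text event
# (made-up sample strings).
push @$ref, ["Hello "], ["world"], ["!"];

# Each $_ is a reference to a one-element array; $_->[0] is its element.
my @flat = map $_->[0], @accum;

print @flat, "\n";    # prints "Hello world!"
```

        Here @accum holds three array references, while @flat holds the three plain strings.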

        Does that make it clearer?

        --
        <http://www.dave.org.uk>

        "Perl makes the fun jobs fun
        and the boring jobs bearable" - me

        Just by way of handing you a fishing pole, note that for just about any function in Perl, perldoc -f [function_name] will tell you what it does and how to use it. You can also Search on this site to see other Q&A's. I also like to point folks to the Tutorials section, as much wisdom is contained therein.

        Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: Re: How to get HTML::Parser to return a line of parsed text
by donfreenut (Sexton) on Feb 06, 2001 at 20:13 UTC

    Unless I don't understand your code as well as I'd like (and I might not - a. I'm new at this and b. I'm using the NEW way :), this is what I'm already doing, more or less. The problem is that your script prints the plaintext, and I want the plaintext in a variable - however that can be accomplished.

    I tried this:

    my $HTMLParser = HTML::Parser->new(text_h => [sub {return shift;}, "text"], default_h => [""]);

    But that, of course, just returns the line of HTML, unparsed.

    Any further ideas, anyone?
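    One possible direction, offered as a sketch rather than a definitive answer: as far as I can tell, the parser does not hand a handler's return value back to the caller, so the handler has to store the text somewhere itself. A closure that appends to a variable is one way to do that. The inline $html string and the $plaintext name are made up for illustration, and "dtext" is used instead of "text" so that entities arrive decoded.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical inline snippet standing in for a fetched page.
my $html = '<p>Fish &amp; <i>chips</i></p>';

my $plaintext = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # "dtext" asks for text with entities decoded; the closure
    # appends each piece to $plaintext as it is reported.
    text_h => [ sub { $plaintext .= shift }, "dtext" ],
);
$parser->parse($html);
$parser->eof;    # tell the parser there is no more input

print "$plaintext\n";
```

    With no handler registered for tags, they are simply skipped, so only the text between them ends up in $plaintext.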