
How to get HTML::Parser to return a line of parsed text

by donfreenut (Sexton)
on Feb 06, 2001 at 19:49 UTC (#56674=perlquestion)

donfreenut has asked for the wisdom of the Perl Monks concerning the following question:

I'm using the HTML::Parser class to remove all tags from an HTML document.

After the call to HTMLParser->parse($_); I want to have a line of text, tag-free, as a scalar variable to do with as I choose. HTMLParser->parse(); returns a reference to the parser object.

How can I get that scalar? Should it be done in the event handler for 'text' items?


Replies are listed 'Best First'.
(jeffa) Re: How to get HTML::Parser to return a line of parsed text
by jeffa (Bishop) on Feb 06, 2001 at 20:04 UTC
    Personally, I like japhy's answer, but since you are wanting to know how to do it with HTML::Parser, here is yet another way.
    use strict;
    use LWP::Simple;
    use HTML::Parser;

    # get the content of the web page
    my $content = get("");

    # instantiate a new parser and let it crunch our data
    my @lines;
    my $parser = new MyParser;
    $parser->parse($content);

    {
        package MyParser;
        use base qw(HTML::Parser);

        # this method supplies the text, no tags :)
        sub text {
            my ($self, $origtext) = @_;
            print $origtext, "\n";
        }
    }

    Unfortunately, this is the OLD way to use HTML::Parser; I haven't learned the new way yet (bad jeffa!). But this should get you going.

    UPDATE: If you want to store the contents in a variable, just add

    my $stripped_html; # or whatever you wanna call it
    Then, inside the text subroutine replace the print line with:
    $stripped_html .= $origtext;
    I would recommend using an array instead, however:
    my @stripped_html;
    # and inside &text
    push(@stripped_html, $origtext);
    UPDATE: UPDATE: just do what merlyn says :)


    (the triplet paradiddle)
      The new way would simply be:
      use HTML::Parser;
      use LWP::Simple;

      my $html = get "";
      HTML::Parser->new(text_h => [\my @accum, "text"])->parse($html);
      print map $_->[0], @accum;

      -- Randal L. Schwartz, Perl hacker

        These are probably extremely stupid questions, but I couldn't find answers to them - I only have merlyn's llama book on my desk; no camel yet.

        First of all, what does the backslash do in \my @accum?

        Secondly, what is the 'map' keyword, and what does the $_->[0] refer to?

        Apologies if these are too trivial...
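
Both questions can be illustrated without HTML::Parser at all. The sketch below is only an approximation of what happens in merlyn's snippet: the `push` line stands in for the parser, on the assumption (taken from the `$_->[0]` in that snippet) that each text event appends one array reference holding the requested values. The names `$ref`, `@pieces`, and `$joined` are made up for the illustration.

```perl
use strict;
use warnings;

# \my @accum declares a new array and yields a reference to it in one step.
my $ref = \my @accum;

# Stand-in for the parser: append one array reference per "text event",
# each holding a single value (the text itself).
push @$ref, ['Hello, '], ['world'];

# map evaluates its expression once per element of the list, with $_ set
# to that element. Each element is an array reference, so $_->[0] is the
# first item inside it.
my @pieces = map $_->[0], @accum;
my $joined = join '', @pieces;

print "$joined\n";    # prints "Hello, world"
```

So the backslash takes a reference to the freshly declared array, and `map` builds a new list by pulling the first item out of each stored array reference.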

      Unless I don't understand your code as well as I'd like (and I might not - a. I'm new at this and b. I'm using the NEW way :), this is what I'm already doing, more or less. The problem is that your script prints the plaintext, and I want the plaintext in a variable - however that can be accomplished.

      I tried this:

      my $HTMLParser = HTML::Parser->new(text_h => [sub {return shift;}, "text"], default_h => [""]);

      But that, of course, just returns the line of HTML, unparsed.

      Any further ideas, anyone?
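
One likely reason the attempt above does not help is that the value returned by a handler is not passed back to the caller; as the question itself notes, parse() returns the parser object. A minimal sketch of one way around that, assuming a fresh parser per line is acceptable, is to let the handler append to a variable it closes over. The helper name `strip_tags` is hypothetical and not part of HTML::Parser.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the tag-free text of one chunk of HTML.
sub strip_tags {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # The handler's return value is not used, so append to $text,
        # which the anonymous sub closes over. 'dtext' asks for the text
        # with entities such as &amp; already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed

    return $text;
}

my $line = strip_tags('<p>This is a <b>test</b></p>');
print "$line\n";
```

Because each call builds its own parser, a tag that is split across two input lines would not be recognised; feeding the whole document to a single parser avoids that.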
Re: How to get HTML::Parser to return a line of parsed text
by davorg (Chancellor) on Feb 06, 2001 at 20:22 UTC

    It goes something like this:

    #!/usr/bin/perl -w
    use strict;
    use HTML::Parser;

    my $text;
    my $p = HTML::Parser->new(text_h => [ sub {$text .= shift}, 'dtext']);
    $p->parse_file('test.html');
    print $text;

    which, when used on a file like this:

    <html>
    <head>
    <title>Test</title>
    </head>
    <body>
    <h1>Test Stuff</h1>
    <p>This is a test</p>
    <ul>
    <li>this</li>
    <li>is a</li>
    <li>list</li>
    </ul>
    </body>
    </html>

    produces the following output:

    Test Test Stuff This is a test this is a list

    Does that help?

    Update: But merlyn's solution is way cooler.


    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

Re: How to get HTML::Parser to return a line of parsed text
by japhy (Canon) on Feb 06, 2001 at 19:53 UTC
      I've noticed that the display method will still show DTDs and PIs and SSIs and comments. I'd appreciate any feedback as to whether these should be discarded.
      The display method takes a number which represents the depth of tags to show. If you give it 0, no tags are shown. If you give it 1, the top-most layer of tags is shown. (-1, or no value, expands all tags.) Should all non-text items be included in this filtering? Or just tags?

      japhy -- Perl and Regex Hacker
