PerlMonks  

Stripping of HTML content

by Nemp (Pilgrim)
on Sep 12, 2002 at 15:49 UTC

Nemp has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I have some input, let's call it @webpage_lines, and I want to strip all of the HTML tags from it with as little effort as possible (remembering, of course, that laziness is a programmer's virtue).

Relevant facts are:
  • The combined size of $webpage_lines[0 .. n] is a few thousand characters maximum.
  • I don't care what the content looks like afterwards as long as all data outside of *any* tag (whether or not valid HTML, to be safe) is included.
  • I want this to be as simple as possible with as little code as possible, but I want to be 100% sure that nothing remains that could be interpreted by a web browser as a tag (from an innocent <br> to a malicious script).
At the moment I am doing this:
    $page .= $_ for (@webpage_lines);
    $page =~ s/<[^>]*>//;
Remembering that simplicity and size are paramount...
Should I be looking at some kind of HTML parser instead?
Is this an extremely slow solution because of the size $page could reach?
Is this code sufficient?
Any other thoughts about this problem?

Thanks for your input!
Neil

Replies are listed 'Best First'.
Re: Stripping of HTML content
by davorg (Chancellor) on Sep 12, 2002 at 16:13 UTC

    As Molt says, parsing HTML with regexes is very fragile and you'd be better off using a real HTML parser to do this.

    Here's a simple example using HTML::Parser.

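    use warnings;
    use strict;
    use HTML::Parser;

    # Slurp the whole input (files named on the command line, or STDIN).
    my $html = do { local $/; <> };

    # Collect every text event into @text; 'dtext' is the entity-decoded text.
    my @text;
    my $p = HTML::Parser->new(text_h => [\@text, 'dtext']);
    $p->parse($html);
    $p->eof;    # signal end of document so any buffered text is flushed

    # Each collected event is an array ref holding the requested argument.
    print map { $_->[0] } @text;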
    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: Stripping of HTML content
by Ovid (Cardinal) on Sep 12, 2002 at 16:31 UTC

    HTML::TokeParser::Simple. The function you want is in the docs.

    use HTML::TokeParser::Simple;

    my $html = join '', @webpage_lines;
    my $p = HTML::TokeParser::Simple->new( \$html );
    while ( my $token = $p->get_token ) {
        # This prints all text in an HTML doc (i.e., it strips the HTML)
        next if ! $token->is_text;
        print $token->return_text;
    }

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: Stripping of HTML content
by Molt (Chaplain) on Sep 12, 2002 at 16:10 UTC

    Look at an HTML parser, either HTML::Parser itself or (even more simply) HTML::TokeParser. Your regexps are fragile and will break. Try feeding <img src="this.gif" alt="<<THIS>>"> into it and watch it fall over screaming.

    If you want a nice full description of HTML parsing, and this is going to be something you're doing a lot of, then peer into 'Perl and LWP' from O'Reilly.

      Hi Molt,

      Thanks for the reply, but as I stated in my first post, I don't really mind that your line of code would leave me with >"> in my output right now, as long as there are no valid tags left that could alter formatting, run scripts, etc. I'm working on learning this from the ground up :)

      But the book sounds good - I'll look into it for future reference :)

      Thanks!,
      Neil
        Depending on how much inaccuracy you can tolerate, you can get a reasonable facsimile of stripping all HTML by doing:
        $page =~ s/<[^<>]*>//g; # Note the added < inside []
        assuming the entire page content is in $page. A line-by-line approach will fail on tags that span multiple lines. The regexp above will break if you have unbalanced < or > inside of HTML tags, but may be good enough for your use.
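For comparison, here is a minimal sketch of how the substitutions mentioned so far behave on the same inputs. The sub names and the first sample string are made up for illustration; only the patterns and Molt's img example come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper names; each takes a string and returns a stripped copy.

# The substitution from the original question: no /g, so only the first
# match is removed.
sub strip_first {
    my ($s) = @_;
    $s =~ s/<[^>]*>//;
    return $s;
}

# Same pattern with /g. [^>] also matches newlines, so a tag that spans
# several lines is removed as long as the whole page is in one string.
sub strip_all {
    my ($s) = @_;
    $s =~ s/<[^>]*>//g;
    return $s;
}

# The variant with < added to the character class.
sub strip_no_nested {
    my ($s) = @_;
    $s =~ s/<[^<>]*>//g;
    return $s;
}

my $simple = "<p>one<br>two</p>";
my $tricky = '<img src="this.gif" alt="<<THIS>>">';   # Molt's test case

print strip_first($simple), "\n";       # one<br>two</p>
print strip_all($simple), "\n";         # onetwo
print strip_all($tricky), "\n";         # >">
print strip_no_nested($tricky), "\n";   # <img src="this.gif" alt="<>">
```

On Molt's input the [^<>] version removes only the inner <THIS> and leaves the start of the img tag behind, which matters if the requirement is that nothing tag-like survives. The plain [^>] pattern with /g leaves stray characters but no opening angle bracket on that input.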
Re: Stripping of HTML content
by thpfft (Chaplain) on Sep 12, 2002 at 18:38 UTC

    I suppose I should suggest this:

    use HTML::TagFilter;
    my $tf = HTML::TagFilter->new;
    ...
    $tf->allow_tags({});
    my $text = $tf->filter($html);

    which does exactly the same as the other parser-based solutions, but by way of a subclass that hides much of the unpleasantness. It might be worth a look if you think you'll want to strip HTML selectively later on, but otherwise I'd stick with one of the more direct methods described above.

Re: Stripping of HTML content
by mp (Deacon) on Sep 12, 2002 at 16:19 UTC
Re: Stripping of HTML content
by QwertyD (Pilgrim) on Sep 12, 2002 at 21:57 UTC

    One of the simplest ways to go about this would just be to replace "<" with "&lt;", and ">" with "&gt;".

    This way, you don't have to worry about balancing the tag beginnings and endings, and it won't break a message using the angle brackets to mean "less than" and "greater than".
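A minimal sketch of that idea, assuming the text is held in a single string. The sub name escape_html is hypothetical, and escaping & first is an addition beyond what the post describes: it keeps text that already looks like an entity from being read as one by the browser.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): make text inert for a browser
# by turning markup characters into entities instead of deleting them.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # first, so the & in &lt; and &gt; is not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print escape_html('<script>alert(1)</script> 1 < 2 && 3 > 2'), "\n";
# &lt;script&gt;alert(1)&lt;/script&gt; 1 &lt; 2 &amp;&amp; 3 &gt; 2
```

Nothing is removed here, so the visitor sees the original markup as literal text rather than having it rendered or executed.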


    Update: I probably got the idea from the forum at Joel On Software. Depending on how this is being used and who your users are, you may also want to have a note telling the user that they can't use HTML to format their submission.

    Another Update: Changed the second angle bracket to the closing angle bracket, changed the entities to match.


    How do I love -d? Let me count the ways...
