DarkBlue has asked for the wisdom of the Perl Monks concerning the following question:

According to perl.com, I should use HTML::Parse to strip HTML from a string (not the regex I had been using, which caused problems under some configurations). I have read the documentation for HTML::Parse and have tried experimenting with a bit of code, but I can't get it to work for me...
use HTML::Parse;
$_[0] = parse_html($_[0]);
$_[0]->delete
All I want to do is remove any HTML tags from $_[0] - can anyone help with this?


In theory, there is no difference between theory and practise.  But in practise, there is.
Jonathan M. Hollin

2001-03-03 Edit by Corion: Changed title so it doesn't clash with the module.

Re: HTML::Parse
by davorg (Chancellor) on Feb 22, 2001 at 21:50 UTC
      Yep, HTML::Parser IS what I should have used. When I read the documentation again I picked up on this, don't know how I managed to miss such a salient fact first time round... :-(
      Anyway, maybe I'm just a crap programmer, but I still couldn't get HTML::Parser to work... So I butchered Tom Christiansen's code (cleaned out the comments) to produce the following sub-routine which works on all HTML I've tested it with thus far...
      sub strip_html {
          require 5.002;

          # Remove <!-- comments -->, keeping any non-comment text found in the declaration
          $_[0] =~ s{ <! (.*?) ( -- .*? -- \s* )+ (.*?) > }{
              if ($1 || $3) { "<!$1 $3>"; }
          }gesx;

          # Remove all remaining tags, allowing '>' inside quoted attribute values
          $_[0] =~ s{ < (?: [^>'"] * | ".*?" | '.*?' ) + > }{}gsx;

          # Translate known entities; anything unrecognised is left as found
          $_[0] =~ s{ ( & ( \x23\d+ | \w+ ) ;? ) } { $entity{$2} || $1 }gex;

          BEGIN {
              %entity = (
                  lt=>'<', gt=>'>', amp=>'&', quot=>'"',
                  nbsp=>chr 160, iexcl=>chr 161, cent=>chr 162, pound=>chr 163,
                  curren=>chr 164, yen=>chr 165, brvbar=>chr 166, sect=>chr 167,
                  uml=>chr 168, copy=>chr 169, ordf=>chr 170, laquo=>chr 171,
                  not=>chr 172, shy=>chr 173, reg=>chr 174, macr=>chr 175,
                  deg=>chr 176, plusmn=>chr 177, sup2=>chr 178, sup3=>chr 179,
                  acute=>chr 180, micro=>chr 181, para=>chr 182, middot=>chr 183,
                  cedil=>chr 184, sup1=>chr 185, ordm=>chr 186, raquo=>chr 187,
                  frac14=>chr 188, frac12=>chr 189, frac34=>chr 190, iquest=>chr 191,
                  Agrave=>chr 192, Aacute=>chr 193, Acirc=>chr 194, Atilde=>chr 195,
                  Auml=>chr 196, Aring=>chr 197, AElig=>chr 198, Ccedil=>chr 199,
                  Egrave=>chr 200, Eacute=>chr 201, Ecirc=>chr 202, Euml=>chr 203,
                  Igrave=>chr 204, Iacute=>chr 205, Icirc=>chr 206, Iuml=>chr 207,
                  ETH=>chr 208, Ntilde=>chr 209, Ograve=>chr 210, Oacute=>chr 211,
                  Ocirc=>chr 212, Otilde=>chr 213, Ouml=>chr 214, times=>chr 215,
                  Oslash=>chr 216, Ugrave=>chr 217, Uacute=>chr 218, Ucirc=>chr 219,
                  Uuml=>chr 220, Yacute=>chr 221, THORN=>chr 222, szlig=>chr 223,
                  agrave=>chr 224, aacute=>chr 225, acirc=>chr 226, atilde=>chr 227,
                  auml=>chr 228, aring=>chr 229, aelig=>chr 230, ccedil=>chr 231,
                  egrave=>chr 232, eacute=>chr 233, ecirc=>chr 234, euml=>chr 235,
                  igrave=>chr 236, iacute=>chr 237, icirc=>chr 238, iuml=>chr 239,
                  eth=>chr 240, ntilde=>chr 241, ograve=>chr 242, oacute=>chr 243,
                  ocirc=>chr 244, otilde=>chr 245, ouml=>chr 246, divide=>chr 247,
                  oslash=>chr 248, ugrave=>chr 249, uacute=>chr 250, ucirc=>chr 251,
                  uuml=>chr 252, yacute=>chr 253, thorn=>chr 254, yuml=>chr 255,
              );
              # Numeric entities &#0; through &#255;
              for $chr ( 0 .. 255 ) {
                  $entity{ '#' . $chr } = chr $chr;
              }
          }

          return $_[0];
      }

      I call this with a simple
      $_[0] = &strip_html($_[0]);
      where "$_[0]" is the string that I want to process.
      It works well enough, so I'll leave it there.
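
      For comparison, here is a minimal sketch of the HTML::Parser route discussed above. It assumes the version 3 handler API and that HTML::Parser is installed from CPAN; the sub name strip_html_with_parser is made up for this illustration and is not part of the module. It collects decoded text and skips the contents of script and style elements, and has not been tested against the problem HTML from the original question.

      ```perl
      use strict;
      use warnings;
      use HTML::Parser ();

      # Hypothetical helper: returns the text content of an HTML string.
      sub strip_html_with_parser {
          my ($html) = @_;
          my $text = '';

          my $p = HTML::Parser->new(
              api_version => 3,
              # 'dtext' passes each text chunk with entities already decoded
              text_h      => [ sub { $text .= shift }, 'dtext' ],
          );

          # Drop everything inside <script> and <style>, not just the tags
          $p->ignore_elements(qw(script style));

          $p->parse($html);
          $p->eof;    # signal end of document so buffered text is flushed

          return $text;
      }

      print strip_html_with_parser('<p>Fish &amp; <b>chips</b></p>'), "\n";
      ```

      Unlike the regex version, this returns a new string and leaves its argument alone, so it would be called as $_[0] = strip_html_with_parser($_[0]);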


      In theory, there is no difference between theory and practise.  But in practise, there is.
      Jonathan M. Hollin
Re: HTML::Parse
by MeowChow (Vicar) on Feb 23, 2001 at 04:18 UTC
    A better solution would be to use the more current HTML::TreeBuilder:
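    use HTML::TreeBuilder;
    my $html = ...;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;       # tell the parser the document is complete
    my $text = $tree->as_text();
    $tree->delete;    # free the tree when done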
    For my purposes, the as_text method of HTML::Element was somewhat insufficient, as I wanted to exempt specific tags from having their content added to the text, and I wanted to be able to place a delimiter where tags once were. So, I created my own tree2text subroutine:
    # delim        => delimiter to replace stripped tags
    # escape_delim => text to escape delimiter with if present in content
    # skip_tags    => tags for which content is not added to text,
    #                 defaults to 'script' and 'style'
    #
    sub tree2text {
        my ($tree, %options) = @_;
        my $delim     = defined $options{delim} ? $options{delim} : '';
        my $esc_delim = $options{escape_delim};
        my $skip      = $options{skip_tags} || ['script', 'style'];
        my @skip      = map { lc } @$skip;
        my @nodes     = ($tree);
        my ($tag, $text);
        $text = '';
        while (@nodes) {
            my $node = shift @nodes;
            if (!defined $node) {
                # move along
            }
            elsif (!ref $node) {
                # text node: match the delimiter literally, and skip the
                # escape step entirely when the delimiter is empty
                $node =~ s/\Q$delim\E/$esc_delim/g
                    if defined $esc_delim && length $delim;
                $text .= $delim.$node;
            }
            else {
                # element node: queue its children unless the tag is skipped
                $tag = $node->tag;
                next if grep { $tag eq $_ } @skip;
                unshift @nodes, $node->content_list;
            }
        }
        return $text;
    }
    You may also find my recent meditation regarding evil spaces to be of value in your endeavour.
                   s aamecha.s a..a\u$&owag.print
Re: Getting HTML::Parse to work (was: HTML::Parse)
by kilinrax (Deacon) on Jun 24, 2003 at 17:20 UTC
    In an act of blatant self-promotion, I'd suggest you use HTML::Strip:
    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;