Strip HTML tags

by rlk (Pilgrim)
on Dec 15, 2000 at 13:40 UTC

Yes, I know this is a wheel that's been invented many times before, but I couldn't bring myself to pull out something as complex as HTML::Parser for a little task like this. N.B.: this will probably break badly on broken HTML. (Updated: now parses <foo bar='"'>, etc.) (Update #2: per chipmunk's point about backreferences not working inside character classes, I've split the middle regex into two.)
# With $_ holding the HTML text...

# Pull comments. Note that
#   `<!-- foo="--> bar <--" -->' will NOT strip ` bar '.
# I claim this to be a feature.
s/<!--.*?-->//g;

# For tags like <blah blah="blah" blah='blah' ... >,
# strip from after the start of the tag up to the end
# of the first quoted string, repeatedly, ending in either
# `<>' or `<no quotes here>'.
# Update: Now handles either quote char, with the other
# possibly within the quoted string.
while ( s/<(?!--)[^'">]*"[^"]*"/</g
     or s/<(?!--)[^'">]*'[^']*'/</g ) {};

# Strip HTML tags without quotes in them... which should be
# the only kind that we have left.
s/<(?!--)[^">]*>//g;

print $_;
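For trying the snippet on a few inputs, here is a minimal sketch that wraps the same three substitutions in a subroutine. The name `strip_tags` and the sample strings are made up for illustration; the regexes themselves are the ones from the post, applied to a lexical instead of `$_`.

```perl
use strict;
use warnings;

# Hypothetical wrapper name; the substitutions are the ones from the post.
sub strip_tags {
    my ($html) = @_;

    # 1. Remove comments that fit on one line (no /s modifier here,
    #    so "." does not cross newlines).
    $html =~ s/<!--.*?-->//g;

    # 2. Collapse each tag up to the end of its first quoted string,
    #    repeating until no quoted attribute values remain.
    while ( $html =~ s/<(?!--)[^'">]*"[^"]*"/</g
         or $html =~ s/<(?!--)[^'">]*'[^']*'/</g ) {}

    # 3. Remove the now quote-free tags.
    $html =~ s/<(?!--)[^">]*>//g;

    return $html;
}

print strip_tags(q{<p class="x">Hello <b>world</b></p>}), "\n";     # Hello world
print strip_tags(q{<a title='say "hi"' href="x">link</a>}), "\n";   # link
print strip_tags(q{before<!-- a comment -->after}), "\n";           # beforeafter
```

The second sample is the case from the update: a single-quoted value containing a double quote. The double-quote pass finds nothing to do at first, the single-quote pass reduces the tag to `< href="x">`, and the next trip round the loop finishes it off.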

Replies are listed 'Best First'.
Re: Strip HTML tags
by davorg (Chancellor) on Dec 15, 2000 at 13:49 UTC

    As discussed on many threads recently, this kind of work is really better left to the professionals - which in this case is HTML::Parser and its subclasses.

    Any regex-based solution is bound to break at some point as your HTML gets more complex.


    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

      Well, I mentioned that this'd break on bad HTML. Specifically, it assumes all tags have an even number of "outer" quotation marks(1) in them....

      In my defense, the complexity of the HTML is not at issue here, because I'm not really parsing it, I'm simply stripping all tags, which is a much simpler problem.

      (1)I hadn't realized that foo='"' was a legal attribute-value pair originally. This has since been fixed.

      Ryan Koppenhaver, Aspiring Perl Hacker
      "I ask for so little. Just fear me, love me, do as I say and I will be your slave."

Re: Strip HTML tags
by swiftone (Curate) on Dec 16, 2000 at 04:20 UTC
    While I generally agree that HTML::Parser is a pain (the great flexibility leads to great complexity), for something like this, HTML::TreeBuilder is just the ticket. Three simple lines (after loading the module)
    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file('foo.html');
    my $non_html = $tree->as_text();
    should do the trick. This quarter's Perl Journal has a good article on it (the included docs need work).
      Warning: This code strips out <anything> that is surrounded by <angle> <brackets>. It does not limit its action to true <html tags>.

      "Perl is a mess and that's good because the
      problem space is also a mess.
      " - Larry Wall

Re: Strip HTML tags
by chipmunk (Parson) on Dec 16, 2000 at 00:13 UTC
    I'm afraid your fix for single quotes won't work as intended; you cannot use backreferences inside a character class. Inside a character class, \1 et al. are octal escapes, so [^\1] matches any character other than \001, also known as control-A.
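A small sketch of the point above, followed by one possible way to keep a single regex. The lookahead version is an illustration only, not what the original post ended up using (it split the regex in two), and the sub name `strip_quoted_attrs` is hypothetical.

```perl
use strict;
use warnings;

# Inside a character class, \1 is the octal escape for chr(1),
# not a reference to the first capture group.
my $in_class = qr/^(a)[^\1]$/;
print "aa"    =~ $in_class ? "match" : "no match", "\n";   # match
print "a\001" =~ $in_class ? "match" : "no match", "\n";   # no match

# Assumed alternative: "any character that is not the captured quote"
# can be spelled with a negative lookahead outside a character class.
sub strip_quoted_attrs {
    my ($html) = @_;
    1 while $html =~ s/<(?!--)[^'">]*(['"])(?:(?!\1).)*\1/</gs;
    return $html;
}

print strip_quoted_attrs(q{<a title='say "hi"' href="x">}), "\n";   # <>
```

If a true backreference were in play, `"aa"` could not match the first pattern, since the second character equals the captured `a`. It does match, because the class only excludes control-A.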
Re: Strip HTML tags
by epoptai (Curate) on Dec 16, 2000 at 06:41 UTC
    Some regexes that 'work' for this. The first is from the great free code syntax highlighter:
    ($text = $html) =~ s/<(\/|!)?[-.a-zA-Z0-9]*.*?>//g;
    These are obvious (but too simple) solutions:
    $text =~ s/<[^>]*>//gs;        # only for most simple html!
    $text =~ s/<([^>]|\n)*>//g;    # multi-line comments?
    For in-depth discussion consult Perl Cookbook Recipe 20.6 which recommends using the HTML::Parser and HTML::FormatText modules.
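To make the "too simple" caveat concrete, here is a short sketch with a made-up input in which an attribute value contains `>`. It compares the plain `<[^>]*>` substitution with the quote-aware passes from the original post; the variable names are for illustration only.

```perl
use strict;
use warnings;

# Made-up input: the alt text contains a ">" inside a quoted value.
my $html = q{<img alt="1 > 0" src="x.png">caption};

# The simple regex stops at the first ">", which is inside the quotes.
(my $naive = $html) =~ s/<[^>]*>//gs;
print "naive:       [$naive]\n";         # [ 0" src="x.png">caption]

# The quote-aware passes from the original post, applied to a copy.
my $quote_aware = $html;
while ( $quote_aware =~ s/<(?!--)[^'">]*"[^"]*"/</g
     or $quote_aware =~ s/<(?!--)[^'">]*'[^']*'/</g ) {}
$quote_aware =~ s/<(?!--)[^">]*>//g;
print "quote-aware: [$quote_aware]\n";   # [caption]
```

Neither approach is a parser. The comparison only shows the one failure mode that the quote handling was written to avoid.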
