strip HTML tags

This regex was written a long time ago, when HTML::Parser was implemented in pure Perl (it has since been rewritten in C to improve performance). Back then our regex was many times faster and more accurate than HTML::Parser.
I have now compared this regex against HTML::Parser 3.25 (the new C version) again. My regex wins again, though no longer "many times faster", just "some percent faster" (roughly 20% in the runs below). ;-) I can't test which of them is "more accurate" right now.
I tested them on a very small, simple HTML file and on the MySQL manual (2.3 MB). The results are:

HTML	Loops	regex	HTML::Parser
small.html	1000	0.29 sec	0.36 sec
mysql.html	1	1.6 sec	2.0 sec

Authors: me and asdfgroup.

sub untag {
  # use the argument if one is given (even "0" or ""), else the caller's $_
  local $_ = defined $_[0] ? $_[0] : $_;
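# ALGORITHM:
#   match "<", then one of:
#       a comment opener "!--" (<!-- ... -->),
#       or a processing-instruction opener "?" (<? ... ?>),
#       or one of the start tags that require a corresponding
#           end tag, plus everything up to that end tag,
#       or any other tag; inside it, if \s or "=" is followed by a quote,
#           then skip to the matching closing quote,
#           else consume one [^>] char,
#   then match the closing ">"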
  s{
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of the start tags
          SCRIPT |  #     whose entire
          APPLET |  #     content is
          OBJECT |  #     skipped, up to
          STYLE     #     the corresponding
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(4)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     and next is not : (D)
          [\s=]     #       \s or "="
          ["`']     #       with open quotes
        )           #     close (D)
        [^>] |      #     and not close tag or
        [\s=]       #     \s or "=" with
        `[^`]*` |   #     something in quotes ` or
        [\s=]       #     \s or "=" with
        '[^']*' |   #     something in quotes ' or
        [\s=]       #     \s or "=" with
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(1)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(2)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(3)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\3)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
   }{}gsx;          # STRIP THIS TAG
  return defined $_ ? $_ : "";   # keep a literal "0" instead of turning it into ""
}
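
The regex above leans on Perl's conditional construct (?(N)yes|no), which tries the "yes" branch only when capture group N took part in the match. As a rough illustration, here is a much smaller sketch that handles only <!-- ... --> comments and ordinary tags. The name strip_simple is made up for this example; it is not part of untag and does not reproduce its quote or container-tag handling.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only, not the original untag().
sub strip_simple {
    my ($html) = @_;
    $html =~ s{
        <                        # open tag
        (?: (!--) | [/A-Za-z] )  # group 1 is set only for a comment opener
        (?(1)                    # if we are inside a comment
            .*? (?<=--)          #   scan lazily until the preceding text is "--"
          |                      # else
            [^>]*                #   take everything up to ">"
        )
        >                        # tag or comment closed
    }{}gsx;
    return $html;
}

print strip_simple('a<!-- x > y -->b<b>c</b>'), "\n";   # prints "abc"
```

The ">" inside the comment does not end the match, because the lookbehind only lets the final ">" match when it is preceded by "--".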

Re: strip HTML tags by japhy (Canon) on Jul 23, 2002 at 21:33 UTC
Backticks (`) are not valid HTML quoting characters.
Re: Re: strip HTML tags by powerman (Friar) on Aug 21, 2002 at 17:30 UTC
Yeah, I know. But IE and Netscape parse them as valid quoting characters! My goal is to parse HTML tags exactly as IE does. I wrote this "strip HTML tags" regex for a search engine, and I am sure that a search engine must index the text the user will see when he opens the source page, i.e. the text that IE will display on that page.
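
To make the point about quoting concrete: a ">" inside a quoted attribute value must not end the tag, whichever quote character is used. The following is a simplified sketch of that idea, with a hypothetical helper name strip_quoted. It is not the regex from the post (it has no lookahead guard for unterminated quotes), and it treats backticks as quotes only on the assumption described above about how those browsers behaved.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only, not the original untag().
sub strip_quoted {
    my ($html) = @_;
    $html =~ s{
        < [/A-Za-z]            # start of an ordinary tag
        (?:
            [\s=] "[^"]*"      # whitespace or "=" followed by a double-quoted value
          | [\s=] '[^']*'      # ... or a single-quoted value
          | [\s=] `[^`]*`      # ... or a backtick-quoted value
          | [^>]               # ... or any other char that is not ">"
        )*
        >                      # the real end of the tag
    }{}gsx;
    return $html;
}

print strip_quoted('<a title="x > y" href=`p>q`>link</a> tail'), "\n";   # prints "link tail"
```

A naive s/<[^>]*>//g would stop at the first ">" inside the title attribute and leave part of the tag in the indexed text.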