strip HTML tags

by powerman (Friar)
Description: This regex was written a long time ago, when HTML::Parser was still pure Perl (it has since been rewritten in C to improve performance). Back then our regex was many times faster and more accurate than HTML::Parser.
I have now compared this regex against HTML::Parser 3.25 (the new C version). My regex wins again, though not "many times faster", just some percent faster. ;-) I can't test which of them is more accurate right now.
I tested both on a very small, simple HTML file and on the MySQL manual (2.3MB). The results are:
file         runs   regex      HTML::Parser
small.html   1000   0.29 sec   0.36 sec
mysql.html   1      1.6 sec    2.0 sec
Authors: me and asdfgroup.
sub untag {
  local $_ = $_[0] || $_;
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags that requires a matching
#           end tag, plus everything up to that end tag
#       or, after \s or = ,
#           if a quote follows, skip to the matching quote
#           else take [^>]
#   >
  s{
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of the start tags
          SCRIPT |  #     whose entire content
          APPLET |  #     must be skipped,
          OBJECT |  #     up to the
          STYLE     #     corresponding
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(4)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     unless what follows is (D):
          [\s=]     #       \s or "="
          ["`']     #       and an opening quote
        )           #     close (D)
        [^>] |      #     any char except ">", or
        [\s=]       #     \s or "=" followed by
        `[^`]*` |   #     something in quotes `, or
        [\s=]       #     \s or "=" followed by
        '[^']*' |   #     something in quotes ', or
        [\s=]       #     \s or "=" followed by
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(1)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(2)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(3)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\3)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
   }{}gsx;          # STRIP THIS TAG
  return $_ ? $_ : "";
}
Re: strip HTML tags
by japhy (Canon) on Jul 23, 2002 at 21:33 UTC
    Backticks (`) are not valid HTML quoting characters.

      Yeah, I know. But IE and Netscape parse them as valid quoting characters! My goal is to parse HTML tags just the way IE does. I wrote this "strip HTML tags" regex for a search engine, and I believe a search engine must index the text the user will see when they open the source page, i.e. the text that IE will show on that page.
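
      To illustrate the point about quoting, here is a minimal sketch that isolates the quote-aware part of the regex (group (C) in the post) and compares it with a naive tag pattern. The names $naive, $quote_aware and strip_with are hypothetical and exist only for this demo; the sample input is made up.

```perl
use strict;
use warnings;

# Naive pattern: a tag ends at the first ">", even inside a quoted value.
my $naive = qr{<[^>]*>};

# Quote-aware pattern, mirroring group (C) of the posted regex:
# after \s or "=", a value quoted with `, ' or " is skipped as a whole.
my $quote_aware = qr{
    < [!/A-Za-z]
    (?: (?! [\s=] ["`'] ) [^>]
      | [\s=] `[^`]*`
      | [\s=] '[^']*'
      | [\s=] "[^"]*"
    )*
    >
}x;

# Hypothetical helper: remove every match of $re from a copy of $text.
sub strip_with {
    my ($re, $text) = @_;
    $text =~ s/$re//g;
    return $text;
}

my $html = 'before <a title=`1 > 0`>link</a> after';
print strip_with($naive,       $html), "\n";   # before  0`>link after
print strip_with($quote_aware, $html), "\n";   # before link after
```

      The naive pattern cuts the tag at the ">" inside the backtick-quoted value and leaks part of the attribute into the text, while the quote-aware pattern removes the whole tag.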
Re: strip HTML tags
by Anonymous Monk on Jun 07, 2002 at 14:44 UTC
    This is an interesting piece of code, but how do you use it?
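
    A minimal usage sketch, assuming the sub is completed as in the original post: the copy of untag below is the same regex with the comments stripped so that the example runs on its own, and the sample HTML is made up. The sub takes the HTML as its first argument and returns the stripped text; called with no argument it falls back to $_.

```perl
use strict;
use warnings;

# Compact copy of the posted regex (comments removed). Capture groups:
# 1 = "!--", 2 = "?", 3 = container tag name, 4 = any other tag start.
sub untag {
    local $_ = $_[0] || $_;
    s{
        <
        (?: (!--) | (\?)
          | (?i: ( TITLE | SCRIPT | APPLET | OBJECT | STYLE ) )
          | ([!/A-Za-z])
        )
        (?(4)
            (?: (?! [\s=] ["`'] ) [^>]
              | [\s=] `[^`]*`
              | [\s=] '[^']*'
              | [\s=] "[^"]*"
            )*
          | .*?
        )
        (?(1) (?<=--) )
        (?(2) (?<=\?) )
        (?(3) </ (?i:\3) (?:\s[^>]*)? )
        >
    }{}gsx;
    return $_ ? $_ : "";
}

my $html = '<html><head><title>Demo</title></head>'
         . '<body><!-- note --><p>Hello, <b>world</b>!</p>'
         . '<script>var x = "<b>";</script></body></html>';

# Explicit argument: the original string is left untouched.
print untag($html), "\n";          # Hello, world!

# No argument: the sub works on a localized copy of $_.
{
    local $_ = '<i>implicit</i> topic';
    print untag(), "\n";           # implicit topic
}
```

    Note that TITLE, SCRIPT, APPLET, OBJECT and STYLE are removed together with their content, so "Demo" and the script body do not appear in the output.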
