strip HTML tags

by powerman (Friar)
on Apr 23, 2002 at 13:07 UTC (#161281, snippet)
Description: This regex was written a long time ago, when HTML::Parser was still implemented in pure Perl (it has since been rewritten in C to improve performance). Back then our regex was many times faster and more accurate than HTML::Parser.
I have now compared the regex with HTML::Parser 3.25 (the new C version) again. My regex wins again, though no longer "many times faster", just "some percent faster". ;-) I can't test which of them is "more accurate" right now.
I tested them on a very small, simple HTML file and on the MySQL manual (2.3MB). The results are:
file        runs  regex     HTML::Parser
small.html  1000  0.29 sec  0.36 sec
mysql.html     1  1.6 sec   2.0 sec
Authors: me and asdfgroup.
sub untag {
  local $_ = defined $_[0] ? $_[0] : $_;   # use the argument, or fall back to $_
#   find < ,
#       then a comment <!-- ... -->,
#       or a comment <? ... ?> ,
#       or one of the start tags that require a corresponding
#           end tag, plus everything up to that end tag,
#       or, if \s or = is followed by a quote,
#           then skip to the matching quote
#           else [^>]
#   >
  s{
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of the start tags
          SCRIPT |  #     whose entire content
          APPLET |  #     must be skipped,
          OBJECT |  #     up to the
          STYLE     #     corresponding
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(4)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     and next is not : (D)
          [\s=]     #       \s or "="
          ["`']     #       with open quotes
        )           #     close (D)
        [^>] |      #     and not close tag or
        [\s=]       #     \s or "=" with
        `[^`]*` |   #     something in quotes ` or
        [\s=]       #     \s or "=" with
        '[^']*' |   #     something in quotes ' or
        [\s=]       #     \s or "=" with
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(1)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(2)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(3)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\3)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
   }{}gsx;          # STRIP THIS TAG
  return defined $_ ? $_ : "";
}
Replies are listed 'Best First'.
Re: strip HTML tags
by japhy (Canon) on Jul 23, 2002 at 21:33 UTC
    Backticks (`) are not valid HTML quoting characters.

    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Yeah, I know. But IE and Netscape parse them as valid quoting characters! And my goal is to parse HTML tags just as IE does. I wrote this "strip HTML tags" regex for a search engine, and I am sure that a search engine must index the text the user will see when he opens the source page, i.e. the text that IE will show on that page.
Re: strip HTML tags
by Anonymous Monk on Jun 07, 2002 at 14:44 UTC
    This is an interesting piece of code, but how do you use it?
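
    A minimal usage sketch, in case it helps. It assumes the sub is called as untag($html), or with no argument so that it reads $_. The regex is a compact copy of the one in the snippet with the per-line comments dropped. The defined-checks, the sample strings and the variable names are illustrative assumptions, not part of the original post.

```perl
use strict;
use warnings;

# Compact copy of the untag() regex from the snippet (comments dropped).
# Assumption: defined-checks are used so that "" and "0" pass through intact.
sub untag {
    local $_ = defined $_[0] ? $_[0] : $_;
    s{
        <
        (?: (!--) | (\?)
          | (?i: ( TITLE | SCRIPT | APPLET | OBJECT | STYLE ) )
          | ([!/A-Za-z])
        )
        (?(4)
            (?: (?! [\s=] ["`'] ) [^>]
              | [\s=] `[^`]*`
              | [\s=] '[^']*'
              | [\s=] "[^"]*"
            )*
          | .*?
        )
        (?(1) (?<=--) )
        (?(2) (?<=\?) )
        (?(3) </ (?i:\3) (?:\s[^>]*)? )
        >
    }{}gsx;
    return defined $_ ? $_ : "";
}

# 1. Pass the HTML as an argument; the stripped text is returned.
my $plain = untag('<p>Hello, <b>world</b>!</p>');
print "$plain\n";

# 2. With no argument the sub works on a copy of $_ and returns the result.
{
    local $_ = '<title>skipped</title><h1>Kept</h1>';
    print untag(), "\n";
}

# 3. A ">" inside a quoted attribute value does not end the tag. Backticks
#    are treated as quotes too, as discussed in the reply above.
print untag(q{<a href="a>b">double</a>}), "\n";
print untag(q{<a title=`a>b`>backtick</a>}), "\n";

# 4. Comments are removed up to the closing "-->".
print untag('a<!-- x > y -->b'), "\n";
```

    Note that the contents of TITLE, SCRIPT, APPLET, OBJECT and STYLE elements are dropped together with their tags, while for every other tag only the tag itself is removed.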