PerlMonks  

strip HTML tags

by powerman (Friar)
on Apr 23, 2002 at 13:07 UTC (#161281=snippet)

Description: This regex was written a long time ago, when HTML::Parser was implemented in pure Perl (it has since been rewritten in C to improve performance). Back then our regex was many times faster and more accurate than HTML::Parser.
I have now compared this regex with HTML::Parser 3.25 (the new C version) again. My regex wins again, though no longer "many times faster", just "some percent faster". ;-) I can't test which of them is "more accurate" right now.
I tested both on a very small, simple HTML file and on the MySQL manual (2.3MB). The results are:
HTML        Loops  regex     HTML::Parser
small.html  1000   0.29 sec  0.36 sec
mysql.html  1      1.6 sec   2.0 sec
Authors: me and asdfgroup.
sub untag {
  # Use the argument if one was passed (even "0" or ""), else fall back to $_.
  local $_ = defined $_[0] ? $_[0] : $_;
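# ALGORITHM:
#   match < ,
#       then a comment <!-- ... -->,
#       or another comment <? ... ?> ,
#       or one of the start tags that requires a corresponding
#           end tag, plus everything up to that end tag,
#       or an ordinary tag, where for each character:
#           if it is \s or = followed by an opening quote,
#               then skip to the closing quote,
#               else take any [^>],
#   then match >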
  s{
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of the start tags
          SCRIPT |  #     whose whole content
          APPLET |  #     must be skipped,
          OBJECT |  #     up to the
          STYLE     #     corresponding
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(4)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     and next is not : (D)
          [\s=]     #       \s or "="
          ["`']     #       with open quotes
        )           #     close (D)
        [^>] |      #     and not close tag or
        [\s=]       #     \s or "=" with
        `[^`]*` |   #     something in quotes ` or
        [\s=]       #     \s or "=" with
        '[^']*' |   #     something in quotes ' or
        [\s=]       #     \s or "=" with
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(1)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(2)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(3)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\3)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
   }{}gsx;          # STRIP THIS TAG
  # Test definedness, not truth, so that a result of "0" is returned as "0".
  return defined $_ ? $_ : "";
}
Re: strip HTML tags
by Anonymous Monk on Jun 07, 2002 at 14:44 UTC
    This is an interesting piece of code, but how do you use it?
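
A minimal usage sketch, for what it is worth. It assumes a compacted copy of the snippet's untag() (same pattern with the comments removed, using definedness tests for the argument and the return value) so that the example runs on its own. The sample strings and the @fragments and @plain names are made up for illustration.

```perl
use strict;
use warnings;

# Compacted copy of the snippet's untag(); the pattern is unchanged apart
# from the removed comments.
sub untag {
  local $_ = defined $_[0] ? $_[0] : $_;
  s{
    <
    (?: (!--) | (\?) | (?i:(TITLE|SCRIPT|APPLET|OBJECT|STYLE)) | ([!/A-Za-z]) )
    (?(4)
      (?: (?![\s=]["`'])[^>] | [\s=]`[^`]*` | [\s=]'[^']*' | [\s=]"[^"]*" )*
    |
      .*?
    )
    (?(1)(?<=--))
    (?(2)(?<=\?))
    (?(3) </ (?i:\3) (?:\s[^>]*)? )
    >
  }{}gsx;
  return defined $_ ? $_ : "";
}

# 1. Pass the HTML as an argument and use the return value.
my $html = '<p class="a>b">Hello <b>world</b></p><!-- note -->'
         . '<script>x = "<i>";</script>';
my $text = untag($html);
print "$text\n";                    # prints "Hello world"

# 2. Call it with no argument and it reads $_ (hypothetical sample data).
my @fragments = ('<em>one</em>', 'two<br>');
my @plain;
push @plain, untag() for @fragments;
print join(",", @plain), "\n";      # prints "one,two"
```

In both forms the input is left alone, because the substitution runs on a localized copy of $_. Note that a ">" inside a quoted attribute value does not end the tag, and that the whole content of TITLE, SCRIPT, APPLET, OBJECT and STYLE elements is dropped along with the tags.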
Re: strip HTML tags
by japhy (Canon) on Jul 23, 2002 at 21:33 UTC
    Backticks (`) are not valid HTML quoting characters.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Yeah, I know. But IE and Netscape parse them as valid quoting characters! And my goal is to parse HTML tags just like IE does. I wrote this "strip HTML tags" regex for a search engine, and I am sure that a search engine must index the text the user will see when he opens the source page, i.e. the text that IE will show on that page.
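
The disagreement above comes down to one alternative in the attribute-skipping group of the snippet. As a rough sketch, assuming two reduced patterns that cover only ordinary tags (the names $strict and $lenient and the sample string are hypothetical, not part of the snippet), the effect of the backtick rule can be seen like this:

```perl
use strict;
use warnings;

# Ordinary-tag part of the snippet WITHOUT the backtick alternative:
# only ' and " are treated as attribute quotes.
my $strict = qr{
  < [!/A-Za-z]
  (?: (?![\s=]["'])[^>] | [\s=]'[^']*' | [\s=]"[^"]*" )*
  >
}sx;

# The same part WITH the backtick alternative, as in the snippet.
my $lenient = qr{
  < [!/A-Za-z]
  (?: (?![\s=]["`'])[^>] | [\s=]`[^`]*` | [\s=]'[^']*' | [\s=]"[^"]*" )*
  >
}sx;

# A tag whose backtick-quoted attribute value contains a ">".
my $html = 'a<img alt=`x>y`>b';

(my $s = $html) =~ s/$strict//g;
(my $l = $html) =~ s/$lenient//g;

print "strict:  $s\n";   # the tag ends at the first ">", leaving "ay`>b"
print "lenient: $l\n";   # the quoted ">" is skipped, leaving "ab"
```

Which result is "right" depends on the goal: the strict pattern follows the HTML quoting rules, while the lenient one follows the browser behavior described above.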
