PerlMonks  

Re: Re: Re: Strip HTML tags again

by little (Curate)
on Jul 01, 2002 at 12:55 UTC


in reply to Re: Re: Strip HTML tags again
in thread Strip HTML tags again

Look up the POD (or your preferred docs) for HTML::Tagset.
To quote: "hashset %HTML::Tagset::isKnown
This hashset lists all known HTML elements."
So you have to compare your match against that list.
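
A minimal sketch of that comparison, not a definitive implementation: capture the element name of each tag and drop the tag only when the name is a key of %HTML::Tagset::isKnown. The helper name strip_known_tags and the small fallback tag list (used only when HTML::Tagset is not installed) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Fallback: a tiny hand-picked set (an assumption, for illustration only).
my %is_known = map { $_ => 1 } qw(a b i p br em strong code);

# Prefer the real list when HTML::Tagset is available.
if ( eval { require HTML::Tagset; 1 } ) {
    no warnings 'once';
    %is_known = %HTML::Tagset::isKnown;
}

# strip_known_tags is a hypothetical helper name, not from the thread.
sub strip_known_tags {
    my ($text) = @_;
    # $1 is the whole tag, $2 the element name; unknown tags are put back.
    $text =~ s{(</?\s*([A-Za-z][A-Za-z0-9]*)\b[^>]*>)}
              { $is_known{ lc $2 } ? '' : $1 }ge;
    return $text;
}

print strip_known_tags("x <b>bold</b> and <foo>kept</foo> 1 < 2\n");
# prints: x bold and <foo>kept</foo> 1 < 2
```

The /e modifier evaluates the replacement as code, which is what lets the hash lookup decide per tag. Like any regex approach, this is only an approximation of real HTML parsing.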

Have a nice day
All decision is left to your taste

Addendum

Look through the previous suggestions as well. At least try them, and ask again if you get an error or otherwise get stuck. :-)


Re: Re: Re: Re: Strip HTML tags again
by dda (Friar) on Jul 01, 2002 at 13:01 UTC
    The problem is how to extract 'my match' from the regexp shown earlier (or from another one; please suggest one). I know about that hashset; what I need is to apply it in my sub.

    --dda

      Did you look further than ides' suggestion? Did you try Ovid's suggestion?
      Have a nice day
      All decision is left to your taste
        Yes, Ovid's solution is fine, and its rating proves it. But I wanted to hear other ideas too.

        --dda

      Hi! I think this does what you want:
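      use HTML::Tagset;

      # Build an alternation of all known HTML element names: (a|p|code|...)
      my %tags = %HTML::Tagset::isKnown;
      my $tagpattern = "(" . join('|', keys %tags) . ")";
      print STDERR "$tagpattern\n";

      while (<>) {
          print strip_html_tags($_);
      }

      sub strip_html_tags {
          my $line = shift;
          # Drop a known opening tag and its matching closing tag,
          # keeping the text between them ($2).
          $line =~ s/<\s*$tagpattern(?:\s*>|\s+[^>]*>)([^<]*)<\s*\/\1[^>]*>/$2/ig;
          return $line;
      }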
      I first create the string $tagpattern by putting a "|" between all known HTML tags and surrounding the whole thing with parentheses. This gives something like "(a|p|code.....)", which is used later in the subroutine to check for valid HTML tags.

      The regex looks a bit complicated, and I am sure it could be written much better, but I believe it is sufficient for your purposes.

      Note that this only works for tags that open and close on the same line, and it could get you into trouble if there are < or > signs inside a tag (I don't know whether that is possible in HTML).
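
One possible way around the one-line limitation, offered as a sketch rather than a fix: apply the same substitution to the whole input as a single string, and repeat it until nothing changes so that nested pairs are peeled off from the inside out. The name strip_html_tags_multiline and the short tag list are assumptions for illustration; the code above builds the list from %HTML::Tagset::isKnown instead.

```perl
use strict;
use warnings;

# Small illustrative tag list (an assumption, not the full HTML::Tagset list).
my $tagpattern = '(' . join( '|', qw(a b i p em strong code) ) . ')';

# strip_html_tags_multiline is a hypothetical name, not from the thread.
sub strip_html_tags_multiline {
    my ($text) = @_;
    # [^>]* and [^<]* also match newlines, so working on the whole string
    # lets a tag pair span several lines. Loop until no pair is left.
    1 while $text =~ s/<\s*$tagpattern(?:\s*>|\s+[^>]*>)([^<]*)<\s*\/\1[^>]*>/$2/ig;
    return $text;
}

my $html = "<p\n class=\"x\">one <b>two</b>\nthree</p>\n";
print strip_html_tags_multiline($html);
# prints:
# one two
# three
```

To use this on real input, the text would have to be read in one piece, for example with `my $html = do { local $/; <> };`, instead of the line-by-line `while (<>)` loop. The caveat about < or > signs inside a tag still applies.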

      update:

      It would probably be a lot wiser to use Ovid's code than my homegrown regex.

      ---- kurt
        I really love your idea! Thanks!

        --dda
