http://www.perlmonks.org?node_id=500603

lisaw has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am working on a small, flat, text-based program that allows a user to cut and paste html into an entry. I am trying to figure out how to remove comments that might happen to be in the pasted html...for example:
<B>Hello World</B> <!-- This comment would need to be removed from the entry --> <font color=red>It is Sunday!</FONT>
I've tried this but it doesn't work:
$question =~ s/<--[^<>]-->//g;
Any suggestions? Thank you!!

Re: Comment Removal
by saintmike (Vicar) on Oct 16, 2005 at 18:26 UTC
    Your substitution looks for <-- (you might have wanted <!--) and then matches a single character that's not > or <, followed by -->. Probably not what you want.

    A simple approach (although by no means 100% reliable) would be

    $question =~ s/<!--.*?-->//sg;

    To reliably remove comments from HTML, use HTML::Parser or a related module.
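    A minimal sketch of the HTML::Parser route, assuming the module is installed from CPAN. It follows the comment-stripping pattern shown in the module's own documentation: a default handler copies every event's original text through, and an empty comment handler suppresses comments. The names $html and $clean are just illustrative.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = '<B>Hello World</B> <!-- This comment would need to be removed from the entry --> <font color=red>It is Sunday!</FONT>';

my $clean = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # Every event without its own handler: append the original source text.
    default_h   => [ sub { $clean .= shift }, 'text' ],
    # An empty handler means comment events are dropped.
    comment_h   => [ '' ],
);
$parser->parse($html);
$parser->eof;    # flush anything still buffered

print "$clean\n";
```

    Only the comment itself is dropped, so the whitespace on either side of it stays in the output.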

    Or, even fancier, use File::Comments:

    #!/usr/bin/perl -w
    use strict;
    use File::Comments;

    my $stripper = File::Comments->new();
    my $stripped = $stripper->stripped("foo.html");
    print "$stripped\n";
Re: Comment Removal
by rnahi (Curate) on Oct 16, 2005 at 20:26 UTC

    Here is yet another method, based on Regexp::Common:

    use strict;
    use warnings;
    use Regexp::Common qw/comment/;   # <==

    my $html = <<'END_HTML';
    <p><b>Hello World</b> <!-- This comment would need to be removed from the entry --> </p><font color=red>It is Sunday!</font> <!-- also this--><p>this should stay here</p><!--this should go-->
    END_HTML

    $html =~ s/$RE{comment}{HTML}//g;   # <==
    print "$html\n";

    Result:

    <p><b>Hello World</b> </p><font color=red>It is Sunday!</font> <p>this should stay here</p>
Re: Comment Removal
by pg (Canon) on Oct 16, 2005 at 18:28 UTC
    use strict;

    my $question = "<B>Hello World</B>"
        . "<!-- This comment would need to be removed from the entry -->"
        . "<font color=red>It is Sunday!</FONT><!--blah-->";
    $question =~ s/<!--.*?-->//sg;
    print $question;
Re: Comment Removal
by TedPride (Priest) on Oct 17, 2005 at 03:39 UTC
    What about nested comments, however? It's much better to use a module for something like this, rather than try to work out your own regex.
    use strict;
    use warnings;

    $_ = join '', <DATA>;
    s/<!--.*?-->//gs;
    print;

    __DATA__
    <!-- comment -->text
    <!-- multi-line
    comment -->text2
    <!-- nested <!-- comment --> -->text3

    returns...

    text
    text2
     -->text3
Re: Comment Removal
by ambrus (Abbot) on Oct 16, 2005 at 21:50 UTC

    I think you've missed out just a bang and a star:

    $question =~ s/<!--[^<>]*-->//g;
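    One caveat with the character-class version: it stops at any < or > inside the comment, so commented-out markup survives. A small sketch comparing it with the non-greedy pattern from saintmike's reply; the sample string is made up for illustration.

```perl
use strict;
use warnings;

# Made-up input: the comment hides a piece of markup.
my $input = '<p>keep</p><!-- hidden <b>markup</b> --><p>also keep</p>';

# Character-class pattern: cannot cross the <b> inside the comment.
(my $char_class = $input) =~ s/<!--[^<>]*-->//g;

# Non-greedy pattern with /s: runs to the first "-->".
(my $non_greedy = $input) =~ s/<!--.*?-->//sg;

print "char class: $char_class\n";   # unchanged, the comment is still there
print "non-greedy: $non_greedy\n";   # <p>keep</p><p>also keep</p>
```

    Neither pattern is a substitute for a real parser, but for pasted snippets that may contain commented-out tags the non-greedy form is the safer of the two.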
Re: Comment Removal
by gube (Parson) on Oct 17, 2005 at 04:58 UTC

    Hi, see the code below and correct your error. It is safer, though, to remove the comments with the File::Comments module.

    Update:
    #!/usr/local/bin/perl
    $question = '<B>Hello World</B> <!-- This comment would need to be removed from the entry --> <font color=red>It is Sunday!</FONT>';
    $question =~ s/<!--[^-]+-->//g;
    print $question;

    o/p:
    <B>Hello World</B> <font color=red>It is Sunday!</FONT>

    Regards
    Gube
    $language = defined('perl') ? 'great :)' : 'die :(';

      No, that's wrong too. If the comment contains a dash ('-'), your regex won't remove it.

      use YAPE::Regex::Explain;
      print YAPE::Regex::Explain->new(qr(<!--[^-]+-->))->explain();
      __END__
      The regular expression:

      (?-imsx:<!--[^-]+-->)

      matches as follows:

      NODE                     EXPLANATION
      ----------------------------------------------------------
      (?-imsx:                 group, but do not capture (case-sensitive)
                               (with ^ and $ matching normally) (with . not
                               matching \n) (matching whitespace and #
                               normally):
      ----------------------------------------------------------
        <!--                     '<!--'
      ----------------------------------------------------------
        [^-]+                    any character except: '-' (1 or more times
                                 (matching the most amount possible))
      ----------------------------------------------------------
        -->                      '-->'
      ----------------------------------------------------------
      )                        end of grouping
      ----------------------------------------------------------

      Example:

      $ perl -le '$_="<!--rm me--><p>good</p><!--rm-too-->";s/<!--[^-]+-->//g;print'
      <p>good</p><!--rm-too-->
Re: Comment Removal
by saskaqueer (Friar) on Oct 17, 2005 at 15:24 UTC

    Regexp::Common matches an HTML comment with the regex shown next, so it's probably best to stick with a module like that in order to catch any nasty "unusual" cases of valid HTML comments.

    # not useful by itself, see update below
    (?k:(?k:<!)(?k:(?:--(?k:[^-]*(?:-[^-]+)*)--\s*)*)(?k:>))

    update: I hadn't seen the (?k:) usage before. Thanks to benizi, I now see that it is used internally within Regexp::Common's collection of modules to allow optional capturing of segments of the regular expression. From the docs:

    To specify such "optional" capturing parentheses within the regular expression associated with create, use the notation (?k:...). Any parentheses of this type will be converted to (...) when the -keep flag is specified, or (?:...) when it is not.

    So unless I've stripped out something that I shouldn't have, the above example would really boil down to the following (non-capturing) regex, useful for stripping HTML comments:

    /<!--[^-]*-[^-]+*--\s**>/

    second update: Obviously this won't compile, as per fireartist below. My apologies, I was finishing that up with less than a minute of computer time left :)

      That doesn't compile: +* and ** aren't valid quantifiers. The grouping parentheses are still needed.

      perl -MData::Dumper -MRegexp::Common -e 'print Dumper qr/$RE{comment}{HTML}/';
      $VAR1 = qr/(?-xism:(?:(?:<!)(?:(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*)(?:>)))/;
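      For readability, here is the same dumped pattern written out by hand with /x and comments. It is a sketch copied from the Data::Dumper output in this reply, not pulled from Regexp::Common at run time, and the variable names are only illustrative. The test string is the one from the dash example earlier in the thread.

```perl
use strict;
use warnings;

# Hand-expanded copy of the dumped pattern (grouping kept, captures dropped).
my $html_comment = qr{
    <!                        # markup declaration opens
    (?:
        --                    # a comment starts
        [^-]* (?: - [^-]+ )*  # body: single dashes are fine, "--" is not
        --                    # the comment ends
        \s*                   # optional whitespace after it
    )*
    >                         # markup declaration closes
}x;

my $text = '<!--rm me--><p>good</p><!--rm-too-->';
(my $stripped = $text) =~ s/$html_comment//g;
print "$stripped\n";    # prints "<p>good</p>"
```

      Unlike the [^-]+ version, the inner group lets a single dash appear inside the comment body, which is why <!--rm-too--> is removed here.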
Re: Comment Removal
by Spidy (Chaplain) on Oct 17, 2005 at 23:07 UTC
    One method I've used is to use this for comments in my HTML files:
    <!--Section CONTENT-->
    And then replace each marker with the text captured from inside the comment tags:
    $template =~ s/<!--Section (.+?)-->/$1/sg;
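    As written, that substitution replaces the marker with its own name (CONTENT). A possible variant, which is an assumption on top of the line above and not something from the original post, looks the captured name up in a hash using the /e modifier. The %sections hash and the sample template are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical template containing one section marker.
my $template = "<h1>Title</h1>\n<!--Section CONTENT-->\n<p>footer</p>\n";

# Hypothetical lookup table: section name => replacement text.
my %sections = ( CONTENT => '<p>Hello World</p>' );

# /e evaluates the replacement as Perl code; unknown section names become ''.
(my $page = $template)
    =~ s/<!--Section (.+?)-->/exists $sections{$1} ? $sections{$1} : ''/sge;

print $page;
```

    Ordinary comments that do not start with "Section " are left untouched by this pattern, so it can be combined with one of the comment-stripping approaches above.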