Comment Removal

lisaw has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am working on a small, flat, text-based program that lets a user cut and paste HTML into an entry. I am trying to figure out how to remove any comments that happen to be in the pasted HTML. For example:

<B>Hello World</B>
<!-- This comment would need to be removed from the entry -->
<font color=red>It is Sunday!</FONT>

I've tried this but it doesn't work:

$question =~ s/<--[^<>]-->//g;

Any suggestions? Thank you!!


Replies are listed 'Best First'.
Re: Comment Removal by saintmike (Vicar) on Oct 16, 2005 at 18:26 UTC
Your substitution looks for `<--` (you probably wanted `<!--`), then matches exactly one character that is neither `<` nor `>`, followed by `-->`. Probably not what you want. A simple approach (although by no means 100% reliable) would be

    $question =~ s/<!--.*?-->//sg;

To reliably remove comments from HTML, use HTML::Parser or a related module. Or, even fancier, use File::Comments:

    #!/usr/bin/perl -w
    use strict;
    use File::Comments;

    my $stripper = File::Comments->new();
    my $stripped = $stripper->stripped("foo.html");
    print "$stripped\n";
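To make the difference concrete, here is a minimal core-Perl sketch that runs both the original pattern and the non-greedy one on the sample from the question. The sub name `strip_comments` is made up for this illustration and does not come from any module, and the pattern remains the "simple, not 100% reliable" approach described above.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: remove <!-- ... --> spans with a
# non-greedy match; /s lets . cross newlines, /g removes every comment.
sub strip_comments {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//sg;
    return $html;
}

my $question = join "\n",
    '<B>Hello World</B>',
    '<!-- This comment would need to be removed from the entry -->',
    '<font color=red>It is Sunday!</FONT>';

# The original attempt: no "!", and [^<>] matches exactly one character,
# so it leaves the text untouched.
(my $original_attempt = $question) =~ s/<--[^<>]-->//g;
print "original pattern changed nothing\n" if $original_attempt eq $question;

print strip_comments($question), "\n";
```

Note that removing a comment that sits on its own line leaves an empty line behind; whether that matters depends on how the entry is displayed.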
Re: Comment Removal by rnahi (Curate) on Oct 16, 2005 at 20:26 UTC
Here is yet another method, based on Regexp::Common:

    use strict;
    use warnings;
    use Regexp::Common qw/comment/;   # <==

    my $html = <<'END_HTML';
    <p><b>Hello World</b>
    <!-- This comment would need to be removed from the entry -->
    </p><font color=red>It is Sunday!</font>
    <!-- also this--><p>this should stay here</p><!--this should go-->
    END_HTML

    $html =~ s/$RE{comment}{HTML}//g;   # <==
    print "$html\n";

Result:

    <p><b>Hello World</b>

    </p><font color=red>It is Sunday!</font>
    <p>this should stay here</p>
Re: Comment Removal by pg (Canon) on Oct 16, 2005 at 18:28 UTC
    use strict;

    my $question = "<B>Hello World</B>"
        . "<!-- This comment would need to be removed from the entry -->"
        . "<font color=red>It is Sunday!</FONT><!--blah-->";
    $question =~ s/<!--.*?-->//sg;
    print $question;
Re: Comment Removal by TedPride (Priest) on Oct 17, 2005 at 03:39 UTC
What about nested comments, however? It's much better to use a module for something like this, rather than try to work out your own regex.

    use strict;
    use warnings;

    $_ = join '', <DATA>;
    s/<!--.*?-->//gs;
    print;

    __DATA__
    <!-- comment -->text
    <!-- multi-line
    comment -->text2
    <!-- nested <!-- comment --> -->text3

returns...

    text
    text2
     -->text3
Re^2: Comment Removal by halley (Prior) on Oct 17, 2005 at 15:08 UTC
Nested comments are not supported per W3C standards.

http://www.w3.org/TR/html401/intro/sgmltut.html#h-3.2.4

As an aside, `<!-- blah -- >` is a valid comment per current standards. Note the final space.

-- `[ e d @ h a l l e y . c c ]`
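If you stay with a regex, the trailing-space form can be covered by allowing optional whitespace between the closing `--` and `>`. A minimal sketch, assuming that is the only extra case you care about; `strip_comments_ws` is a name invented for this example, and the pattern is still not a full SGML comment parser.

```perl
use strict;
use warnings;

# Hypothetical helper: like s/<!--.*?-->//sg, but also accepts whitespace
# between the closing "--" and ">".
sub strip_comments_ws {
    my ($html) = @_;
    $html =~ s/<!--.*?--\s*>//sg;
    return $html;
}

print strip_comments_ws('a<!-- blah -- >b<!-- plain -->c'), "\n";   # prints "abc"
```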
Re: Comment Removal by ambrus (Abbot) on Oct 16, 2005 at 21:50 UTC
I think you've missed out just a bang and a star: `$question =~ s/<!--[^<>]*-->//g;`
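One caveat with the `[^<>]*` form, offered as a hedged aside: it only matches comments that contain no `<` or `>`, so commented-out markup survives. The sketch below compares it with the non-greedy pattern suggested elsewhere in this thread; both sub names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns being compared.
sub strip_no_angle { my ($h) = @_; $h =~ s/<!--[^<>]*-->//g; return $h }
sub strip_lazy     { my ($h) = @_; $h =~ s/<!--.*?-->//sg;   return $h }

my $html = 'x<!-- plain -->y<!-- <b>old markup</b> -->z';

print strip_no_angle($html), "\n";   # the second comment is left in place
print strip_lazy($html),     "\n";   # both comments are removed
```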
Re: Comment Removal by gube (Parson) on Oct 17, 2005 at 04:58 UTC
Hi, see the code below and correct your error. But it is safer to remove the comments using the File::Comments module.

Update:

    #!/usr/local/bin/perl
    $question = '<B>Hello World</B>
    <!-- This comment would need to be removed from the entry -->
    <font color=red>It is Sunday!</FONT>';
    $question =~ s/<!--[^-]+-->//g;
    print $question;

    o/p:
    <B>Hello World</B>

    <font color=red>It is Sunday!</FONT>

Regards, Gube

`$language = defined('perl') ? 'great :)' : 'die :(';`
Re^2: Comment Removal by rnahi (Curate) on Oct 17, 2005 at 06:02 UTC
No, that's wrong too. If the comment contains a dash ('-'), your regex won't remove it.

    use YAPE::Regex::Explain;
    print YAPE::Regex::Explain->new( qr(<!--[^-]+-->) )->explain();
    __END__
    The regular expression:

    (?-imsx:<!--[^-]+-->)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------
    (?-imsx:                 group, but do not capture
                             (case-sensitive) (with ^ and $
                             matching normally) (with . not
                             matching \n) (matching whitespace
                             and # normally):
    ----------------------------------------------------------
      <!--                     '<!--'
    ----------------------------------------------------------
      [^-]+                    any character except: '-' (1 or
                               more times (matching the most
                               amount possible))
    ----------------------------------------------------------
      -->                      '-->'
    ----------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------

Example:

    $ perl -le '$_="<!--rm me--><p>good</p><!--rm-too-->";s/<!--[^-]+-->//g;print'
    <p>good</p><!--rm-too-->
Re: Comment Removal by saskaqueer (Friar) on Oct 17, 2005 at 15:24 UTC
Considering that Regexp::Common defines an HTML comment match with the following regex, it's probably best to stick with a module like that, so as to catch any nasty "unusual" cases of proper HTML comments.

    # not useful by itself, see update below
    (?k:(?k:<!)(?k:(?:--(?k:[^-]*(?:-[^-]+)*)--\s*)*)(?k:>))

update: I hadn't ever seen the (?k:) usage before, and thanks to benizi, I now see it is simply used internally within Regexp::Common's collection of modules to allow for optional capturing of segments of the common regular expression being used. From the docs:

To specify such "optional" capturing parentheses within the regular expression associated with create, use the notation (?k:...). Any parentheses of this type will be converted to (...) when the -keep flag is specified, or (?:...) when it is not.

So unless I've stripped something out that I shouldn't have, the above example would really boil down to the following (non-capturing) regex, useful for stripping HTML comments:

    /<!--[^-]*-[^-]+*--\s**>/

second update: Obviously this won't compile, as per fireartist below. My apologies, I was finishing that up with less than a minute of computer time left :)
Re^2: Comment Removal by fireartist (Chaplain) on Oct 17, 2005 at 20:06 UTC
That doesn't compile: `+*` and `**` aren't valid. The grouping parentheses are still needed.

    perl -MData::Dumper -MRegexp::Common -e 'print Dumper qr/$RE{comment}{HTML}/';
    $VAR1 = qr/(?-xism:(?:(?:<!)(?:(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*)(?:>)))/;
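For readers without Regexp::Common installed, here is a rough core-Perl sketch of a pattern with the same shape as the one dumped above, spelled out with /x for readability. The quantifier placement is my reading of that pattern, and the names `$html_comment` and `strip_sgml_comments` are made up here, so treat this as an assumption and not as the module's exact source.

```perl
use strict;
use warnings;

# Assumed shape: "<!", then zero or more "-- text --" groups (each may be
# followed by whitespace), then ">".
my $html_comment = qr{
    <!
    (?:
        --                    # comment opener
        [^-]* (?: -[^-]+ )*   # text; single dashes allowed, a double dash is not
        --                    # comment closer
        \s*                   # optional whitespace before the final ">"
    )*
    >
}x;

# Hypothetical helper: delete everything the pattern above matches.
sub strip_sgml_comments {
    my ($html) = @_;
    $html =~ s/$html_comment//g;
    return $html;
}

print strip_sgml_comments('<!--rm me--><p>good</p><!--rm-too--><!-- blah -- >'), "\n";
```

Unlike `[^-]+`, this form accepts single dashes inside the comment text, and the `\s*` covers the trailing-space variant. Because the group may repeat zero times, it also matches a bare `<!>`.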
Re: Comment Removal by Spidy (Chaplain) on Oct 17, 2005 at 23:07 UTC
One method I've used is to write the comments in my HTML files like this:

    <!--Section CONTENT-->

And then replace each such comment with whatever is inside the comment tags:

    $template =~ s/<!--Section (.+?)-->/$1/sg;
