PerlMonks  

Comment Removal

by lisaw (Beadle)
on Oct 16, 2005 at 18:17 UTC ( #500603=perlquestion )
lisaw has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am working on a small, flat, text-based program that allows a user to cut and paste HTML into an entry. I am trying to figure out how to remove any comments in the pasted HTML. For example:
<B>Hello World</B> <!-- This comment would need to be removed from the entry --> <font color=red>It is Sunday!</FONT>
I've tried this but it doesn't work:
$question =~ s/<--[^<>]-->//g;
Any suggestions? Thank you!!

Re: Comment Removal
by saintmike (Vicar) on Oct 16, 2005 at 18:26 UTC
    Your substitution looks for <-- (you might have wanted <!--) and then matches a single character that's not > or <, followed by -->. Probably not what you want.

    A simple approach (although by no means 100% reliable) would be

    $question =~ s/<!--.*?-->//sg;
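To see why both the non-greedy `.*?` and the /s modifier matter in that substitution, here is a minimal sketch in core Perl. The sample string and the variable names are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample: two comments, the second spanning a newline.
my $html = "<b>a</b><!-- one -->keep<!-- two\nlines -->end";

# Greedy .* runs from the first <!-- to the last -->, eating "keep" too.
(my $greedy = $html) =~ s/<!--.*-->//sg;

# Non-greedy .*? stops at the nearest -->, so each comment goes separately.
(my $lazy = $html) =~ s/<!--.*?-->//sg;

# Without /s, "." does not match a newline, so the two-line comment survives.
(my $no_s = $html) =~ s/<!--.*?-->//g;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "no /s:  $no_s\n";
```

Only the non-greedy version with /s removes both comments while keeping the text between them.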

    To reliably remove comments from HTML, use HTML::Parser or a related module.

    Or, even fancier, use File::Comments:

    #!/usr/bin/perl -w
    use strict;
    use File::Comments;

    my $stripper = File::Comments->new();
    my $stripped = $stripper->stripped("foo.html");
    print "$stripped\n";
Re: Comment Removal
by pg (Canon) on Oct 16, 2005 at 18:28 UTC
    use strict;

    my $question = "<B>Hello World</B>"
        . "<!-- This comment would need to be removed from the entry -->"
        . "<font color=red>It is Sunday!</FONT><!--blah-->";
    $question =~ s/<!--.*?-->//sg;
    print $question;
Re: Comment Removal
by rnahi (Curate) on Oct 16, 2005 at 20:26 UTC

    Here is yet another method, based on Regexp::Common:

    use strict;
    use warnings;
    use Regexp::Common qw/comment/; # <==

    my $html = <<'END_HTML';
    <p><b>Hello World</b>
    <!-- This comment would need to be removed from the entry -->
    </p><font color=red>It is Sunday!</font>
    <!-- also this--><p>this should stay here</p><!--this should go-->
    END_HTML

    $html =~ s/$RE{comment}{HTML}//g; # <==
    print "$html\n";

    Result:

    <p><b>Hello World</b> </p><font color=red>It is Sunday!</font> <p>this should stay here</p>
Re: Comment Removal
by ambrus (Abbot) on Oct 16, 2005 at 21:50 UTC

    I think you've missed out just a bang and a star:

    $question =~ s/<!--[^<>]*-->//g;
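One caveat with the `[^<>]*` character class is that it refuses to cross any angle bracket, so a comment that wraps markup is left in place. A small sketch, using made-up sample strings:

```perl
use strict;
use warnings;

# Hypothetical inputs: one plain comment, one comment containing tags.
my $simple = "<b>x</b><!-- plain note -->y";
my $markup = "<b>x</b><!-- <i>old</i> -->y";

# Apply the substitution from the post to both strings in place.
s/<!--[^<>]*-->//g for $simple, $markup;

print "$simple\n";   # the plain comment is removed
print "$markup\n";   # the comment with tags inside is untouched
```

This is fine when pasted comments never contain tags, but commented-out markup will slip through.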
Re: Comment Removal
by TedPride (Priest) on Oct 17, 2005 at 03:39 UTC
    What about nested comments, however? It's much better to use a module for something like this, rather than try to work out your own regex.
    use strict;
    use warnings;

    $_ = join '', <DATA>;
    s/<!--.*?-->//gs;
    print;

    __DATA__
    <!-- comment -->text
    <!-- multi-line
    comment -->text2
    <!-- nested <!-- comment --> -->text3

    returns...

    text
    text2
     -->text3
Re: Comment Removal
by gube (Parson) on Oct 17, 2005 at 04:58 UTC

    Hi, see the code below and correct your error. That said, it is safer to remove comments with the File::Comments module.

    Update:
    #!/usr/local/bin/perl
    $question = '<B>Hello World</B> <!-- This comment would need to be removed from the entry --> <font color=red>It is Sunday!</FONT>';
    $question =~ s/<!--[^-]+-->//g;
    print $question;

    o/p:
    <B>Hello World</B> <font color=red>It is Sunday!</FONT>

    Regards
    Gube
    $language = defined('perl') ? 'great :)' : 'die :(';

      No, that's wrong too. If the comment contains a dash ('-') your regex won't remove it.

      use YAPE::Regex::Explain;
      print YAPE::Regex::Explain->new(qr(<!--[^-]+-->) )->explain();
      __END__
      The regular expression:

      (?-imsx:<!--[^-]+-->)

      matches as follows:

      NODE                     EXPLANATION
      ----------------------------------------------------------------------
      (?-imsx:                 group, but do not capture (case-sensitive)
                               (with ^ and $ matching normally) (with . not
                               matching \n) (matching whitespace and #
                               normally):
      ----------------------------------------------------------------------
        <!--                     '<!--'
      ----------------------------------------------------------------------
        [^-]+                    any character except: '-' (1 or more times
                                 (matching the most amount possible))
      ----------------------------------------------------------------------
        -->                      '-->'
      ----------------------------------------------------------------------
      )                        end of grouping
      ----------------------------------------------------------------------

      Example:

      $ perl -le '$_="<!--rm me--><p>good</p><!--rm-too-->";s/<!--[^-]+-->//g;print'
      <p>good</p><!--rm-too-->
Re: Comment Removal
by saskaqueer (Friar) on Oct 17, 2005 at 15:24 UTC

    Considering that Regexp::Common matches an HTML comment with the following regex, it's probably best to stick with such a module, so as to catch any nasty "unusual" cases of proper HTML comments.

    # not useful by itself, see update below
    (?k:(?k:<!)(?k:(?:--(?k:[^-]*(?:-[^-]+)*)--\s*)*)(?k:>))

    update: I hadn't ever seen the (?k:) usage before, and thanks to benizi, I now see it is simply used internally within Regexp::Common's collection of modules to allow for optional capturing of segments of the common regular expression being used. From the docs:

    To specify such "optional" capturing parentheses within the regular expression associated with create, use the notation (?k:...). Any parentheses of this type will be converted to (...) when the -keep flag is specified, or (?:...) when it is not.

    So unless I've stripped something out that I shouldn't have, the above example would really boil down to the following (non-capturing) regex, useful for stripping HTML comments:

    /<!--[^-]*-[^-]+*--\s**>/

    second update: Obviously this won't compile, as per fireartist below. My apologies, I was finishing that up with less than a minute of computer time left :)

      That doesn't compile: +* and ** aren't valid. The grouping parentheses are still needed.

      perl -MData::Dumper -MRegexp::Common -e 'print Dumper qr/$RE{comment}{HTML}/';
      $VAR1 = qr/(?-xism:(?:(?:<!)(?:(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*)(?:>)))/;
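For readers without Regexp::Common installed, the pattern in the Dumper output above can be tried in plain Perl. This is only a sketch: the pattern is transcribed from that dump with the outer wrapper groups dropped, and the variable names and sample string are chosen here for illustration.

```perl
use strict;
use warnings;

# Pattern transcribed from the Dumper output; $html_comment is a name picked for this sketch.
my $html_comment = qr/<!(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*>/;

# Sample reused from the dash counterexample earlier in the thread.
my $html = "<!--rm me--><p>good</p><!--rm-too-->";

(my $stripped = $html) =~ s/$html_comment//g;
print "$stripped\n";
```

Unlike `<!--[^-]+-->`, this version accepts single dashes inside a comment, so both comments in the sample are removed. For real use, loading the pattern through Regexp::Common as shown earlier is probably the safer choice.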
Re: Comment Removal
by Spidy (Chaplain) on Oct 17, 2005 at 23:07 UTC
    One method I've used is to use this for comments in my HTML files:
    <!--Section CONTENT-->
    And then replace each one with what's inside the comment tags, like this:
    $template =~ s/<!--Section (.+?)-->/$1/sg;
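A short sketch of what that substitution does to a marker, plus one possible variation. The `%section` hash and its contents are hypothetical and not part of the post; they only illustrate how the captured name could be used as a lookup key.

```perl
use strict;
use warnings;

my $template = "<div><!--Section CONTENT--></div>";

# As posted: the marker is replaced by the text captured inside it.
(my $unwrapped = $template) =~ s/<!--Section (.+?)-->/$1/sg;

# Hypothetical variation: treat the captured name as a key into a hash
# of section bodies, substituting an empty string for unknown names.
my %section = (CONTENT => "<p>Hello</p>");
(my $filled = $template) =~
    s{<!--Section (.+?)-->}{ exists $section{$1} ? $section{$1} : '' }sge;

print "$unwrapped\n";
print "$filled\n";
```

The /e modifier in the variation evaluates the replacement as Perl code, which is what allows the hash lookup.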
