PerlMonks  

Regex to strip comments

by zuma53 (Beadle)
on Oct 01, 2012 at 00:11 UTC ( #996552=perlquestion )
zuma53 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to remove block comments of the /* comment */ kind.

I've gulped the file into one string and have removed the block comments by using:

$TEXT =~ s/\/\*(.*?)\*\///g

This works most of the time, until, you guessed it, there's an embedded text string with an innocuous '/*' or '*/' somewhere which gums up the regex.

This has probably been asked and answered a thousand times over already, but I have not found a working solution.

Thanks.

Re: Regex to strip comments
by jwkrahn (Monsignor) on Oct 01, 2012 at 00:26 UTC
Re: Regex to strip comments
by GrandFather (Cardinal) on Oct 01, 2012 at 00:36 UTC

    In essence you can't do it with a regex without constraining yourself to the simpler cases. You need to parse a fair chunk of the language to be able to distinguish between comments and things that look like comments in other contexts, such as in strings, as you have noted already. However, dealing with nested comments is tough on regexen and is the real killer.

    Note that the regex you give won't work as expected across multiple lines because . doesn't match newline characters unless you use the /s option.
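A minimal sketch of the /s point, using a made-up two-line input (the variable names and the SQL text are illustrative only, not from the original question):

```perl
use strict;
use warnings;

# Hypothetical input with a block comment spanning two lines.
my $text = "SELECT a /* first\nsecond */ FROM t";

# Without /s, '.' cannot cross the newline, so the comment is never matched.
(my $without_s = $text) =~ s{/\*.*?\*/}{}g;

# With /s, '.' also matches newlines and the whole comment is removed.
(my $with_s = $text) =~ s{/\*.*?\*/}{}gs;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

This only demonstrates the modifier; it still has the string-literal problem described in the question.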

    True laziness is hard work
Re: Regex to strip comments
by BrowserUk (Pope) on Oct 01, 2012 at 00:39 UTC

    You could try this:

    $t = 'fred /* bill "jack \" */ \xAB \t john\n" mick */ mary /* /* \u12ef */ jane';;

    ( $s = $t ) =~ s[(/\*(?:"(?:\\\\|\\[abfnrt]|\\u[0-9a-fA-F]{4}|\\x[0-9a-fA-F]{2,4}|\\"|[^"])+?"|.)*?\*/)][]msg;

    print $s;;
    fred  mary  jane

    There are probably some edge cases I've missed, but they should be fixable.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      String literals aren't parsed inside of comments, as your code seems to assume. It is only the string literals outside of comments where '/*' needs to be ignored. (And, despite the OP's claim, '*/' in a string literal isn't a problem.)

      - tye        

        (And, despite the OP's claim, '*/' in a string literal isn't a problem.)

        The OP didn't identify the language involved, so I took him at his word.

        Seems I don't have the 'I-know-better-than-the-OP' gene that you and several others around here have. I don't miss it.



Re: Regex to strip comments
by kcott (Abbot) on Oct 01, 2012 at 00:39 UTC
    "This has probably been asked and answered a thousand times over already, but I have not found a working solution."

    There are tools available to do this. They're called search engines!

    I searched for: Perl regex to strip C comments. The first result provided the answer. Try this yourself - the practice will do you good.

    -- Ken

      Thanks for the suggestion.

      The code is SQL and I am not a C programmer, so I was unaware they are similar.

        Removing C from the search, i.e. Perl regex to strip comments, I still get the same answer as the first result.

        Even removing Perl from the search and using your node title verbatim, i.e. Regex to strip comments, the third result gives the same answer.

        -- Ken

Re: Regex to strip comments
by clueless newbie (Friar) on Oct 01, 2012 at 01:01 UTC

    I'm afraid that it's hardly a single regex, but the following code works on nested /* /* */ */ and " ... */ ...", etc.

    #!/usr/bin/perl
    use Data::Dumper;
    use Params::Validate qw(:types);
    use strict;
    use warnings;
    use 5.10.0;

    local $/="\n\n";
    for my $in (<DATA>) {
        chomp($in);
        my ($out,$comment)=RemoveComments($in);
        say $out;
        # Now putting the comments back
        for my $pos (sort { $b <=> $a } keys %$comment) {
            substr($out,$pos,$comment->{$pos}{length})=$comment->{$pos}{body};
        };
        print $out."\n\n";
    };
    exit;

    sub RemoveComments { # Now handles nested /* */ and --
        @_=Params::Validate::validate_pos(@_,{ type=>SCALAR });
        my ($in)=@_;
        my (%comment,$comment_begins);
        my $stackptr=0;
        local *foo=sub {
            @_=Params::Validate::validate_pos(@_,{ type=>SCALAR }, { type=>SCALAR,default=>0 });
            my ($string,$forced)=@_;
            if ($forced || $stackptr > 0) {
                $comment{$comment_begins}{length}+=length($string);
                $comment{$comment_begins}{body}.=$string;
                $string=~ s{.}{ }mg;
            };
            return $string;
        }; # foo:;
        my $out='';
        my $pos=0;
        while ($in !~ m{\G$}cg) {
            if ($in =~ m{\G((?:/\*)+)}cg) { # /*
                $comment_begins=$pos if ($stackptr == 0);
                $stackptr+=length($1)/2;
                $out.=foo($1);
                $pos=pos($in);
            } elsif ($in =~ m{\G((?:\*/)+)}cg) { # */
                $out.=foo($1);
                $stackptr-=length($1)/2;
                $pos=pos($in);
                die "Too many closing '*/'! \$stackptr($stackptr) has gone negative!\n" if ($stackptr < 0);
            } elsif ($stackptr == 0 && $in =~ m{\G(--+.*$)}cgm) { # -- comment not in a /* */ comment
                $comment_begins=$pos;
                $out.=foo($1,1);
                $pos=pos($in);
            } elsif ($stackptr > 0 && $in =~ m{\G(--+)}cgs) { # might be a -- comment but it's in a /* */ comment
                $out.=foo($1);
                $pos=pos($in);
            } elsif ($in =~ m{\G('(?:[^']|'')*'|"(?:[^"]|"")*")}cgs) { #' # ' or " quoted string
                $out.=foo($1);
                $pos=pos($in);
            } elsif ($in =~ m{\G([^'"]+?(?=\*/|/\*|--|'|"|$))}cgs) { # up to /*,*/,--,',",\z
                $out.=foo($1);
                $pos=pos($in);
            } else { # Everything should be caught in one of the cases before!
                warn "WTF!";
                my $pos=pos($in);
                my $residue=substr($in,$pos);
                die Data::Dumper->Dump([\$pos,\$residue],[qw(*pos *residue)]);
            };
        };
        return $out,\%comment;
    }; # RemoveComments:
    __DATA__
    0/*3--6*/90123456789
    01234/*789012*/56789
    0/*3456*/90123456789
    01--4567890123456789
    01234/*789012*/56789
    -- /*567890123456789
    01234567890123456789
    -- */567890123456789
    012/*567890123456789
    01234/*7890123456789
    01234567890123456789
    01234*/7890123456789
    01*/4567890123456789
    '123456/**/12345678'
    0'234567--01234567'9
    01234567890123456789
    '123456/**/12345678'
    01234567890123456789
    -- /* /* -- */ code code -- */ /* bah */ /* /* */ */ yada /* x */
Re: Regex to strip comments (match strings)
by tye (Cardinal) on Oct 01, 2012 at 04:31 UTC
    s{
        (\s*/[*].*?[*]/\s*)         # $1: /* comment */
    |   \s*//[^\n]*                 # // comment
    |   (                           # $2: something to keep:
            '(?:[^\\']+|\\.)*'      # '\t'
        |   "(?:[^\\"]+|\\.)*"      # "string"
        |   /(?![/*])               # Non-comment /
        |   [^'"/]+                 # Other code
        )
    |   (.)                         # $3: A syntax error (unclosed ' or ")
    }{
        if( defined $3 ) {
            warn "Ignoring syntax error ($3) at byte ", pos(), $/;
        }
        $1 ? ' ' :                  # "foo /*...*/bar" => "foo bar"
        defined $2 ? $2 :           # Keep non-comment as-is
        defined $3 ? $3             # Keep syntax error as-is
        : ''                        # "foo // ...\n" => "foo\n"
    }gsex;

    You just have to teach your regex to match things that might contain '/*' characters that don't represent comments. This mostly boils down to string literals. Though, if there is a chance of "// end-of-line" comments, then you have to match those as well. My code above strips them too.
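Since the OP later says the input is SQL, the same idea might look roughly like the sketch below. It assumes SQL-style quoting (a doubled quote escapes a quote inside a string) and non-nested comments; `strip_block_comments` is a hypothetical helper name, not something from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: strip /* ... */ comments from SQL-like text
# while leaving quoted strings untouched. Nested comments are NOT handled.
sub strip_block_comments {
    my ($sql) = @_;
    $sql =~ s{
        (                       # group 1: text to keep as-is
            '(?:[^']|'')*'      #   single-quoted string, '' escapes a quote
        |   "(?:[^"]|"")*"      #   double-quoted identifier or string
        )
    |   /\*.*?\*/               # a block comment, which is dropped
    }{ defined $1 ? $1 : '' }gsex;
    return $sql;
}

my $in  = "SELECT '/* not a comment */' /* real\ncomment */ FROM t";
my $out = strip_block_comments($in);
print "$out\n";
```

Because the string alternatives are tried at every position before the comment alternative, a '/*' inside a closed string is consumed as part of the string and never starts a comment match.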

    (Updates made shortly after posting below:)

    If you want to be defensive against mistakes in your regex or in your understanding of the syntax you are trying to parse, then you can add \G(?: and ) around the regex in order to prevent the possibility of it just skipping over unhandled stuff. You can then also specifically match "end of string" for similar reasons. I think the "(.)" case is simple enough that I have little worry of getting that part of the regex wrong and it serves the "misunderstood syntax" and "don't skip bits, including at end of string" purposes well enough.
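A rough illustration of the \G(?: ... ) idea, using a deliberately tiny toy grammar (lowercase words, whitespace and block comments) that is an assumption made for this sketch only. The '#' in the input stands in for syntax the regex does not understand:

```perl
use strict;
use warnings;

# Hypothetical input: '#' is not covered by the toy grammar below.
my $code = "a /* x */ # b /* y */ c";

# Unanchored: the regex silently skips the unhandled '#' and keeps
# stripping comments further along.
(my $loose = $code) =~ s{ /[*].*?[*]/ | ([a-z\s]+) }{ defined $1 ? $1 : '' }gsex;

# Anchored with \G: every match must start where the previous one ended,
# so processing stops at the unhandled '#' and the rest is left untouched.
(my $strict = $code) =~ s{ \G(?: /[*].*?[*]/ | ([a-z\s]+) ) }{ defined $1 ? $1 : '' }gsex;

print "loose:  $loose\n";
print "strict: $strict\n";
```

In the anchored version the second comment survives, which makes the unhandled input visible instead of quietly processed around.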

    - tye        

Node Type: perlquestion [id://996552]
Approved by GrandFather