PerlMonks  

need regex help to strip things like embedded C comments

by Eradicatore (Monk)
on Jul 21, 2007 at 22:22 UTC
Eradicatore has asked for the wisdom of the Perl Monks concerning the following question:

I have always wondered about the answer to this question, but never stuck with it long enough to figure it out. I'm wondering if anyone here can help. Basically, let's say you have this C source code:
/* this is a test * c file /* with an embedded comment */ * in the middle * */ int main(void) { int a; int b; }
Now let's say you want to write a regex that strips out all C comments from the given C file. The way I was thinking about it would be to use a non-greedy regex to strip out only the non-embedded comments first, and then do multiple passes to ensure you got all the comments out.

But then the trick is, I know how to use something like this to do character negation:

$var =~ s/abc[^xyz]*?def//;
That will match a string like "abcjjjkkklllmmmdef" and delete it (to keep the delimiters and end up with "abcdef", capture them: s/(abc)[^xyz]*?(def)/$1$2/), but it would NOT match if there were an x, y, or z in the inside part, as in "abcjjjkkkxlllmmmdef".

Now back to the comments in C source code. I can't use single-character negation. I want to NOT have a two-character pattern in the middle part.

I've looked at things like negative lookahead and negative lookbehind, but I just don't think they work either.

Any regex experts out there that can answer this puzzle?

NOTE: it should also not assume there are no stars in the embedded comment. Or in other words, it should handle this too:

/* this is a test * c file /* with an embedded * multiline * comment */ * in the middle * */ int main(void) { int a; int b; }

Justin Eltoft

"If at all god's gaze upon us falls, its with a mischievous grin, look at him" -- Dave Matthews

Re: need regex help to strip things like embedded C comments
by jimt (Chaplain) on Jul 21, 2007 at 22:47 UTC
    printf("remember, /* this isn't a comment */");
    printf("/* This isn't a comment %s */", /* but this is */ "/* this isn't, though */");
    // /* The code between these lines
    int x = y + z;
    // is not commented out. The line level comments take precedence */

    In short, this is a potentially nasty problem. I would strongly recommend you read the very excellent Mastering Regular Expressions from O'Reilly by Jeffrey Friedl.

    Regardless, any solution you come up with using regexes will probably only operate on a carefully crafted subset of the data, so proceed with caution. There are also variations depending upon which implementation of C you're using.

      Thanks for the reply! Yes, I agree there are definitely going to be gradations of how well any solution works here. I don't need anything perfect. But I'm mainly wondering about my one specific question: how to get something like character negation, but have it be more like "pattern negation". In this case, the pattern I don't want inside the non-greedy regex is /*, which opens a new C comment.
Re: need regex help to strip things like embedded C comments
by wind (Priest) on Jul 21, 2007 at 23:46 UTC
Re: need regex help to strip things like embedded C comments
by graff (Chancellor) on Jul 22, 2007 at 00:24 UTC
    Frankly, the way to handle this is with a parser -- something that, in effect, marches through character by character, maintains state information, and gives back chunks of the data with categorizations that you want: comment vs. not-comment. (Since it takes two characters to know you've entered or left a comment, the parser needs to know to look for the second character when it sees the first.)

    The state information you need to maintain in this case is the alternation among "not-in-comment-or-quote", "in-quote", and "in-comment". You start out in the first of those, and as soon as you enter either of the others (by detecting an open-quote or open-comment), nothing else matters until you detect the character (pair) that takes you out of that state, putting you back to "not-in-comment-or-quote".

    So look at Parse::RecDescent -- I suspect that someone has already come up with a parser spec to handle C-like comments.
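The state machine described above can be sketched in plain Perl without any parser module. This is a minimal illustration and not code from the thread: the name strip_c_comments is hypothetical, it tracks character literals as a fourth state in addition to the three named above, it ignores // line comments, and it treats comments as non-nesting, as real C does.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): strip /* ... */ comments with a
# character-by-character state machine. Comments do not nest, as in real C.
sub strip_c_comments {
    my ($src) = @_;
    my $out   = '';
    my $state = 'code';          # 'code', 'string', 'char', or 'comment'
    my $len   = length $src;
    my $i     = 0;

    while ($i < $len) {
        my $c   = substr $src, $i, 1;
        my $two = substr $src, $i, 2;   # current char plus one of lookahead

        if ($state eq 'code') {
            if ($two eq '/*') { $state = 'comment'; $i += 2; next }
            $state = 'string' if $c eq '"';
            $state = 'char'   if $c eq "'";
            $out .= $c;
            $i++;
        }
        elsif ($state eq 'comment') {
            # nothing matters in here except the closing pair
            if ($two eq '*/') { $state = 'code'; $i += 2 }
            else              { $i++ }
        }
        else {   # inside a string or character literal
            my $quote = $state eq 'string' ? '"' : "'";
            if ($c eq '\\') {
                # copy the backslash and the escaped character together
                $out .= $two;
                $i += 2;
                next;
            }
            $state = 'code' if $c eq $quote;
            $out .= $c;
            $i++;
        }
    }
    return $out;
}

print strip_c_comments(qq{printf("/* kept */"); /* dropped */ int a;\n});
```

The comment marker inside the string literal survives because the scanner is in the 'string' state when it sees it. Supporting nested comments would mean replacing the 'comment' state with a depth counter.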

      Thanks all. I agree, a parser can do this. But I figured regex *may* be powerful enough to do it also. I'm pretty darn good at regex, but this one was beyond me.

      I did look at C::Scan, but it didn't run for me on Windows-based (ActiveState) Perl and had next to no documentation, so I skipped it.

      ... I suspect that someone has already come up with a parser spec to handle C-like comments.

      FWIW, Parse::RecDescent comes with a little demo script (demo_decomment_nonlocal.pl), which does about that. It doesn't handle nested comments though (just like C/C++), but it could certainly be extended to handle that, too...

Re: need regex help to strip things like embedded C comments
by brian_d_foy (Abbot) on Jul 22, 2007 at 00:44 UTC

    C doesn't have nested comments, and your code example doesn't compile (in gcc at least; maybe you have some odd compiler). Are you trying to find the nested comments and remove them so you have compilable code? If so, you'll need to do a lot more work and probably use a parser.

    For legal C comments, there is an answer in perlfaq6, "How do I use a regular expression to strip C style comments from a file", and it includes the regex from Jeffrey Friedl. If anyone has good information to add to that answer, send me the updates. :)
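As a rough illustration of the idea behind that kind of answer, here is a simplified sketch. It is NOT the perlfaq6 regex; the function name strip_comments_regex is made up for this example. The trick is to match string and character literals first and put them back, so that comment markers inside them are never treated as comments.

```perl
use strict;
use warnings;

# Simplified sketch, NOT the perlfaq6 regex: literals are captured and kept,
# comments are matched without a capture and therefore replaced by nothing.
sub strip_comments_regex {
    my ($src) = @_;
    $src =~ s{
        (                             # capture group 1: text to keep
            " (?: \\. | [^"\\] )* "   # double-quoted string literal
          | ' (?: \\. | [^'\\] )* '   # character literal
        )
      | /\* .*? \*/                   # block comment, non-nested
      | // [^\n]*                     # line comment up to end of line
    }{ defined $1 ? $1 : '' }gsex;
    return $src;
}

print strip_comments_regex(
    qq{printf("/* not a comment %s */", /* but this is */ "/* nor this */");\n}
);
```

This still only covers a subset of real C (no line continuations, trigraphs, or preprocessor oddities), so treat it as a starting point.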

    Good luck :)

    Update: Corrected the perlfaq6 number, and, as ikegami notes, that FAQ deals with C comments, not nested comments. The OP said he had C source *shrug*

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
        The OP's comments are not C-style comments. C doesn't support nesting of comments, and neither does the code at the link you provided.
Re: need regex help to strip things like embedded C comments
by syphilis (Canon) on Jul 22, 2007 at 00:44 UTC
    Hi Eradicatore,

    You can't embed comments like that in C:
    C:\_32\C>type try.c
    /* this is a test * c file /* with an embedded comment */ * in the middle * */ int main(void) { int a; int b; return 0; }

    C:\_32\C>gcc -o try.exe try.c
    try.c:3: error: syntax error before "the"
    try.c:11: error: syntax error before "return"

    C:\_32\C>
    The comment is deemed to end at the first occurrence of */.
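A non-greedy Perl match behaves the same way, which is a quick way to see what the compiler is left with. This is only a small sketch with a made-up input string:

```perl
use strict;
use warnings;

my $src = "/* outer /* inner */ still outer */ int a;";

# Non-greedy .*? stops at the first */, which is how a C compiler sees it.
(my $stripped = $src) =~ s{/\*.*?\*/}{}s;

print "$stripped\n";   # prints " still outer */ int a;"
```

The leftover "still outer */" is what produces the syntax errors in the gcc output above.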

    Cheers,
    Rob
Re: need regex help to strip things like embedded C comments
by ikegami (Pope) on Jul 22, 2007 at 04:20 UTC

    I want to NOT have a two-character pattern in the middle part.

    (?!re).

    is for regexps what

    [^a]

    is for characters. So

    (?:(?!re).)*

    would restrict the presence of the regexp like

    [^a]*

    would restrict the presence of characters where .* would otherwise be used.

    You can combine this trick with the code in the example for (??{...}) in perlre (which handles nested parens).
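Both ideas can be sketched as follows. These are illustrations under stated assumptions, not code from perlre or from this thread: the function names are hypothetical, strings and // comments are ignored, and the input is the OP's non-standard nested comment style. The first function uses the negated-pattern trick for the multi-pass approach the OP described; the second adapts the shape of the perlre nested-parentheses example with (??{...}) to match a whole nested comment in one pass.

```perl
use strict;
use warnings;

# Multi-pass: repeatedly delete innermost comments, i.e. a /* ... */ whose
# body contains no further opening marker. The (?: (?! /\* ) . )*? part is
# the "pattern negation": any character that does not start "/*".
sub strip_nested_multipass {
    my ($src) = @_;
    1 while $src =~ s{ /\* (?: (?! /\* ) . )*? \*/ }{}xsg;
    return $src;
}

# Single pass: a comment is an opener, then any mix of delimiter-free text
# and complete nested comments, then a closer. The pattern refers to itself
# through a deferred (??{ ... }) block, so the variable is declared first.
my $nested;
$nested = qr{
    /\*
    (?:
        (?> (?: (?! /\* | \*/ ) . )+ )   # run of text with no delimiter
      | (??{ $nested })                  # or a complete nested comment
    )*
    \*/
}xs;

sub strip_nested_onepass {
    my ($src) = @_;
    $src =~ s/$nested//g;
    return $src;
}

my $example = '/* a /* b * c */ d * */ int a;';
print strip_nested_multipass($example), "\n";
print strip_nested_onepass($example), "\n";
```

Both calls should leave only the code after the outer comment. Stray stars inside the comments are harmless because only the two-character sequences /* and */ are treated as delimiters.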

      Thanks all for the "comments". (pun intended). I agree, this code won't compile. I mostly was just curious about the regex part of my question. Suppose I should have used a more "real" example. :)

      I will test out what you said Ik! Thanks!!
