s{
    (\s*/[*].*?[*]/\s*)     # $1: /* comment */
  | \s*//[^\n]*             # // comment
  | (                       # $2: something to keep:
        '(?:[^\\']+|\\.)*'  #   '\t'
      | "(?:[^\\"]+|\\.)*"  #   "string"
      | /(?![/*])           #   Non-comment /
      | [^'"/]+             #   Other code
    )
  | (.)                     # $3: A syntax error (unclosed ' or ")
}{
    if(  defined $3  ) {
        warn "Ignoring syntax error ($3) at byte ", $-[0], $/;  # $-[0]: offset where this match began
    }
    $1          ? ' ' :     # "foo/*...*/  bar" => "foo bar"
    defined $2  ? $2  :     # Keep non-comment as-is
    defined $3  ? $3        # Keep syntax error as-is
                : ''        # "foo// ...\n" => "foo\n"
}gsex;

You just have to teach your regex to match things that might contain '/*' characters that don't represent comments. This mostly boils down to string literals. Though, if there is a chance of "// end-of-line" comments, then you have to match those as well. My code above strips them too.
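To illustrate why the string literals matter, here is a minimal sketch (not part of the code above) that compares a naive comment stripper with one that matches double-quoted strings first. The sample input and the variable names $src, $naive and $aware are made up for the example.

```perl
use strict;
use warnings;

# Made-up sample: the string literal contains something that looks like a comment.
my $src = qq{char *s = "/* not a comment */";/* real comment */\n};

# Naive approach: remove anything that looks like /* ... */.
(my $naive = $src) =~ s{/[*].*?[*]/}{}gs;

# String-aware approach: match string literals first and put them back.
(my $aware = $src) =~ s{
    ( "(?:[^\\"]+|\\.)*" )      # group 1: a double-quoted string to keep
  | /[*].*?[*]/                 # a real comment to drop
}{ defined $1 ? $1 : '' }gsex;

print "naive: $naive";          # the string literal has been emptied
print "aware: $aware";          # the string literal survives
```

The naive version empties the string literal; the string-aware version only drops the real comment.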

(Updates, made shortly after posting, follow.)

If you want to be defensive against mistakes in your regex or in your understanding of the syntax you are trying to parse, then you can add \G(?: and ) around the regex in order to prevent the possibility of it just skipping over unhandled stuff. You can then also specifically match "end of string" for similar reasons. I think the "(.)" case is simple enough that I have little worry of getting that part of the regex wrong and it serves the "misunderstood syntax" and "don't skip bits, including at end of string" purposes well enough.
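A sketch of what the \G(?: ... ) wrapping might look like, assuming the same alternatives as the regex above. The sub name strip_comments is hypothetical, the inner groups are non-capturing so that the catch-all stays group 3, and $-[0] is used for the offset of the current match. With \G, each match has to start exactly where the previous one ended, so if an alternative were missing the substitution would stop there rather than quietly skip ahead.

```perl
use strict;
use warnings;

# Hypothetical wrapper; the alternatives mirror the regex above.
sub strip_comments {
    my ($src) = @_;
    $src =~ s{
        \G(?:                           # must start where the last match ended
            (\s*/[*].*?[*]/\s*)         # group 1: /* comment */
          | \s*//[^\n]*                 # // comment
          | (                           # group 2: something to keep:
                '(?:[^\\']+|\\.)*'      #   '\t'
              | "(?:[^\\"]+|\\.)*"      #   "string"
              | /(?![/*])               #   Non-comment /
              | [^'"/]+                 #   Other code
            )
          | (.)                         # group 3: a syntax error
        )
    }{
        if (defined $3) {
            warn "Ignoring syntax error ($3) at byte ", $-[0], "\n";
        }
        $1          ? ' ' :             # comment => single space
        defined $2  ? $2  :             # keep non-comment as-is
        defined $3  ? $3                # keep syntax error as-is
                    : ''                # end-of-line comment => nothing
    }gsex;
    return $src;
}

# Made-up input: a division, a block comment, a string that looks like
# a comment, and an end-of-line comment.
print strip_comments(qq{a = b / c;/* gone */x = "// kept";// dropped\n});
```

Because the (.) alternative already consumes any character the other alternatives reject, \G here is only a safety net against mistakes in the rest of the pattern.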

- tye

In reply to Re: Regex to strip comments (match strings) by tye
in thread Regex to strip comments by zuma53