Extracting C Style Comments Revised (JavaScript)

Incognito has asked for the wisdom of the Perl Monks concerning the following question:

Introduction

This is the code taken from the book Mastering Regular Expressions, used to remove all comments from a file (stored in a string).

$data =~ s{                 # First, we'll list things we want 
                            # to match, but not throw away
  (
    [^"'/]+                                # other stuff
    |                                      # -or-
    (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+  # double quoted string
    |                                      # -or-
    (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+  # single quoted constant
  )
    
  |
   
  # or we'll match a comment. Since it's not in the
  # $1 parentheses above, the comments will disappear
  # when we use $1 as the replacement text.
  
    /                       # (all comments start with a slash)
    (?:                           
    \*[^*]*\*+(?:[^/*][^*]*\*+)*/ # traditional C comments
    |                             # -or-
    /[^\n]*                       # C++ //-style comments
    )
      
}{$1}gsx;

Problem

When a JavaScript file contains regular expressions, say a replace statement on a string:

// Here are some comments
var strText = "fee fi fo fum"; // More
strText = strText.replace (/fee/i, "pee"); // More
alert (strText); /* More */

problems can arise with this parsing mechanism. The problem? Regular expressions with quotes. For example:

// Here are some comments
var strText = "My \"big\" example"; // More
strText = strText.replace (/"/gi, "'"); // More
alert (strText); /* More */

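If the JavaScript being parsed contains a quote inside a regex literal, the pattern treats that quote as the start of a double-quoted or single-quoted string. Everything up to the next matching quote is then kept as string text, so comments in that stretch are left in the file (which is not what we want).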
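The failure can be reproduced with a small, made-up input. The sketch below wraps the regex from the Introduction in a hypothetical helper, `strip_with_original` (the name and the two-line input are assumptions for illustration, not part of the original post). The quote inside `/"/g` opens a "string" that runs to the next double quote, which swallows the first comment.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex from the Introduction (pattern unchanged).
sub strip_with_original {
    my ($data) = @_;
    no warnings 'uninitialized';   # $1 is undef whenever the comment branch matches
    $data =~ s{
      (
        [^"'/]+                                # other stuff
        |
        (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+  # double quoted string
        |
        (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+  # single quoted constant
      )
      |
        /
        (?:
        \*[^*]*\*+(?:[^/*][^*]*\*+)*/          # traditional C comments
        |
        /[^\n]*                                # C++ //-style comments
        )
    }{$1}gsx;
    return $data;
}

# Made-up input: a quote inside a regex literal, followed by a comment.
my $js  = qq{s = s.replace(/"/g, "x"); // c1\nt = "y"; // c2\n};
my $out = strip_with_original($js);

# The "string" starting at the quote inside /"/g runs up to the opening
# quote of "x", so the first comment is treated as string text and survives.
print $out;
```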

Question

How do we modify this beautiful comment-extracting regex, which generally works on 75%+ of the JavaScript out there, to handle regular expression literals passed to methods such as split(), match(), replace(), search() and test()?

Ultimately, the goal is to parse this simple JavaScript file, which contains comments and functions that use regular expressions as mentioned above, and leave only pure code (no comments):

Sample JavaScript File

function BadQuoteTest (strInput) {
    strInput = strInput.replace(/"/gi, "'"); // aka aaa;
/*
    This stuff is commented out so should be parsed out
    strInput = strInput.replace(/x/gi, "y"); // aka bbb;
*/

    return (strInput);
}

function splitTest (strInput) {
    var pattern = /\s*;\s*/gi;
    return (strInput.split (pattern));    // Test
}

function splitTestWithLimit (strInput) {
    return (strInput.split (/\"/gi, 3));    // Test
}

function matchTest () {
    var strText = 'Cool text';
    strText = strText.match (/oo/gi);    // Test
    alert (strText);
}

function searchTest () {
    var strText = "Search text";    // Test
    strText = strText.search (/x/gi);    // Test
    alert (strText);
}


Replies are listed 'Best First'.
(tye)Re: Extracting C Style Comments Revised (JavaScript) by tye (Sage) on Oct 23, 2001 at 23:05 UTC
Don't use a single regex. Have a regex for each type of item and parse them out as you go (stealing from myself):

while(  $code !~ m#\G$#gc  ) {
    if(  $code =~ m#\G//(.*)\n#gc  ) {
        # $1 is end-of-line comment
    } elsif(  $code =~ m#\G"((?:[^"\\]|\\.)*)"#gc  ) {
        # $1 is the inside double quotes
    } elsif(  $code =~ m#\G'((?:[^'\\]|\\.)*)'#gc  ) {
        # $1 is the inside single quotes
    } elsif(  $code =~ m#\G/\*(.*?)\*/#gsc  ) {
        # $1 is a comment (/s lets . cross newlines)
    } elsif(  $code =~ m#\G/((?:[^/\\]|\\.)*)/#gc  ) {
        # $1 is a regex
    } elsif(  $code =~ m#\G([^/'"]+)#gc  ) {
        # $1 is "other code"
    } elsif(  $code =~ m#\G/#gc  ) {
        # division, we hope.
    } else {
        # We have hit invalid code?
    }
}

But this can still be fooled, though it is much harder. (:

Update: The only real problem (based on some guesses since I don't know JavaScript syntax) is determining whether a / is starting a regex or is denoting division (I'm assuming that "//" isn't a valid regex and is always an end-of-line comment). This is similar to the problem with parsing Perl, knowing whether the next thing is supposed to be a term or an operator. Perl makes this extra hard because function prototypes can change whether the parser expects a term or operator to follow that function invocation. I doubt JavaScript makes things that hard, so someone who understands the syntax could probably fix up my code to work 100% of the time.

To go beyond this, I'd probably start looking into Parse::RecDescent.

- tye (but my friends call me "Tye")
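One way to attack the term-versus-operator question is to carry a single flag through the `\G` loop. What follows is a minimal sketch under stated assumptions, not a full JavaScript tokenizer: the sub name `strip_comments`, the `$expect_term` flag, and the rule "a slash starts a regex only after punctuation other than `)` or `]`, or after `return`/`typeof`" are all illustrative choices. It does not handle a `/` inside a character class of a regex literal, and other keywords would need adding.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $code with // and /* */ comments removed.
sub strip_comments {
    my ($code) = @_;
    my $out         = '';
    my $expect_term = 1;    # true where a regex literal may start
    pos($code) = 0;

    while (pos($code) < length $code) {
        if ($code =~ m{\G//[^\n]*}gc) {
            # end-of-line comment: dropped, the newline itself is kept
        }
        elsif ($code =~ m{\G/\*.*?\*/}gcs) {
            # block comment: dropped
        }
        elsif ($code =~ m{\G("(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')}gcs) {
            $out .= $1;     # string literal
            $expect_term = 0;
        }
        elsif ($expect_term && $code =~ m{\G(/(?:[^/\\\n]|\\.)+/[a-z]*)}gc) {
            $out .= $1;     # regex literal, including trailing flags
            $expect_term = 0;
        }
        elsif ($code =~ m{\G(\s+)}gc) {
            $out .= $1;     # whitespace leaves the flag alone
        }
        elsif ($code =~ m{\G([A-Za-z_\$][\w\$]*|\d[\w.]*)}gc) {
            $out .= $1;     # identifier, keyword or number
            $expect_term = ($1 eq 'return' || $1 eq 'typeof') ? 1 : 0;
        }
        elsif ($code =~ m{\G(.)}gcs) {
            $out .= $1;     # any other single character
            # after ) or ] a slash means division; elsewhere a term follows
            $expect_term = ($1 eq ')' || $1 eq ']') ? 0 : 1;
        }
    }
    return $out;
}

my $js = qq{var s = "a//b"; // c\nvar r = s.replace(/"/g, "'"); /* d */\nvar q = a/2; // e\n};
print strip_comments($js);
```

The last branch always consumes one character, so the loop cannot stall on unexpected input. Comments are removed but the whitespace before them is kept, so lines that ended in a comment keep a trailing space.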
Re: Extracting C Style Comments Revised (JavaScript) by Fletch (Bishop) on Oct 23, 2001 at 22:34 UTC
You're probably not going to like this, but it's going to be more trouble than you're probably willing to go through to deal with this solely with regexps. Basically, unless you use fancy trickery like `(?{code})` you can't express enough state in a regular expression to deal with arbitrary Javascript. This is the same reason that you can't handle arbitrary (X|HT|SG)ML solely with regexen. If I recalled more than vague snippets I could probably back this up with some mumbo-jumbo about LA(1) and LALR(1) grammars and the like. To do things `right', you'd basically need to write a Javascript parser (or at least a tokenizer) that knows enough about Javascript syntax that it can keep enough state to tell the difference between quotes occurring inside JS regexen and those happening inside of double quoted strings.
Re: Extracting C Style Comments Revised (JavaScript) by Tetramin (Sexton) on Oct 24, 2001 at 00:08 UTC
There is the section that says

# First, we'll list things we want
# to match, but not throw away

Just do that and add

(?:/[^\r\n\*\/]+/)      # Match RegExp
|

after the round bracket ("("). This could work for the above problem. But you cannot make it perfect without further code parsing, e.g. these will still go wrong

.replace(/\//, "")
abc/100 // comment
Re: Re: Extracting C Style Comments Revised (JavaScript) by Incognito (Pilgrim) on Oct 24, 2001 at 00:36 UTC
Incorporating what you've added as input:

New Code

$data =~ s{                 # First, we'll list things we want
                            # to match, but not throw away
  (
    (?:/[^\r\n\*\/]+/)                     # Match RegExp
    |                                      # -or-
    [^"'/]+                                # other stuff
    |                                      # -or-
    (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+  # double quoted string
    |                                      # -or-
    (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+  # single quoted constant
  )

  |

  # or we'll match a comment. Since it's not in the
  # $1 parentheses above, the comments will disappear
  # when we use $1 as the replacement text.

    /                       # (all comments start with a slash)
    (?:
    \*[^*]*\*+(?:[^/*][^*]*\*+)*/ # traditional C comments
    |                             # -or-
    /[^\n]*                       # C++ //-style comments
    )

}{$1}gsx;

Updated

This code does work for the above examples, but does not work for regular expressions containing a '*', for example:

var b=/\s*;\s*/gi;

There should be a way for us to do this, because we want to handle that 99% of code that is out there... without writing a parser... I'm thinking we need to modify the regex in the "# Match RegExp" section further, to ignore \*s and \/s... this may not be easy, and if I figure it out, I'll post it here.
Re: Re: Re: Extracting C Style Comments Revised (JavaScript) by Tetramin (Sexton) on Oct 24, 2001 at 01:32 UTC
Try

(?:/[^\r\n\*\/][^\r\n\/]*/)

Still doesn't work with divisions like abc/100 because it now thinks it's the beginning of a regular expression.
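The division problem is easy to see in isolation. This sketch takes the suggested pattern to be `(?:/[^\r\n\*\/][^\r\n\/]*/)` and applies it to the `abc/100 // comment` line from the earlier reply. The helper name `first_regex_like` is an assumption for illustration.

```perl
use strict;
use warnings;

# The suggested "Match RegExp" pattern, compiled on its own.
my $regex_literal = qr{(?:/[^\r\n\*\/][^\r\n\/]*/)};

# Hypothetical helper: returns the first substring the pattern would
# treat as a regex literal, or undef if there is none.
sub first_regex_like {
    my ($line) = @_;
    return $line =~ /($regex_literal)/ ? $1 : undef;
}

# The division slash is taken as the opening delimiter, and the first
# slash of the // comment as the closing one.
print first_regex_like('abc/100 // comment'), "\n";   # prints "/100 /"
```

Because the match consumes the first slash of `//`, what remains is `/ comment`, which the comment branch no longer recognizes.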
Re: Re: Extracting C Style Comments Revised (JavaScript) by Incognito (Pilgrim) on Oct 24, 2001 at 03:35 UTC

