PerlMonks  

Extracting C Style Comments Revised (JavaScript)

by Incognito (Pilgrim)
on Oct 23, 2001 at 22:08 UTC

Incognito has asked for the wisdom of the Perl Monks concerning the following question:

Introduction

This code, taken from the book Mastering Regular Expressions, removes all comments from a file (stored in a string).

$data =~ s{
    # First, we'll list things we want
    # to match, but not throw away
    (
        [^"'/]+                                  # other stuff
      |                                          # -or-
        (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+    # double quoted string
      |                                          # -or-
        (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+    # single quoted constant
    )
  |
    # or we'll match a comment. Since it's not in the
    # $1 parentheses above, the comments will disappear
    # when we use $1 as the replacement text.
    /                                            # (all comments start with a slash)
    (?:
        \*[^*]*\*+(?:[^/*][^*]*\*+)*/            # traditional C comments
      |                                          # -or-
        /[^\n]*                                  # C++ //-style comments
    )
}{$1}gsx;

Problem

When a JavaScript file contains regular expressions, say a replace statement on a string:

// Here are some comments
var strText = "fee fi fo fum"; // More
strText = strText.replace (/fee/i, "pee"); // More
alert (strText); /* More */
problems can arise with this parsing mechanism. The problem? Regular expressions with quotes. For example:
// Here are some comments
var strText = "My \"big\" example"; // More
strText = strText.replace (/"/gi, "'"); // More
alert (strText); /* More */
If the JavaScript being parsed contains a quote inside a regex literal, the pattern treats that quote as the start of a double- or single-quoted string, so the comments that follow can be left in the file (which is not what we want).
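The failure described above can be reproduced with a small script. This is only a sketch: the two-line JavaScript input is a made-up example (not from the book), and the replacement uses /e with a defined-check purely to avoid "uninitialized" warnings under `use warnings`. The pattern itself is the one quoted in the Introduction.

```perl
use strict;
use warnings;

# Hypothetical input: a regex literal containing a single quote,
# followed on the next line by an ordinary single-quoted string.
my $data = <<'END_JS';
x = s.replace(/'/g, ""); // one
y = 'z'; // two
END_JS

# Same pattern as in the question; only the replacement side differs
# (it returns '' instead of undef when a comment was matched).
(my $stripped = $data) =~ s{
    (
        [^"'/]+
      |
        (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+
      |
        (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+
    )
  |
    /
    (?:
        \*[^*]*\*+(?:[^/*][^*]*\*+)*/
      |
        /[^\n]*
    )
}{ defined $1 ? $1 : '' }gsex;

print $stripped;
```

The quote inside /'/g is paired with the opening quote of 'z', so everything between them, including the first comment, is kept as if it were a string. Only the second comment is removed.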

Question

How do we modify this beautiful comment-extracting regex, which generally works on 75%+ of the JavaScript out there, to handle regex literals passed to methods such as split(), match(), replace(), search() and test()?

Ultimately, the goal is to parse a simple JavaScript file like the one below, which contains comments and functions that use the regex methods mentioned above, and to be left with pure code only (no comments):

Sample JavaScript File

function BadQuoteTest (strInput) {
    strInput = strInput.replace(/"/gi, "'"); // aka aaa;
    /* This stuff is commented out so should be parsed out
    strInput = strInput.replace(/x/gi, "y"); // aka bbb;
    */
    return (strInput);
}

function splitTest (strInput) {
    var pattern = /\s*;\s*/gi;
    return (strInput.split (pattern)); // Test
}

function splitTestWithLimit (strInput) {
    return (strInput.split (/\"/gi, 3)); // Test
}

function matchTest () {
    var strText = 'Cool text';
    strText = strText.match (/oo/gi); // Test
    alert (strText);
}

function searchTest () {
    var strText = "Search text"; // Test
    strText = strText.search (/x/gi); // Test
    alert (strText);
}

Replies are listed 'Best First'.
(tye)Re: Extracting C Style Comments Revised (JavaScript)
by tye (Sage) on Oct 23, 2001 at 23:05 UTC

    Don't use a single regex. Have a regex for each type of item and parse them out as you go (stealing from myself):

    while( $code !~ m#\G$#gc ) {
        if( $code =~ m#\G//(.*)\n#gc ) {
            # $1 is end-of-line comment
        } elsif( $code =~ m#\G"((?:[^"\\]|\\.)*)"#gc ) {
            # $1 is the inside double quotes
        } elsif( $code =~ m#\G'((?:[^'\\]|\\.)*)'#gc ) {
            # $1 is the inside single quotes
        } elsif( $code =~ m#\G/\*(.*?)\*/#gsc ) {
            # $1 is a comment (may span several lines, hence /s)
        } elsif( $code =~ m#\G/((?:[^/\\]|\\.)*)/#gc ) {
            # $1 is a regex
        } elsif( $code =~ m#\G([^/'"]+)#gc ) {
            # $1 is "other code"
        } elsif( $code =~ m#\G/#gc ) {
            # division, we hope.
        } else {
            # We have hit invalid code?
            last;   # nothing matched, so stop rather than loop forever
        }
    }
    But this can still be fooled, though it is much harder. (:

    Update: The only real problem (based on some guesses since I don't know JavaScript syntax) is determining whether a / is starting a regex or is denoting division (I'm assuming that "//" isn't a valid regex and is always an end-of-line comment). This is similar to the problem with parsing Perl, knowing whether the next thing is supposed to be a term or an operator. Perl makes this extra hard because function prototypes can change whether the parser expects a term or operator to follow that function invocation. I doubt JavaScript makes things that hard so someone who understands the syntax could probably fix up my code to work 100% of the time.
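One way to approach the regex-versus-division question is to remember the last significant character that was kept. The following is a rough sketch, not a complete JavaScript tokenizer: the sub name `strip_js_comments` and the heuristic itself (a `/` that follows an identifier character, a closing `)` or `]`, or a string is taken as division; anything else may start a regex) are assumptions made for illustration. It will still guess wrong in some cases, for example a regex placed directly after the keyword `return`.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $code with // and /* */ comments removed.
sub strip_js_comments {
    my ($code) = @_;
    my $out  = '';
    my $last = '';          # last non-whitespace character kept so far
    pos($code) = 0;

    while ( pos($code) < length $code ) {
        my $tok;
        if ( $code =~ m{\G//[^\n]*}gc ) {
            next;           # end-of-line comment: drop it
        }
        elsif ( $code =~ m{\G/\*.*?\*/}gcs ) {
            next;           # block comment: drop it
        }
        elsif ( $code =~ m{\G("(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')}gcs ) {
            $tok = $1;      # string literal, kept verbatim
        }
        elsif ( $last !~ /[\w\$"')\]]/
            and $code =~ m{\G(/(?:[^/\\\n\[]|\\.|\[(?:[^\]\\\n]|\\.)*\])+/\w*)}gc ) {
            $tok = $1;      # regex literal (only tried where division is unlikely)
        }
        else {
            $code =~ m{\G([^/'"]+|.)}gcs;
            $tok = $1;      # other code, or a lone / (division) or stray quote
        }
        $out .= $tok;
        $last = $1 if $tok =~ /(\S)\s*\z/;
    }
    return $out;
}

my $js = qq{s = s.replace(/"/gi, "'"); // aka aaa\nn = abc/100; /* half */\n};
print strip_js_comments($js);
```

Comments and pure whitespace never update `$last`, so a comment sitting between an operand and a `/` does not change the guess.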

    To go beyond this, I'd probably start looking into Parse::RecDescent.

            - tye (but my friends call me "Tye")
Re: Extracting C Style Comments Revised (JavaScript)
by Fletch (Bishop) on Oct 23, 2001 at 22:34 UTC

    You're probably not going to like this, but it's going to be more trouble than you're probably willing to go through to deal with this solely with regexps. Basically, unless you use fancy trickery like (?{code}) you can't express enough state in a regular expression to deal with arbitrary Javascript. This is the same reason that you can't handle arbitrary (X|HT|SG)ML solely with regexen. If I recalled more than vague snippets I could probably back this up with some mumbo-jumbo about LA(1) and LALR(1) grammars and the like.

    To do things `right', you'd basically need to write a Javascript parser (or at least a tokenizer) that knows enough about Javascript syntax that it can keep enough state to tell the difference between quotes occurring inside JS regexen and those happening inside of double quoted strings.

Re: Extracting C Style Comments Revised (JavaScript)
by Tetramin (Sexton) on Oct 24, 2001 at 00:08 UTC
    There is the section that says
    # First, we'll list things we want
    # to match, but not throw away
    Just do that and add
    (?:/[^\r\n\*\/]+/)   # Match RegExp
    |
    after the round bracket ("("). This could work for the above problem. But you cannot make it perfect without further code parsing, e.g. these will still go wrong
    .replace(/\//, "")
    abc/100 // comment

      Incorporating what you've added as input:

      New Code

      $data =~ s{
          # First, we'll list things we want
          # to match, but not throw away
          (
              (?:/[^\r\n\*\/]+/)                       # Match RegExp
            |                                          # -or-
              [^"'/]+                                  # other stuff
            |                                          # -or-
              (?:"[^"\\]*(?:\\.[^"\\]*)*" [^"'/]*)+    # double quoted string
            |                                          # -or-
              (?:'[^'\\]*(?:\\.[^'\\]*)*' [^"'/]*)+    # single quoted constant
          )
        |
          # or we'll match a comment. Since it's not in the
          # $1 parentheses above, the comments will disappear
          # when we use $1 as the replacement text.
          /                                            # (all comments start with a slash)
          (?:
              \*[^*]*\*+(?:[^/*][^*]*\*+)*/            # traditional C comments
            |                                          # -or-
              /[^\n]*                                  # C++ //-style comments
          )
      }{$1}gsx;

      Updated

      This code does work for the above examples, but does not work for regular expressions containing a '*', for example:
      var b=/\s*;\s*/gi;
      There should be a way for us to do this, because we want to handle that 99% of code that is out there... without writing a parser...

      I'm thinking we need to modify the regex in the "# Match RegExp" section further, to ignore *s and \/s... this may not be easy, and if I figure it out, I'll post it here.

        Try (?:/[^\r\n\*\/][^\r\n\/]*/)

        Still doesn't work with divisions like abc/100 because it now thinks it's the beginning of a regular expression.
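The division problem can be seen by applying the revised "Match RegExp" sub-pattern on its own to the `abc/100 // comment` example from earlier in the thread. This is a small sketch; the variable names are only for illustration.

```perl
use strict;
use warnings;

# The revised sub-pattern suggested above, compiled on its own.
my $regex_literal = qr{/[^\r\n\*\/][^\r\n\/]*/};

my $line = 'abc/100 // comment';
my ($taken) = $line =~ /($regex_literal)/;

# The division slash is taken as the start of a regex, and the first
# slash of the comment marker is consumed as its closing delimiter.
print "matched as regex: [$taken]\n";   # prints: matched as regex: [/100 /]
```

Because the first `/` of `//` is consumed here, what remains is `/ comment`, which no longer looks like a comment to the rest of the substitution.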
