http://www.perlmonks.org?node_id=679356

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have a really quick question. I'm a Perl newbie; I just started a couple of days ago. :-)

Here is my question: how would I remove all comment strings (the double slashes and the comment text) from an arbitrary JavaScript function, like the comment below, without removing things like http://... strings?

// comment string

I know it would be something like $fileString =~ s/ but I'm not sure what the pattern and replacement would be after the s/

Thanks very much for any assistance you may provide.

John

Replies are listed 'Best First'.
Re: Removing javascript comments
by Joost (Canon) on Apr 09, 2008 at 22:30 UTC
    This is actually fairly complex to do correctly, since JavaScript has two different comment delimiters, and comment-like constructs inside string and regex literals (which are especially tricky to recognize) must not be mistaken for comments. You may want to use a full JavaScript grammar/parser instead of starting from scratch.

    If you feel like experimenting, take a look at JE::Parser. Probably not for newbies, though...

    Otherwise, you can take advantage of a few facts:

    JavaScript string and regex literals are always single-line. This means that if you go through the code line by line, ignoring string and regex literals, you can safely assume** that any /* .. */ and // ... constructs left over are comments (JavaScript does not allow an empty regex literal, so // never starts a regex). As I mentioned before, regex literals will be tricky, since the JS grammar rules for where a regex is valid (as opposed to a divide operator) are fairly involved. In other words, you will need to distinguish between

        var res = a / 4 /* whatever */
                        ^^^^^^^^^^^^^^ comment

    and

        var res = /bla \/* .*/.exec("stuff");
                       ^^^^^^^ not a comment

    for instance.

    ** in fact you can't, but it's /probably/ good enough.

    update: fixed js code.
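
The idea of skipping string literals so that only real comments are left can be sketched in Perl with a single substitution. This is a minimal sketch, not Joost's code: the name strip_js_comments and the example.com URL are made up for illustration, and it deliberately assumes the input contains no regex literals, so a line like the /bla \/* .*/ example above would still be mangled.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread.
# Assumption: the input contains no JavaScript regex literals.
sub strip_js_comments {
    my ($js) = @_;
    $js =~ s{
        (                           # group 1: a string literal, kept as-is
            "(?:[^"\\\n]|\\.)*"     #   double-quoted, allowing escapes
          | '(?:[^'\\\n]|\\.)*'     #   single-quoted, allowing escapes
        )
      | /\*.*?\*/                   # block comment, may span lines
      | //[^\n]*                    # line comment, up to end of line
    }{ defined $1 ? $1 : '' }gsex;
    return $js;
}

my $js = <<'JS';
var url = "http://example.com/"; // home page
var res = a / 4 /* whatever */;
JS

print strip_js_comments($js);
```

Because the string alternatives are tried at every position before the comment alternatives, the // inside "http://..." is consumed as part of the string and put back unchanged, while a lone divide operator matches nothing and is left alone.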

Re: Removing javascript comments
by renodino (Curate) on Apr 09, 2008 at 23:39 UTC
    You might cheat from Dean Edwards' packer, which is implemented in Perl (and several other languages). It does a great deal more than just strip comments, but you can always run the output back through a beautifier to reconstruct the code layout.

    Otherwise, I'd suggest looking into Text::Balanced, since parsing JS has many of the same gotchas as parsing Perl, notably, regex syntax that can contain stuff that looks like comments.


    Perl Contrarian & SQL fanboy
Re: Removing javascript comments
by tachyon-II (Chaplain) on Apr 10, 2008 at 03:53 UTC

    As noted, to do this properly you really need an HTML parser to extract the JavaScript and then a JavaScript parser to parse it. A regex solution will never be 100% reliable, BUT that said, it is a great way to learn about regexes, as you get to write a regex to do a task and then find out it does unexpected stuff!

    perlre is the reference for regexes. Although s/this/that/ is the usual format, i.e. using / to delimit the regex, you can in fact use just about anything. When you have / symbols to match, you either use a different delimiter or you have to escape the / symbols in the regex with a \, so \/ means match /. Compare the readability of:

    $str =~ s/http:\/\///g;
    # and
    $str =~ s|http://||g;
    $str =~ s#http://##g;
    $str =~ s!http://!!g;

    The first example is a bit harder to read due to the escapes. At least it is to me. Unfortunately, the * character is a regex metacharacter (a quantifier), so you will still need to escape that. To get you started, try:

    $str =~ s!(/\*[^/]+\*/)!>>>> $1 <<<<!g;   # /* comments */
    $str =~ s!(\s//[^/\n]+)!>>>> $1 <<<<!g;   # // comments
    print $str;

    There will be plenty of edge cases not handled by this code, but you can have a great learning experience trying to fix them without breaking other things! The () in the match section just lets you see explicitly what has happened, as we capture the match and do some funky ASCII highlighting in the replacement section. To strip, you just replace with nothing and don't need to capture, but first you need to see what is happening. For debugging you may like to use:

    $str =~ s!(fancy re here)! print "$1\n"; "" !ge;

    That way when you run the code on a file you get the comments stripped but also printed to STDOUT so you can watch and make sure all the stuff that is going looks like comments!
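
A small runnable variation on that debugging idiom is sketched below. It uses the // pattern from above, but collects what was removed in an array instead of printing it; the sample input and the @removed name are made up for illustration.

```perl
use strict;
use warnings;

# Illustrative input: one real comment and one URL that must survive.
my $str = qq{var a = 1; // set a\nvar url = "http://example.com/";\n};

# With /e the replacement is evaluated as Perl code: record the match,
# then return the empty string so the matched text is deleted.
my @removed;
$str =~ s!(\s//[^/\n]+)! push @removed, $1; "" !ge;

print "removed: [$_]\n" for @removed;
print $str;
```

The URL survives here only because its // is preceded by a colon rather than whitespace, which is exactly the kind of edge case worth watching in the list of removed text.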

    Welcome to Perl, hope you have fun and it helps you to get the job done.

      Thank you all very much for your valuable feedback... I almost have my code finished with the exception of the removal of javascript comments, so I will try out all your suggestions and see what happens... the best way to learn of course is to dabble in different methodologies to try and find the optimum solution for the given problem... looking forward to getting in to some cool Perl projects in the future... I'll post again once I have the solution in place... thanks again!
        Hi, I have a similar task and was wondering if you were able to accomplish this?
Re: Removing javascript comments
by deibyz (Hermit) on Apr 10, 2008 at 08:19 UTC
Re: Removing javascript comments
by cparker (Initiate) on Apr 11, 2008 at 19:29 UTC
    Douglas Crockford's JSMin removes comments as part of the minification process. There's a Perl implementation of JSMin on CPAN. You could study how this implementation removes comments and try to duplicate it. (Of course, if your license is compatible with Perl's license, you could just lift the relevant portions from the module, but then you wouldn't be learning anything. :)