PerlMonks  

matching comments

by Eugene (Scribe)
on Apr 22, 2000 at 00:29 UTC ( [id://8572]=perlquestion )

Eugene has asked for the wisdom of the Perl Monks concerning the following question:

I need to match arbitrary comments. The user will specify the starting and ending comment delimiters, and they will be put into variables like
$var1="/*" and $var2="*/".
The problem is that the delimiters can be of any length, as in HTML, and a comment can span multiple lines. It's also hard to match something like **/. The C-style comments above are matched with /\*([^*]|\*+[^/*])*\*+/ but how would I make it work for anything? Thanks

Replies are listed 'Best First'.
Re: matching comments
by Eugene (Scribe) on Apr 22, 2000 at 00:33 UTC
    The C-style comment-catching expression is " /\*([^*]|\*+[^/*])*\*+/ " (the leading and trailing slashes are part of the pattern, not regex delimiters).
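As a quick check of that expression, here is a minimal sketch that applies it to a string containing the awkward `**/` ending. The sample text and the variable names are assumptions for illustration, and the group is written as non-capturing so that only the whole comment is returned.

```perl
use strict;
use warnings;

# The pattern from the post, written with qr{...} so the literal
# slashes need no escaping. (?:...) replaces the capturing group.
my $c_comment = qr{/\*(?:[^*]|\*+[^/*])*\*+/};

# Hypothetical sample text, not from the thread.
my $text = "int x; /* a **/ int y; /* multi\n * line */ int z;";

# In list context, /g returns every captured comment.
my @comments = $text =~ /($c_comment)/g;
print "$_\n" for @comments;
```

The pattern is tied to the `/*` and `*/` delimiters, which is why it does not generalize to user-supplied ones.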
Re: matching comments
by merlyn (Sage) on Apr 25, 2000 at 22:57 UTC
    UNTESTED, but a lot of my stuff works without testing.... :-)
    my $start = "/*";
    my $end = "*/";
    my $inside = 0;
    my $oldpos = 0;
    $_ = "your text /* goes here */ and here";
    while (/(\Q$start\E|\Q$end\E)/g) {
      if ($1 eq $start) {
        if (++$inside == 1) {
          $oldpos = pos($_) - length($start);
        }
      } else {
        if (--$inside == 0) {
          print substr($_, $oldpos, pos()-$oldpos);
        }
      }
    } 
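The same depth-counting idea can be wrapped in a subroutine that returns every outermost comment instead of printing it. This is only a sketch: the name `top_level_comments` is made up here, and ignoring a closing delimiter that has no opener is an assumption the original code does not make.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns every outermost
# comment in $text, treating $start/$end as nestable delimiters.
sub top_level_comments {
    my ($text, $start, $end) = @_;
    my ($depth, $begin, @found) = (0, 0);
    while ($text =~ /(\Q$start\E|\Q$end\E)/g) {
        if ($1 eq $start) {
            # remember where the outermost comment opens
            $begin = pos($text) - length($start) if ++$depth == 1;
        }
        elsif ($depth > 0) {
            # a closer that brings the depth back to 0 completes a comment
            push @found, substr($text, $begin, pos($text) - $begin)
                if --$depth == 0;
        }
        # assumption: a stray closer with no opener is ignored
    }
    return @found;
}

print "$_\n" for top_level_comments(
    "a /* outer /* mid */ /* mid */ outer */ b /* two */", "/*", "*/");
```

Keeping the scan to a single pass over the delimiters is what makes this approach cheap: the text between delimiters is never inspected.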
    
Re: matching comments
by perlmonkey (Hermit) on Apr 23, 2000 at 05:11 UTC
    This should do the trick for arbitrary comment flags:
    /\Q$start\E(.*?)\Q$end\E/sg

    I wrote a small test program:
    #!/usr/bin/perl
    # get start and end comment
    my $start = $ARGV[0];
    my $end = $ARGV[1] || "\n";

    open(FILE, "test.txt") || die;
    {
        local $/ = undef;  # set to 'slurp' mode
        $_ = <FILE>;       # read entire file into $_
    }
    close FILE;

    #
    # print all comments that are matched in file
    #
    while( /\Q$start\E(.*?)\Q$end\E/sg ) {
        print $&, "\n";
    }
    For a test.txt I used:
    blah blah blah blah blah *** multi line comment **!! blah blah blah blah blah blah blah blah blah blah blah *** inline comment **!! blah blah /* c comment 1 */ blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah /* * c comment 2 */ blah blah // c++ comment 1 blah blah // c++ comment 2
    For my execution results I got (my exe is called regex.pl):
    prompt$ regex.pl '***' '**!!'
    *** multi line comment **!!
    *** inline comment **!!
    prompt$ regex.pl '/*' '*/'
    /* c comment 1 */
    /* * c comment 2 */
    prompt$ regex.pl '//'
    // c++ comment 1
    // c++ comment 2
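For trying the pattern without a test.txt, the same non-greedy match can be sketched against an in-memory string. The subroutine name `find_comments` and the sample text are assumptions. The match is captured into `$1` rather than read from `$&`, which perlvar documents as carrying a performance penalty in older Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every non-nested comment delimited by
# $start and $end. \Q...\E quotes regex metacharacters in the
# delimiters, /s lets . cross newlines, and .*? stops at the first $end.
sub find_comments {
    my ($text, $start, $end) = @_;
    $end = "\n" unless defined $end && length $end;  # line-comment default
    my @found;
    while ($text =~ /(\Q$start\E.*?\Q$end\E)/sg) {
        push @found, $1;
    }
    return @found;
}

# Hypothetical sample text, not from the thread.
my $sample = "x = 1; /* first */ y = 2; // trailing\nz = 3; <!-- html\nstyle -->\n";
print "[$_]\n" for find_comments($sample, "/*", "*/");
print "[$_]\n" for find_comments($sample, "<!--", "-->");
```

Because `.*?` stops at the first closing delimiter, this sketch shares the limitation of the one-liner: nested comments are cut short.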
      The problem with this--and I don't know if it's actually going to be a problem for the OP, but in general, it might be--is that this will catch comments inside quoted strings. For example:
      char * comptr = "Comment: /* In comment. */";
      Your regular expression will match this, but it isn't actually a comment.

      Again, this may not be an issue for the OP, but if it is, you should take a look at the FAQ entry "How do I use a regular expression to strip C style comments from a file?"; perhaps you can extend it to your uses.
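One common way to avoid matching inside quoted strings is to let the regex consume string literals first and only capture real comments. The following is a minimal sketch of that idea, not the FAQ's code: the name `comments_outside_strings` is made up, and it assumes C-like double-quoted strings with backslash escapes.

```perl
use strict;
use warnings;

# Hypothetical helper: double-quoted string literals are matched by the
# first alternative and discarded, so delimiters inside them are skipped.
# Assumes C-like strings with backslash escapes; character literals and
# other quoting styles would need their own alternatives.
sub comments_outside_strings {
    my ($text, $start, $end) = @_;
    my @found;
    while ($text =~ /"(?:\\.|[^"\\])*"|(\Q$start\E.*?\Q$end\E)/sg) {
        push @found, $1 if defined $1;   # $1 is undef when a string matched
    }
    return @found;
}

my $code = 'char *p = "Comment: /* In comment. */"; /* real comment */';
print "$_\n" for comments_outside_strings($code, "/*", "*/");
```

The ordering matters: whichever construct starts first in the text wins, so a quote character inside a genuine comment is still treated as part of that comment.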

Re: matching comments
by perlmonkey (Hermit) on Apr 24, 2000 at 10:55 UTC
    Now I think I should say that I am not aware of any compiler that will compile code with nested comments, so this is probably not a big problem.

    However I played around and this seems to do the trick: (just replace the while loop in my code above with this one.)
    while( $file =~ /\Q$start\E(.*?)\Q$end\E/sg ) {
        $a = $1;
        $match = $&;
        # look for more start tags in what we matched
        while( $a =~ /\Q$start\E/sg ) {
            # balance the ending comments
            $file =~ /.*?\Q$end\E/sg;
            $match .= $&;
        }
        print $match, "\n";
    }
    For your test file I got what you wanted.

    For other tests I used this test.txt:
    blah blah /* comment 1 */ blah blah /* comment 2 */ blah blah /* outer /* mid /* center */ mid */ outer */
    And here are my results:
    prompt$ regex.pl '/*' '*/'
    /* comment 1 */
    /* comment 2 */
    /* outer /* mid /* center */ mid */ outer */
    So enjoy this fanciful result.

    I hope this helps.
      That fails on
      /* outer /* mid */ /* mid */ outer */
      Try:
      ($re = $_) =~ s/((\Q$start\E)|(\Q$end\E)|.)/${['(','']}[!$2]\Q$1\E${[')','']}[!$3]/gs;
      $re = join '|', map quotemeta, eval { /$re/ };
      warn $@ if $@ =~ /unmatched/;
      print join "\n", /($re)/g, "";
Re: matching comments
by perlmonkey (Hermit) on Apr 25, 2000 at 23:24 UTC
    I just tested the code above, and it does indeed work.

    Out of curiosity I benchmarked the two solutions, and merlyn's is twice as fast! (nice).

    However, I would consider mine easier to follow, but maybe that is just because I wrote it. I think I will have to look into using substr and pos for performance reasons, though.
    Here is my test code:
    #!/usr/bin/perl
    use Benchmark;

    # get start and end comment
    my $start = $ARGV[0];
    my $end = $ARGV[1] || "\n";
    my $file;
    open(FILE, "test.txt") || die;
    {
        local $/ = undef;  # set to 'slurp' mode
        $file = <FILE>;    # read entire file into $file
    }
    close FILE;

    timethese(100000, {
        'parse1' => sub { &parse1($file) },
        'parse2' => sub { &parse2($file) },
    });

    sub parse1 {
        my $file = shift;
        while( $file =~ /\Q$start\E(.*?)\Q$end\E/sg ) {
            $a = $1;
            $match = $&;
            # look for more start tags in what we matched
            while( $a =~ /\Q$start\E/sg ) {
                # balance the ending comments
                $file =~ /.*?\Q$end\E/sg;
                $match .= $&;
            }
            #print $match, "\n";
        }
        return $match;
    }

    sub parse2 {
        my $file = shift;
        my $inside = 0;
        my $oldpos = 0;
        while ($file =~ /(\Q$start\E|\Q$end\E)/g) {
            if ($1 eq $start) {
                if (++$inside == 1) {
                    $oldpos = pos($file) - length($start);
                }
            } else {
                if (--$inside == 0) {
                    return substr($file, $oldpos, pos($file)-$oldpos);
                }
            }
        }
    }
    And here are my results (I used the same test.txt as my post above):
    prompt% parse.pl '/*' '*/'
    Benchmark: timing 100000 iterations of parse1, parse2...
        parse1: 42 wallclock secs (30.07 usr + 0.11 sys = 30.18 CPU)
        parse2: 17 wallclock secs (13.60 usr + 0.04 sys = 13.64 CPU)
Re: matching comments
by Eugene (Scribe) on Apr 28, 2000 at 21:49 UTC
    Here is another issue: merlyn's program does not handle an escaped comment delimiter, as in: Some text /*comment \*/ more comment*/

    In fact it completely weeds the escape character out.

    Any ways around it?
      Quite true. Look at perlre for the comments on the "?<!" operator (zero-width negative lookbehind assertion operator).

      This will fix merlyn's code (where $file is the text you are parsing):
      my $file = shift;
      my $inside = 0;
      my $oldpos = 0;
      while ($file =~ /(?<!\\)(\Q$start\E|\Q$end\E)/g) {
          if ($1 eq $start) {
              if (++$inside == 1) {
                  $oldpos = pos($file) - length($start);
              }
          } else {
              if (--$inside == 0) {
                  print substr($file, $oldpos, pos($file)-$oldpos);
              }
          }
      }
      The (?<!\\) in front of the alternation makes it match $start or $end only when the delimiter is not preceded by a '\' character.
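A single-character lookbehind cannot tell an escaped delimiter from a delimiter that follows an escaped backslash (as in `\\*/`). One possible alternative, sketched here under the assumption that backslash is the only escape character and that the delimiters themselves contain none, is to consume every backslash-escaped character before the delimiters are considered. The name `comments_with_escapes` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical variant of the depth counter: any backslash-escaped
# character is consumed by the first alternative, so an escaped
# delimiter never reaches the counting branch.
sub comments_with_escapes {
    my ($text, $start, $end) = @_;
    my ($depth, $begin, @found) = (0, 0);
    while ($text =~ /\\.|(\Q$start\E|\Q$end\E)/sg) {
        next unless defined $1;              # skipped an escaped character
        if ($1 eq $start) {
            $begin = pos($text) - length($start) if ++$depth == 1;
        }
        elsif ($depth > 0 && --$depth == 0) {
            push @found, substr($text, $begin, pos($text) - $begin);
        }
    }
    return @found;
}

print "$_\n" for comments_with_escapes(
    'Some text /*comment \*/ more comment*/', "/*", "*/");
```

The returned text keeps the backslashes as they appear in the input; stripping them would be a separate step.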
Re: matching comments
by Eugene (Scribe) on Apr 24, 2000 at 07:53 UTC
    I am not worried about comments inside quotes, but what about nested comments?
    For my test.txt I used:

    blah blah blah blah blah /*
    multi /* bla */
    line
    comment
    */

    and the result was :
    /*
    multi /* bla */

    Any idea on how to handle those so the result would be like
    /*
    multi /* bla */
    line
    comment
    */

    Thanks, Eugene
Re: matching comments
by Eugene (Scribe) on Apr 25, 2000 at 01:20 UTC
    thanks for your time. This is what I needed.
    Eugene
