PerlMonks  

Multiline string and one line comments

by AskandLearn (Initiate)
on Apr 16, 2014 at 05:17 UTC ( [id://1082423]=perlquestion )

AskandLearn has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone

Another Perl regex question to match comments and strings

Requirement/regex spec/what should be matched and what should not

Double- or single-quoted strings are not strings if they are inside comments

A # inside a string does not start a comment

Here is an example. Strings and comments need to be captured; HTML-style tags will be added later to highlight them.

    # this is a comment, should be matched.
    # "I am not a string" . 'because I am inside a comment'
    my $string = " #I am not a comment, because I am quoted";
    my $another_string = "I am a multiline string with
    # on each line
    #, have fun!";

I tried few things, but could not work out a solution that covers all the situations.

Replies are listed 'Best First'.
Re: Multiline string and one line comments
by kcott (Archbishop) on Apr 16, 2014 at 05:53 UTC

    G'day AskandLearn,

    Welcome to the monastery.

    Firstly, PerlMonks is not a code writing service. It's not OK to just post a "Requirement" and expect us to do your (home)work for you. This is explained in "How (Not) To Ask A Question" (see the Do Your Own Work section).

    "Another Perl regex question ..."

    This suggests you've previously asked such a question; however, this is your first post as AskandLearn. If you have other user accounts, please read "Site Rules Governing User Accounts" and follow the instructions therein.

    "I tried few things, ..."

    You need to show us: the guidelines in "How do I post a question effectively?" provide details.

    Without knowing what you're having trouble with, it's difficult to formulate a response which helps you. Are you an absolute beginner? Do you have reasonable knowledge but have encountered some difficulty which is causing problems?

    As an interim answer, search for "regular expression" in "the perl manpage", and follow whatever links seem most appropriate: there are several to be found in both the Tutorials and Reference sections (and more in the Internals section). perlre has an example of matching a double-quoted string which may help.

    -- Ken
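As a starting point for the double-quoted string case mentioned above, here is a minimal sketch. The pattern and the variable names ($dq_string, $line, $found) are illustrative assumptions, not the exact example from perlre: it accepts either an ordinary character or a backslash followed by any character between the quotes.

```perl
use strict;
use warnings;

# Assumed pattern: a double quote, then any run of non-quote/non-backslash
# characters or backslash-escaped characters, then the closing double quote.
my $dq_string = qr{ " (?: [^"\\] | \\. )* " }xs;

# Sample line (made up for illustration): the '#' inside the quotes
# is part of the string, not the start of a comment.
my $line = q{my $s = "say \"hi\" # not a comment"; # real comment};

my ($found) = $line =~ m{ ($dq_string) }x;
print "$found\n";
```

Because the leftmost match wins, the string is consumed before the regex engine ever looks at the # inside it.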

        Apparently you are wrong.

      I realized that this is a bit too hard using a regex, because I need to know which of those three characters appears first and recheck that each time I find a string or comment.

      I went back to plain scripting, and it is actually much easier to just use the index and substr functions. Here is my code. Code writing service? Not for me.
          #!/usr/bin/env perl
          use strict;
          use warnings;

          my $src = do {local $/; <DATA>};
          my @strings = ();
          my @comments = ();
          my $off_set = 0;
          my $end_index = 0;

          while (my ($char, $start_index) = &next_char($off_set)) {
              last if ($char eq "" && $start_index == -1);
              if ($char eq '#') {
                  $end_index = index $src, "\n", $start_index + 1;
                  push @comments, substr($src, $start_index, $end_index-$start_index+1);
                  $off_set = $end_index + 1;
              } elsif (($char eq '"') || ($char eq "'")) {
                  &capture_string($char, $start_index, $end_index);
              }
          }

          sub capture_string($ $ $) {
              my $quote = shift;
              my $start_index = shift;
              my $end_index = shift;
              $end_index = index ($src, $quote, $start_index+1);
              my $char_before = substr $src, $end_index-1, 1;
              while ($end_index > 0 && $char_before eq '\\') {
                  $end_index = index $src, $quote, $end_index + 1;
                  $char_before = substr $src, $end_index-1, 1;
              }
              push @strings, substr($src, $start_index, $end_index-$start_index+1);
              $off_set = $end_index + 1;
          }

          print "[Strings]\n";
          foreach my $item (@strings) {
              print "$item\n";
          }
          print "[Comments]\n";
          foreach my $item (@comments) {
              print "$item";
          }

          sub next_char {
              my %has;
              my $position = shift;
              my $s_index = index $src, "'", $position;
              my $d_index = index $src, '"', $position;
              my $c_index = index $src, '#', $position;
              return ("", -1) if ($s_index == -1 && $d_index == -1 && $c_index == -1);
              $has{$s_index} = "'" if ($s_index >= 0);
              $has{$d_index} = '"' if ($d_index >= 0);
              $has{$c_index} = '#' if ($c_index >= 0);
              my @sorted_keys = sort { $a <=> $b} keys %has;
              # print "Next char is $has{$sorted_keys[0]}, and position is $sorted_keys[0]\n";
              return ($has{$sorted_keys[0]}, $sorted_keys[0]);
          }

          __DATA__
          # this is a comment, should be matched.
          #
          # "I am not a string" . 'because I am inside a comment'
          my $string = " #I am not a comment, because I am quoted";
          my $another_string = "I am a multiline string with
          # on each line
          #, have fun!";
          my $descap_string = "I am a \ escaped \" \"string"; # and some comments;
          my $sescap_string = 'I am a \ escaped \' \'string'; # and some comments;
          my $empty_d ="";
          my $empty_s ='';
      And here is the result I wanted
          [Strings]
          " #I am not a comment, because I am quoted"
          "I am a multiline string with
          # on each line
          #, have fun!"
          "I am a \ escaped \" \"string"
          'I am a \ escaped \' \'string'
          ""
          ''
          [Comments]
          # this is a comment, should be matched.
          #
          # "I am not a string" . 'because I am inside a comment'
          # and some comments;
          # and some comments;

        Here's a regex-based approach. (I agree, however, that a parsing approach may be more appropriate.) It doesn't handle single-quoted strings, but should be easily extensible to cover such. I'm not sure it gives you exactly what you want, but I think it comes close. The critical (IMHO) regexes require Perl version 5.10+.

            use warnings;
            use strict;

            use Test::More
                # tests => ?? + 1  # Test::NoWarnings adds 1 test
                'no_plan'
                ;
            use Test::NoWarnings;

            use constant TEST1 => <<'EOT';
            # this is a comment, should be matched.
            # "I am not a string" . 'because I am inside a comment'
            my $string = " #I am not a \comment, because I am \" quoted";
            my $another_string = "I am a multiline string with
            # on each \t line
            #, have fun!";
            EOT
            # print qq{[[${ \TEST1 }]] \n\n};  # FOR DEBUG

            use constant C1 => '# this is a comment, should be matched.';
            use constant C2 => q{# "I am not a string" . 'because I am inside a comment'};
            use constant S1 => q{" #I am not a \comment, because I am \" quoted"};
            use constant S2 => q{"I am a multiline string with
            # on each \t line
            #, have fun!"};

            # these regexes compatible with 5.8 (and prior? 5.0?)
            my $comment = qr{ [#] [^\n]* $ }xms;
            my $string  = qr{ " [^"\\]* (?: \\. [^"\\]*)* " }xms;
            my $comment_or_string = qr{ $comment | $string }xms;

            # these regexes require 5.10+
            my $comment_only = qr{ $comment | $string (*SKIP) (*FAIL) }xms;
            my $string_only  = qr{ $string | $comment (*SKIP) (*FAIL) }xms;

            VECTOR:
            for my $ar_vector (
                [ TEST1, $comment_or_string, C1, C2, S1, S2, ],
                [ TEST1, $comment_only, C1, C2, ],
                [ TEST1, $string_only, S1, S2, ],
                ) {

                if (not ref $ar_vector) {   # must be a note...
                    note $ar_vector;
                    next VECTOR;
                    }

                my ($text, $rx, @expected) = @$ar_vector;
                is_deeply [ $text =~ m{ $rx }xmsg ], \@expected,
                    # qq{},
                    ;
                }  # end for VECTOR
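To isolate what the (*SKIP) (*FAIL) pair contributes in the reply above, here is a small standalone sketch. The sample text and the names $code and @comments are assumptions made for illustration. The idea is to let the regex match a whole string, then force a failure and tell the engine not to retry from inside the text it just consumed.

```perl
use strict;
use warnings;
use 5.010;    # backtracking control verbs need Perl 5.10 or later

my $comment = qr{ [#] [^\n]* }x;
my $string  = qr{ " [^"\\]* (?: \\. [^"\\]* )* " }xs;

# Strings are matched and then rejected; (*SKIP) makes the next attempt
# start after the rejected string, so a '#' inside it is never tried.
my $comment_only = qr{ $string (*SKIP) (*FAIL) | $comment }x;

# Made-up sample: one '#' inside a string, one real trailing comment.
my $code = qq{my \$s = "# not a comment"; # a comment\n};

my @comments = $code =~ m{ ($comment_only) }xg;
print "$_\n" for @comments;
```

Without the verbs, a plain $comment pattern would happily start matching at the # inside the quoted string.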
Re: Multiline string and one line comments
by davido (Cardinal) on Apr 16, 2014 at 14:59 UTC

    This class of problem may be addressed to some degree by the CPAN module, Text::Balanced. But it looks like you may run into the harder problem of parsing Perl. The PPI module can be helpful, though there are cases where even parsing is not as straightforward as one would expect. Regexes are not generally the appropriate solution for things like code parsing or balanced text parsing. You end up working way too hard on a regex solution that still falls short.

    tchrist gave an excellent write-up on StackOverflow on why it is possible but usually inadvisable to use regexes as the primary engine in parsing non-trivial inputs (in the case of the writeup, he was talking about HTML, but the reply is applicable here as well). See Oh Yes You Can Use Regexes to Parse HTML!. It all boils down to this: the amount of work required to get a robust solution using regexes for this sort of thing will usually exceed the amount of work needed to use a proper parsing tool. It may seem like a lot of work learning to use these other tools, but not as much as it often takes to properly deal with all of the edge cases using only regexes.


    Dave
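For the quoted-string part of the problem, a minimal sketch with Text::Balanced might look like the following. It only shows extract_delimited pulling one double-quoted string off the front of a text; the sample text and variable names are assumptions, and it is not a full Perl parser (PPI would be the tool for that).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Made-up sample: a quoted string containing escaped quotes and a '#'.
my $text = q{"I am \"quoted\" # not a comment" . $rest; # comment};

# In list context, extract_delimited returns the extracted delimited
# string first and the remaining text second.
my ($extracted, $remainder) = extract_delimited($text, q{"});

print "string:    $extracted\n";
print "remainder: $remainder\n";
```

The backslash is the default escape character, so the escaped quotes inside the string do not end the extraction early.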

Re: Multiline string and one line comments
by Laurent_R (Canon) on Apr 16, 2014 at 09:13 UTC
    Using regexes for such a job might be asking too much from regexes. I think you should probably consider a parser. You could write a minimal parser yourself (probably using regexes among other things, but the driving idea should be to parse the input, i.e. separate the tokens and analyze them one by one) or use an existing parsing module and just write the grammar and the code to call and use it.

    As an example, you could use Damian Conway's Parse-RecDescent module: http://search.cpan.org/~jtbraun/Parse-RecDescent-1.967009/lib/Parse/RecDescent.pm. There may be some more lightweight packages, but I do not know them.
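The "minimal parser" idea can be sketched as a small hand-written tokenizer that walks the input once and classifies each piece. The sub name tokenize and the token labels are hypothetical, and the sketch assumes only # comments and single- or double-quoted strings matter; it does not understand Perl constructs such as $#array, q{}, heredocs or regex literals.

```perl
use strict;
use warnings;

# Hypothetical helper: split source text into comment, string and other tokens.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while (pos($src) < length $src) {
        # \G anchors each attempt at the current position; /c keeps that
        # position when an alternative fails so the next one can be tried.
        if ($src =~ m{ \G ( [#] [^\n]* ) }xgc) {
            push @tokens, [ comment => $1 ];
        }
        elsif ($src =~ m{ \G ( (["']) (?: \\. | (?!\2) [^\\] )* \2 ) }xsgc) {
            push @tokens, [ string => $1 ];
        }
        elsif ($src =~ m{ \G ( [^#"']+ | . ) }xsgc) {
            # Anything else, including an unterminated quote character.
            push @tokens, [ other => $1 ];
        }
    }
    return @tokens;
}

my $src = <<'EOT';
# a comment with "no string"
my $s = " #not a comment"; # trailing
EOT

for my $t (tokenize($src)) {
    printf "%-8s %s\n", @$t if $t->[0] ne 'other';
}
```

Because every token is consumed in order, the question of which of the three characters comes first is answered by position alone, which is the same idea as the index/substr solution but expressed as a token loop.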

      Yes I agree, took me a while to realize. I got my own solution just by plain scripting. Thanks for pointing it out.

Node Type: perlquestion [id://1082423]
Approved by kcott