Multiline string and one line comments

AskandLearn has asked for the wisdom of the Perl Monks concerning the following question:

Re: Multiline string and one line comments by kcott (Archbishop) on Apr 16, 2014 at 05:53 UTC
G'day AskandLearn,

Welcome to the monastery.

Firstly, PerlMonks is not a code writing service. It's not OK to just post a "Requirement" and expect us to do your (home)work for you. This is explained in "How (Not) To Ask A Question" (see the Do Your Own Work section).

"Another Perl regex question ..."

This suggests you've previously asked such a question; however, this is your first post as AskandLearn. If you have other user accounts, please read "Site Rules Governing User Accounts" and follow the instructions therein.

"I tried few things, ..."

You need to show us: the guidelines in "How do I post a question effectively?" provide details. Without knowing what you're having trouble with, it's difficult to formulate a response which helps you. Are you an absolute beginner? Do you have reasonable knowledge but have encountered some difficulty which is causing problems?

As an interim answer, search for "`regular expression`" in "the perl manpage", and follow whatever links seem most appropriate: there are several to be found in both the Tutorials and Reference sections (and more in the Internals section). perlre has an example of matching a double-quoted string which may help.

-- Ken
Re^2: Multiline string and one line comments by Anonymous Monk on Apr 16, 2014 at 07:09 UTC
"Another Perl regex question ..." ... This suggests you've previously asked such a question; however

Apparently it's a copy/paste crosspost, also posted to http://stackoverflow.com/questions/23100492/match-comment-but-not-inside-a-string. I cross-link cross-postings for maximum collaboration efficiency.
Re^3: Multiline string and one line comments by AskandLearn (Initiate) on Apr 18, 2014 at 02:45 UTC
Apparently you are wrong.
Re^4: Multiline string and one line comments by Anonymous Monk on Apr 18, 2014 at 02:48 UTC
Re^5: Multiline string and one line comments by AskandLearn (Initiate) on Apr 18, 2014 at 03:13 UTC
Re^2: Multiline string and one line comments by AskandLearn (Initiate) on Apr 18, 2014 at 02:50 UTC
I realized that this is a bit too hard using a regex, because I need to know which one of those three characters appears first and recheck that each time I find a string or comment. I went back to plain scripting, and it is actually much easier to just use the `index` and `substr` functions. Here is my code. Code writing service? Not for me.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $src = do {local $/; <DATA>};
    my @strings = ();
    my @comments = ();
    my $off_set = 0;
    my $end_index = 0;

    while (my ($char, $start_index) = &next_char($off_set)) {
        last if ($char eq "" && $start_index == -1);
        if ($char eq '#') {
            $end_index = index $src, "\n", $start_index + 1;
            push @comments, substr($src, $start_index, $end_index-$start_index+1);
            $off_set = $end_index + 1;
        } elsif (($char eq '"') || ($char eq "'")) {
            &capture_string($char, $start_index, $end_index);
        }
    }

    sub capture_string($ $ $) {
        my $quote = shift;
        my $start_index = shift;
        my $end_index = shift;
        $end_index = index ($src, $quote, $start_index+1);
        my $char_before = substr $src, $end_index-1, 1;
        while ($end_index > 0 && $char_before eq '\\') {
            $end_index = index $src, $quote, $end_index + 1;
            $char_before = substr $src, $end_index-1, 1;
        }
        push @strings, substr($src, $start_index, $end_index-$start_index+1);
        $off_set = $end_index + 1;
    }

    print "[Strings]\n";
    foreach my $item (@strings) {
        print "$item\n";
    }
    print "[Comments]\n";
    foreach my $item (@comments) {
        print "$item";
    }

    sub next_char {
        my %has;
        my $position = shift;
        my $s_index = index $src, "'", $position;
        my $d_index = index $src, '"', $position;
        my $c_index = index $src, '#', $position;
        return ("", -1) if ($s_index == -1 && $d_index == -1 && $c_index == -1);
        $has{$s_index} = "'" if ($s_index >= 0);
        $has{$d_index} = '"' if ($d_index >= 0);
        $has{$c_index} = '#' if ($c_index >= 0);
        my @sorted_keys = sort { $a <=> $b} keys %has;
        # print "Next char is $has{$sorted_keys[0]}, and position is $sorted_keys[0]\n";
        return ($has{$sorted_keys[0]}, $sorted_keys[0]);
    }

    __DATA__
    # this is a comment, should be matched.
    #
    # "I am not a string" . 'because I am inside a comment'
    my $string = " #I am not a comment, because I am quoted";
    my $another_string = "I am a multiline string with
    # on each line
    #, have fun!";
    my $descap_string = "I am a \ escaped \" \"string"; # and some comments;
    my $sescap_string = 'I am a \ escaped \' \'string'; # and some comments;
    my $empty_d ="";
    my $empty_s ='';

And here is the result I wanted:

    [Strings]
    " #I am not a comment, because I am quoted"
    "I am a multiline string with
    # on each line
    #, have fun!"
    "I am a \ escaped \" \"string"
    'I am a \ escaped \' \'string'
    ""
    ''
    [Comments]
    # this is a comment, should be matched.
    #
    # "I am not a string" . 'because I am inside a comment'
    # and some comments;
    # and some comments;
Re^3: Multiline string and one line comments by AnomalousMonk (Archbishop) on Apr 18, 2014 at 04:03 UTC
Here's a regex-based approach. (I agree, however, that a parsing approach may be more appropriate.) It doesn't handle single-quoted strings, but should be easily extensible to cover such. I'm not sure it gives you exactly what you want, but I think it comes close. The critical (IMHO) regexes require Perl version 5.10+.

    use warnings;
    use strict;

    use Test::More
        # tests => ?? + 1  # Test::NoWarnings adds 1 test
        'no_plan'
        ;
    use Test::NoWarnings;

    use constant TEST1 => <<'EOT';
    # this is a comment, should be matched.
    # "I am not a string" . 'because I am inside a comment'
    my $string = " #I am not a \comment, because I am \" quoted";
    my $another_string = "I am a multiline string with
    # on each \t line
    #, have fun!";
    EOT
    # print qq{[[${ \TEST1 }]] \n\n};  # FOR DEBUG

    use constant C1 => '# this is a comment, should be matched.';
    use constant C2 => q{# "I am not a string" . 'because I am inside a comment'};
    use constant S1 => q{" #I am not a \comment, because I am \" quoted"};
    use constant S2 => q{"I am a multiline string with
    # on each \t line
    #, have fun!"};

    # these regexes compatible with 5.8 (and prior? 5.0?)
    my $comment = qr{ [#] [^\n]* $ }xms;
    my $string  = qr{ " [^"\\]* (?: \\. [^"\\]*)* " }xms;
    my $comment_or_string = qr{ $comment | $string }xms;

    # these regexes require 5.10+
    my $comment_only = qr{ $comment | $string (*SKIP) (*FAIL) }xms;
    my $string_only  = qr{ $string | $comment (*SKIP) (*FAIL) }xms;

    VECTOR:
    for my $ar_vector (
        [ TEST1, $comment_or_string, C1, C2, S1, S2, ],
        [ TEST1, $comment_only,      C1, C2, ],
        [ TEST1, $string_only,       S1, S2, ],
    ) {
        if (not ref $ar_vector) {  # must be a note...
            note $ar_vector;
            next VECTOR;
        }
        my ($text, $rx, @expected) = @$ar_vector;
        is_deeply [ $text =~ m{ $rx }xmsg ], \@expected,
            # qq{},
            ;
    }  # end for VECTOR
Re: Multiline string and one line comments by davido (Cardinal) on Apr 16, 2014 at 14:59 UTC
This class of problem may be addressed to some degree by the CPAN module Text::Balanced. But it looks like you may run into the harder problem of parsing Perl. The PPI module can be helpful, though there are cases where even parsing is not as straightforward as one would expect.

Regexes are not generally the appropriate solution for things like code parsing or balanced text parsing. You end up working way too hard on a regex solution that still falls short. tchrist gave an excellent write-up on StackOverflow on why it is possible but usually inadvisable to use regexes as the primary engine in parsing non-trivial inputs (in the case of the write-up, he was talking about HTML, but the reply is applicable here as well). See Oh Yes You Can Use Regexes to Parse HTML!.

It all boils down to this: the amount of work required to get a robust solution using regexes for this sort of thing will usually exceed the amount of work you will go through in using a proper parsing tool. It may seem like a lot of work learning to use these other tools, but not as much as it often takes to properly deal with all of the edge cases using only regexes.

Dave
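As a rough illustration of the Text::Balanced suggestion above, the sketch below uses `extract_delimited` to pull one backslash-escaped, double-quoted string off the front of a piece of text. The sample text and the variable names are made up for illustration, and this only handles a single leading string; it is not a scanner for a whole source file.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Illustrative input (an assumption, not from the thread): a quoted string
# that contains an escaped quote and a '#', followed by a real comment.
my $text = q{"I am a \" quoted # string" # and a comment};

# In list context extract_delimited returns the extracted string, the
# remainder of the input, and the skipped prefix; $text itself is left alone.
# Backslash is the default escape character.
my ($extracted, $remainder) = extract_delimited($text, q{"});

print "string:    $extracted\n";
print "remainder: $remainder\n";
```

The `#` inside the quotes stays part of the extracted string, so only the text in `$remainder` would need to be checked for a comment.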
Re: Multiline string and one line comments by Laurent_R (Canon) on Apr 16, 2014 at 09:13 UTC
Using regexes for such a job might be asking too much from regexes. I think you should probably think of a parser. You could write a minimal parser yourself (probably using regexes among other things, but the driving idea should be to parse the input, i.e. separate the tokens and analyze them one by one) or use an existing parsing module and just write the grammar and the code to call and use it. As an example, you could use Damian Conway's Parse-RecDescent module: http://search.cpan.org/~jtbraun/Parse-RecDescent-1.967009/lib/Parse/RecDescent.pm. There may be some more lightweight packages, but I do not know them.
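To make the "minimal parser" idea a little more concrete, here is one possible sketch. It is not taken from any reply in this thread. It scans the text once from left to right, using `\G` with the `/gc` modifiers so that each small regex only looks at the current position. The name `scan_tokens` and the sample input are hypothetical, and the sketch assumes only `#` comments and backslash-escaped single- or double-quoted strings; it knows nothing about heredocs, `q{}`-style quoting, regex literals, or things like `$#array`.

```perl
use strict;
use warnings;

# Hypothetical helper: collect quoted strings and '#' comments in one pass.
sub scan_tokens {
    my ($src) = @_;
    my (@strings, @comments);
    pos($src) = 0;
    while (pos($src) < length($src)) {
        if ($src =~ /\G(#[^\n]*)/gc) {
            push @comments, $1;              # comment runs to end of line
        }
        elsif ($src =~ /\G("(?:[^"\\]|\\.)*")/gcs) {
            push @strings, $1;               # double-quoted, may span lines
        }
        elsif ($src =~ /\G('(?:[^'\\]|\\.)*')/gcs) {
            push @strings, $1;               # single-quoted, may span lines
        }
        else {
            # Skip ordinary text; if stuck on an unterminated quote,
            # step over one character so the loop always advances.
            $src =~ /\G[^#"']+/gc or $src =~ /\G./gcs;
        }
    }
    return (\@strings, \@comments);
}

my $sample = <<'END_SAMPLE';
# a comment with "quotes" inside
my $s = " # not a comment"; # trailing comment
my $e = 'it\'s';
END_SAMPLE

my ($strings, $comments) = scan_tokens($sample);
print "string:  $_\n" for @$strings;
print "comment: $_\n" for @$comments;
```

Because whichever of `#`, `"` or `'` comes first at the current position decides what the token is, a `#` inside a string and a quote inside a comment are never looked at on their own.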
Re^2: Multiline string and one line comments by AskandLearn (Initiate) on Apr 18, 2014 at 03:15 UTC
Yes I agree, took me a while to realize. I got my own solution just by plain scripting. Thanks for pointing it out.

