PerlMonks  

Regex to ignore comment

by crusty_collins (Friar)
on Oct 20, 2015 at 16:51 UTC ( [id://1145460]=perlquestion )

crusty_collins has asked for the wisdom of the Perl Monks concerning the following question:

I have a regex that ignores the comment in a string.

$line =~ /pattern\s*=*\s*([\w\d\s\\\/\.\-\@\!\$\%\^\&\*\:\;\,\<\>]+)/

I know that this is NOT the way to go about it. I have tried to capture everything up to the comment, like this:

$line =~ /pattern\s*=*\s*(.+?)\s+[^#]/
but it is not working.

QUESTION. What is the proper regex to do this?

use strict;
use warnings;
my $count;
my $env = [];
    foreach my $line ( <DATA> ) {
        chomp($line);
            next if $line =~ /^\s*\#/;
            next if $line =~ /^$/;
        # file is in the format of
        # [zipcode]
        # regex (?<Zip>\d{5})-(?<Sub>\d{4})
        # pattern 95076-1234
        # pattern 90210-6473
        # [IP]
        # regex = (?<First>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Second>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Third>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Fourth>2[0-4]\d|25[0-5]|[01]?\d\d?)
        # pattern = 255.257.0.0 # invalid
        #
        #
        if ($line =~ /^\[([\w\s\\]+)\]/ ) {
            my $tag = $1;
            $count = scalar @{$env};
            $env->[$count]->{tag} = $tag;
        }
        elsif ( $line =~ /regex\s*=*\s*(.+)|regex\s*=*\s*/ ) {
            my $regex = $1;
            if ( $env->[$count]->{'regex'} ) {
                $env->[$count]->{'regex'} .= $regex;
            }else{
                $env->[$count]->{'regex'} = $regex;
            }
        }
        # comments are ignored
        elsif ( $line =~ /pattern\s*=*\s*([\w\d\s\\\/\.\-\@\!\$\%\^\&\*\:\;\,\<\>]+)|pattern\s*=*\s*/ ){
            my $pattern = $1;
            push( @{$env->[$count]->{pattern}} , $pattern);
        }
    }
__DATA__
[IP]
regex = (?<First>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Second>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Third>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Fourth>2[0-4]\d|25[0-5]|[01]?\d\d?)
pattern = 255.257.0.0 # invalid
pattern = 192.168.1.1

Replies are listed 'Best First'.
Re: Regex to ignore comment
by AppleFritter (Vicar) on Oct 20, 2015 at 18:05 UTC

    So, if I understand correctly:

    • your file can contain comments on pattern lines (and others, too?);
    • comments are indicated by a # character, and span the entire rest of the line;
    • you want to process any line that has comments as if it did not.

    Assuming you don't have to pay attention to quoting (i.e. pattern declarations that contain quoted #'s), I think the easiest way to accomplish that would be to simply remove any possible comments before parsing a line, i.e.:

    while(my $line = <DATA>) {
        chomp $line;
        $line =~ s/\s*#.*$//;
        # ...
    }

    Here's a few more tips while I'm at it, too.

    • This bit in the pattern regex:

      [\w\d\s\\\/\.\-\@\!\$\%\^\&\*\:\;\,\<\>]

      strikes me as an attempt to negate a character class without actually negating it. Are you trying to say "any character other than #" there? If so, you can use [^#] instead.

    • In the following regex:

      /regex\s*=*\s*(.+)|regex\s*=*\s*/

      What is the second part of the alternation for? Why not use (.*) in the first part instead? (The same thing applies to the pattern regex, really.)

    • Speaking of which, you use $1 even though it may not have captured anything. Is that what you intended? (And you only use $pattern once; might as well get rid of it.)

    • Use a while loop instead of a foreach loop; may other monks correct me if I'm wrong, but my understanding is that using foreach will cause all data to be read into memory and then iterated over, which could be an issue depending on how much data you need to handle. while will read your data one line at a time.

    • I'd also suggest fixing the indenting; don't indent the foreach loop itself, and use consistent indenting inside it (e.g. don't indent the next if statements near the top an extra level).
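
A minimal sketch of how several of the tips above might fit together: a while loop, comments stripped before parsing, and (.*) in place of the alternation. It assumes that every # starts a comment (no quoting). The sub name parse_config, the in-memory test data, and the shortened IP regex are hypothetical stand-ins, not the original poster's code.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the config format into a list of sections.
# Assumption: every '#' starts a comment (no quoting is supported).
sub parse_config {
    my ($fh) = @_;
    my @env;
    while ( my $line = <$fh> ) {      # reads one line at a time
        chomp $line;
        $line =~ s/\s*#.*$//;         # drop the comment and the whitespace before it
        next if $line =~ /^\s*$/;     # skip lines that are empty after stripping

        if ( $line =~ /^\[([\w\s\\]+)\]/ ) {
            push @env, { tag => $1 }; # start a new section
        }
        elsif ( @env && $line =~ /^\s*regex\s*=?\s*(.*)/ ) {
            $env[-1]{regex} .= $1;    # (.*) also covers an empty value
        }
        elsif ( @env && $line =~ /^\s*pattern\s*=?\s*(.*)/ ) {
            push @{ $env[-1]{pattern} }, $1;
        }
    }
    return \@env;
}

# Shortened sample data, for illustration only
my $data = <<'END';
[IP]
regex = \d+\.\d+\.\d+\.\d+
pattern = 255.257.0.0 # invalid
pattern = 192.168.1.1
END

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $env = parse_config($fh);
print "$_\n" for @{ $env->[0]{pattern} };
```

Stripping the comment first keeps the later regexes simple. If a # may legitimately appear inside a value, this approach is not sufficient.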

      Thanks for the comments AppleFritter

      I don't want to use the substitution $line =~ s/\s*#.*$//; because a # might be part of the pattern, such as

      pattern = the number is #8 # number

      where #8 is in the pattern

      and # number is a comment

      I was hoping that I could use a lookbehind and catch it that way.

      But Corion's way of doing this is really the same thing.

      $line =~ s!#.*$!!; # strip off comments
      my( $key, $value ) = split /=/, $line;

        I see. But doesn't that make the format itself ambiguous? Put another way, how can you tell the following two apart, programmatically?

        pattern = the number is #8       # number
        
        pattern = 255.257.0.0            # invalid, and BTW, this comment contains a # character
        

        To a human (or pony) reading this, it's obvious that the comment starts on the second # in the first line, and on the first # in the second line. But how would a program tell the difference?

        This is what I meant by quoting, BTW. If your format required you to write e.g.

        pattern = "the number is #8"     # number
        

        to avoid this ambiguity, you'd have to deal with quoting, but at least you'd be able to rely on the first unquoted # character on a line to actually indicate a comment.
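
A rough sketch of the quoting idea, assuming double quotes are always balanced and that there is no escape syntax for a quote inside a quoted value. The sub name strip_unquoted_comment is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: cut the line at the first '#' that is not inside
# double quotes. Assumptions: quotes are balanced, and there is no way to
# escape a quote inside a quoted string.
sub strip_unquoted_comment {
    my ($line) = @_;
    # From the start of the line, consume whole quoted strings or single
    # characters that are neither a quote nor '#'. Matching stops at the
    # first unquoted '#'.
    if ( $line =~ /^((?:"[^"]*"|[^"#])*)/ ) {
        $line = $1;
    }
    $line =~ s/\s+$//;    # trim whitespace left in front of the comment
    return $line;
}

print strip_unquoted_comment('pattern = "the number is #8"     # number'), "\n";
print strip_unquoted_comment('pattern = 255.257.0.0            # invalid, with a # inside'), "\n";
```

With an unbalanced quote this sketch drops everything from that quote onward, so a real parser would probably want to report such lines as errors instead.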

        Well if # can be both part of your pattern and an indication that it is a comment to be removed, then you need to specify how to distinguish between the two cases.

        With an input string such as:

        pattern = the number is #8 # number

        this would remove everything after (and including) the last # of your string:

        s/#[^#]+$//;

        but this assumes that you always have a trailing comment in your input.

        We have no way to know whether it will work with your other input lines (i.e. whether there is always a trailing comment in your lines).
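
To illustrate that caveat, here is a small sketch that applies the same substitution to a line with a trailing comment and to one without. The wrapper name strip_last_hash is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the reply above:
# removes everything from the last '#' to the end of the string.
sub strip_last_hash {
    my ($line) = @_;
    $line =~ s/#[^#]+$//;
    return $line;
}

for my $line (
    'pattern = the number is #8 # number',    # has a trailing comment
    'pattern = the number is #8',             # no trailing comment
) {
    printf "%-38s => '%s'\n", $line, strip_last_hash($line);
}
```

The first line keeps "#8" (plus a trailing space), but the second loses it, because the only # on that line belongs to the pattern.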

        How do you know the difference between:

        pattern = the number is #8 # number

        and

        pattern = the number is 9 #8 , changed 2015-10-20

        Maybe a comment really starts with "# " (hash and then a blank)?
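
If the format did adopt that convention, a sketch might look like the following. It assumes a comment is a # followed by whitespace, so that "#8" stays part of the value. The sub name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper. Assumption: a comment starts with '#' followed by
# whitespace; a '#' directly followed by anything else is part of the value.
sub strip_hash_space_comment {
    my ($line) = @_;
    $line =~ s/\s*#\s.*$//;    # also removes whitespace in front of the comment
    return $line;
}

print strip_hash_space_comment('pattern = the number is #8 # number'), "\n";
print strip_hash_space_comment('pattern = the number is 9 #8 , changed 2015-10-20'), "\n";
```

Under this rule the second line is left untouched, which means "#8 , changed 2015-10-20" is treated as part of the pattern. Whether that is the intended reading is exactly the ambiguity raised above.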

Re: Regex to ignore comment
by Corion (Patriarch) on Oct 20, 2015 at 17:24 UTC

    Can you show us data and what you expect to match/reject? Do you want to reject the first line in your __DATA__ section? Would it be easier to simply reject all lines that contain # ?

      I need to get the text before the comment, i.e. before # invalid

      __DATA__
      [IP]
      regex = (?<First>.....
      pattern = 255.257.0.0 # invalid
      pattern = 192.168.1.1

        Why don't you just strip off everything starting with #?

        $line =~ s!#.*$!!; # strip off comments
        my( $key, $value ) = split /=/, $line;
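
A slightly expanded sketch of this strip-then-split approach. The limit of 2 on split and the whitespace trimming are additions that go beyond the snippet above, and the sub name parse_line is hypothetical. As before, it assumes every # starts a comment.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the comment, then split into key and value.
# Assumption: every '#' starts a comment.
sub parse_line {
    my ($line) = @_;
    $line =~ s!#.*$!!;                            # strip off comments
    my ( $key, $value ) = split /=/, $line, 2;    # limit 2 keeps later '=' in the value
    for ( $key, $value ) {
        next unless defined;                      # nothing to trim
        s/^\s+//;                                 # trim leading whitespace
        s/\s+$//;                                 # trim trailing whitespace
    }
    return ( $key, $value );
}

my ( $key, $value ) = parse_line('pattern = 255.257.0.0 # invalid');
print "$key => $value\n";
```

The limit matters for regex lines, which may contain = characters of their own.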
Re: Regex to ignore comment
by u65 (Chaplain) on Oct 20, 2015 at 20:47 UTC

    Depending on how much control you have over your data file formats, it might be easier to have a different comment indicator such as '---' or '///', or some other unique pattern which would not appear otherwise.

      Thanks for all the comments.

      I have decided to make the comment string something other than #.

      I like u65's suggestion to use '---'
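
A minimal sketch of what that could look like, assuming '---' never occurs inside a value. The sub name strip_dash_comment is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper. Assumption: '---' is reserved as the comment marker
# and never appears inside a value.
sub strip_dash_comment {
    my ($line) = @_;
    $line =~ s/\s*---.*$//;    # remove the marker, the comment, and leading whitespace
    return $line;
}

print strip_dash_comment('pattern = the number is #8 --- number'), "\n";
print strip_dash_comment('pattern = 95076-1234'), "\n";
```

A single hyphen, as in 95076-1234, is unaffected, and # no longer needs any special treatment.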

Re: Regex to ignore comment
by Laurent_R (Canon) on Oct 20, 2015 at 17:18 UTC
    I am not sure that I understand what you're trying to do, but the simplest way to include comments in a regex is to use the /x modifier: anything in your regex from the (unescaped) # to the end of the line will be a comment.

    Otherwise, you might also use the (?#comment) construct, but, frankly, the /x modifier is more practical.
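
For illustration, here is the zipcode regex from the original post's comments written both ways. The variable names and the ^ and $ anchors are additions for this sketch.

```perl
use strict;
use warnings;

# With /x, whitespace in the regex is ignored and an unescaped '#' starts
# a comment that runs to the end of the line.
my $zip_re = qr/
    ^
    (?<Zip> \d{5} )    # five-digit ZIP code
    -                  # literal hyphen
    (?<Sub> \d{4} )    # four-digit extension
    $
/x;

# The same match using inline (?#...) comments, without /x.
my $inline_re = qr/^\d{5}(?#ZIP)-\d{4}(?#extension)$/;

if ( '95076-1234' =~ $zip_re ) {
    print "Zip=$+{Zip} Sub=$+{Sub}\n";
}
print "inline form matches too\n" if '95076-1234' =~ $inline_re;
```

Note that these are comments inside the regex itself; they do not help with skipping comments in the data being matched.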
