Regex to ignore comment

crusty_collins has asked for the wisdom of the Perl Monks concerning the following question:

I have a regex that ignores the comment in a string.

$line =~ /pattern\s*=*\s*([\w\d\s\\\/\.\-\@\!\$\%\^\&\*\:\;\,\<\>]+)/

I know that this is NOT the way to go about it. I have tried to capture everything except the comment, like this:

$line =~ /pattern\s*=*\s*(.+?)\s+[^#]/

but it is not working.

QUESTION. What is the proper regex to do this?

use strict;
use warnings;
my $count;
my $env = [];

    foreach my $line ( <DATA> ) {
        chomp($line);
           next if $line =~ /^\s*\#/;
           next if $line =~ /^$/;

        # file is in the format of
        # [zipcode]
        # regex (?<Zip>\d{5})-(?<Sub>\d{4})
        # pattern 95076-1234
        # pattern 90210-6473
        # [IP]
        # regex = (?<First>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Second>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Third>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Fourth>2[0-4]\d|25[0-5]|[01]?\d\d?)
        # pattern = 255.257.0.0 # invalid
        #
        #
        if ($line =~ /^\[([\w\s\\]+)\]/ ) {

            my $tag = $1;

            $count = scalar @{$env};
            $env->[$count]->{tag} = $tag;

        }
        elsif ( $line =~ /regex\s*=*\s*(.+)|regex\s*=*\s*/ ) {

            my $regex = $1;

            if ( $env->[$count]->{'regex'} ) {

                $env->[$count]->{'regex'} .= $regex;

            }else{

                $env->[$count]->{'regex'} = $regex;

            }
        } # comments are ignored
        elsif ( $line =~ /pattern\s*=*\s*([\w\d\s\\\/\.\-\@\!\$\%\^\&\*\:\;\,\<\>]+)|pattern\s*=*\s*/ ){

            my $pattern = $1;

            push( @{$env->[$count]->{pattern}} , $pattern);

        }
    }

__DATA__
[IP]
regex = (?<First>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Second>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Third>2[0-4]\d|25[0-5]|[01]?\d\d?)\.(?<Fourth>2[0-4]\d|25[0-5]|[01]?\d\d?)
pattern = 255.257.0.0 # invalid
pattern = 192.168.1.1


Replies are listed 'Best First'.
Re: Regex to ignore comment by AppleFritter (Vicar) on Oct 20, 2015 at 18:05 UTC
So, if I understand correctly: your file can contain comments on `pattern` lines (and others, too?); comments are indicated by a `#` character and span the entire rest of the line; and you want to process any line that has a comment as if it did not.

Assuming you don't have to pay attention to quoting (i.e. `pattern` declarations that contain quoted `#`'s), I think the easiest way to accomplish that would be to simply remove any possible comments before parsing a line, i.e.:

    while (my $line = <DATA>) {
        chomp $line;
        $line =~ s/\s*#.*$//;
        # ...
    }

Here's a few more tips while I'm at it, too.

* This bit in the `pattern` regex: `[\w\d\s\\\/\.\-\@\!\$\%\^\&\*\:\;\,\<\>]` strikes me as an attempt to negate a character class without actually negating it. Are you trying to say "any character other than `#`" there? If so, you can use `[^#]` instead.
* In the following regex: `/regex\s*=*\s*(.+)|regex\s*=*\s*/` what is the second part of the alternation for? Why not use `(.*)` in the first part instead? (The same thing applies to the `pattern` regex, really.) Speaking of which, you use `$1` even though it may not have captured anything. Is that what you intended? (And you only use `$pattern` once; might as well get rid of it.)
* Use a `while` loop instead of a `foreach` loop; may other monks correct me if I'm wrong, but my understanding is that using `foreach` will cause all data to be read into memory and then iterated over, which could be an issue depending on how much data you need to handle. `while` will read your data one line at a time.
* I'd also suggest fixing the indenting; don't indent the `foreach` loop itself, and use consistent indenting inside it (e.g. don't indent the `next if` statements near the top an extra level).
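A minimal sketch of the two suggestions above (strip the comment first, and read with `while`), under the assumption that `#` never appears inside a value. The helper name `strip_comment` and the sample input are made up for illustration; the in-memory filehandle merely stands in for a real file or `<DATA>`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove a trailing "# ..." comment
# together with any whitespace in front of it.
# Assumption: '#' never appears inside a value.
sub strip_comment {
    my ($line) = @_;
    $line =~ s/\s*#.*$//;
    return $line;
}

# Sample input, read through an in-memory filehandle so that the
# while loop fetches one line at a time, as it would from a real file.
my $text = <<'END';
# a full-line comment
pattern = 255.257.0.0 # invalid
pattern = 192.168.1.1
END

my @patterns;
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    $line = strip_comment($line);
    next if $line =~ /^\s*$/;    # nothing left after stripping
    push @patterns, $1 if $line =~ /^pattern\s*=?\s*(.*)$/;
}
close $fh;

print "$_\n" for @patterns;
```

Because the comment is gone before the line is parsed, the `pattern` regex itself no longer needs to know anything about `#`.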
Re^2: Regex to ignore comment by crusty_collins (Friar) on Oct 20, 2015 at 19:32 UTC
Thanks for the comments, AppleFritter.

I don't want to use the regex `$line =~ s/\s*#.*$//;` because a `#` might be in the pattern, such as

    pattern = the number is #8 # number

where `#8` is in the pattern and `# number` is a comment. I was hoping that I could do a look-behind and catch it that way. But Corion's way of doing this is really the same thing:

    $line =~ s!#.*$!!; # strip off comments
    my( $key, $value ) = split /=/, $line;
Re^3: Regex to ignore comment by AppleFritter (Vicar) on Oct 20, 2015 at 20:19 UTC
I see. But doesn't that make the format itself ambiguous? Put another way, how can you tell the following two apart, programmatically?

    pattern = the number is #8 # number
    pattern = 255.257.0.0 # invalid, and BTW, this comment contains a # character

To a human (or pony) reading this, it's obvious that the comment starts on the second `#` in the first line, and on the first `#` in the second line. But how would a program tell the difference?

This is what I meant by quoting, BTW. If your format required you to write e.g.

    pattern = "the number is #8" # number

to avoid this ambiguity, you'd have to deal with quoting, but at least you'd be able to rely on the first unquoted `#` character on a line to actually indicate a comment.
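A rough sketch of the "first unquoted `#`" rule, assuming a format in which values containing `#` are wrapped in double quotes and quotes cannot be escaped or nested. The helper name `strip_unquoted_comment` is hypothetical, and this is only one way to express the rule.

```perl
use strict;
use warnings;

# Hypothetical helper: cut the line at the first '#' that is not inside
# a double-quoted string.
# Assumptions: quotes are plain "..." pairs, with no escaping or nesting.
sub strip_unquoted_comment {
    my ($line) = @_;
    # Keep a run of ordinary characters and complete "..." groups,
    # then drop everything from the next '#' onwards.
    $line =~ s/^((?:[^"#]|"[^"]*")*)#.*$/$1/;
    $line =~ s/\s+$//;    # trim whitespace left in front of the comment
    return $line;
}

for my $line (
    'pattern = "the number is #8" # number',
    'pattern = 255.257.0.0 # invalid, and BTW, this comment contains a # character',
) {
    print strip_unquoted_comment($line), "\n";
}
```

The quotes stay in the value here; removing them would be a separate step once the comment is gone.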
Re^3: Regex to ignore comment by Laurent_R (Canon) on Oct 20, 2015 at 20:17 UTC
Well, if `#` can be both part of your pattern and an indication that it is a comment to be removed, then you need to specify how to distinguish between the two cases. With an input string such as:

    pattern = the number is #8 # number

this would remove everything after (and including) the last `#` of your string:

    s/#[^#]+$//;

but this assumes that you always have a trailing comment in your input. We have no way to know whether it will work with your other input lines (i.e. whether there is always a trailing comment in your lines).
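To make that caveat concrete, here is a small sketch that applies the same substitution to a line with a trailing comment and to one without. The wrapper name `strip_last_hash` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the reply above.
sub strip_last_hash {
    my ($line) = @_;
    $line =~ s/#[^#]+$//;
    return $line;
}

my $with_comment    = strip_last_hash('pattern = the number is #8 # number');
my $without_comment = strip_last_hash('pattern = the number is #8');

print "[$with_comment]\n";       # comment removed, '#8' kept
print "[$without_comment]\n";    # '#8' removed as well: data is lost
```

On the second line there is no comment at all, so the last `#` is the one belonging to the data, and the substitution removes it.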
Re^3: Regex to ignore comment by Corion (Patriarch) on Oct 20, 2015 at 20:18 UTC
How do you know the difference between:

    pattern = the number is #8 # number

and

    pattern = the number is 9 #8 , changed 2015-10-20

Maybe a comment really starts with `"# "` (hash and then a blank)?
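If the rule really were "a comment starts with a hash followed by a blank", it could be sketched roughly as below. The extra requirement of whitespace in front of the `#` is an added assumption, and `strip_spaced_comment` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: treat '#' as a comment marker only when it has
# whitespace on both sides.
# Assumption: the leading whitespace requirement is not from the thread.
sub strip_spaced_comment {
    my ($line) = @_;
    $line =~ s/\s+#\s.*$//;
    return $line;
}

for my $line (
    'pattern = the number is #8 # number',
    'pattern = the number is 9 #8 , changed 2015-10-20',
) {
    printf "[%s]\n", strip_spaced_comment($line);
}
```

The first line loses only `# number`. The second line comes back unchanged, because its `#8` is not followed by a blank, so under this rule it would be read as data rather than as a comment.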
Re: Regex to ignore comment by Corion (Patriarch) on Oct 20, 2015 at 17:24 UTC
Can you show us data and what you expect to match/reject? Do you want to reject the first line in your __DATA__ section? Would it be easier to simply reject all lines that contain `#` ?
Re^2: Regex to ignore comment by crusty_collins (Friar) on Oct 20, 2015 at 17:53 UTC
I need to get the text before the comment (i.e. `# invalid`):

    __DATA__
    [IP]
    regex = (?<First>.....
    pattern = 255.257.0.0 # invalid
    pattern = 192.168.1.1
Re^3: Regex to ignore comment by Corion (Patriarch) on Oct 20, 2015 at 18:03 UTC
Why don't you just strip off everything starting with `#`?

    $line =~ s!#.*$!!; # strip off comments
    my( $key, $value ) = split /=/, $line;
Re: Regex to ignore comment by u65 (Chaplain) on Oct 20, 2015 at 20:47 UTC
Depending on how much control you have over your data file formats, it might be easier to have a different comment indicator such as '---' or '///', or some other unique pattern which would not appear otherwise.
Re^2: Regex to ignore comment by crusty_collins (Friar) on Oct 21, 2015 at 13:48 UTC
Thanks for all the comments. I have decided to make the comment string something other than #. I like u65's suggestion to use '---'
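For illustration only, a compact sketch of how the parser from the question might look once '---' is the comment marker. The sample data, the `@env` layout and the simplified key regex are assumptions, not the poster's final code.

```perl
use strict;
use warnings;

# Hypothetical sample input using '---' as the comment marker.
my $text = <<'END';
[zipcode]
regex = (?<Zip>\d{5})-(?<Sub>\d{4})
pattern = 95076-1234 --- valid
pattern = the number is #8 --- '#' is now ordinary data
END

my @env;    # one hash per [tag] section
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    $line =~ s/\s*---.*$//;      # strip a trailing '---' comment
    next if $line =~ /^\s*$/;    # skip blank lines

    if ( $line =~ /^\[([^\]]+)\]/ ) {
        push @env, { tag => $1, regex => '', pattern => [] };
    }
    elsif ( @env and $line =~ /^(regex|pattern)\s*=?\s*(.*)$/ ) {
        my ( $key, $value ) = ( $1, $2 );
        if ( $key eq 'regex' ) { $env[-1]{regex} .= $value }
        else                   { push @{ $env[-1]{pattern} }, $value }
    }
}
close $fh;

print "$env[0]{tag}: $_\n" for @{ $env[0]{pattern} };
```

A single `-` inside a value (as in `95076-1234`) is left alone; only a run of three hyphens starts a comment, so values must never contain '---' themselves.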
Re: Regex to ignore comment by Laurent_R (Canon) on Oct 20, 2015 at 17:18 UTC
I am not sure that I understand what you're trying to do, but the simplest way to include comments in a regex is to use the `/x` modifier: anything in your regex from the (unescaped) `#` to the end of the line will be a comment. Otherwise, you might also use the `(?#comment)` construct, but, frankly, the `/x` modifier is more practical.
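A short sketch of both forms, using the zipcode regex from the question's sample format as the example. The variable names are made up, and the named captures need Perl 5.10 or later.

```perl
use strict;
use warnings;

# With /x, whitespace in the regex is ignored and '#' starts a comment
# that runs to the end of the line.
my $zip_re = qr/
    \A
    (?<Zip> \d{5} )   # five-digit ZIP code
    -                 # literal hyphen
    (?<Sub> \d{4} )   # four-digit add-on
    \z
/x;

my %parts;
if ( '95076-1234' =~ $zip_re ) {
    %parts = ( Zip => $+{Zip}, Sub => $+{Sub} );
    print "Zip=$parts{Zip} Sub=$parts{Sub}\n";
}

# The inline (?#...) form works without /x.
my $inline_re = qr/\A\d{5}(?#zip)-\d{4}(?#add-on)\z/;
print "inline form matches\n" if '90210-6473' =~ $inline_re;
```

Under `/x`, a `#` that is meant literally has to be written as `\#` or `[#]`, and literal whitespace has to be escaped or put in a character class as well.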
