PerlMonks  

How to strip comments and whitespace from a regex defined with /x?

by jh (Beadle)
on Jan 19, 2018 at 19:27 UTC

jh has asked for the wisdom of the Perl Monks concerning the following question:

I would like to print out some previously-defined regular expressions in a compact format. The main thing I want to do is strip out the whitespace and comments from those defined with the /x modifier. So if I have two functionally identical regexes:

    our $rgx_plain = qr/^([a-z]+)\d*$/;
    our $rgx_fancy = qr/
        ^           # beginning of string
        (           # begin cap $1
          [a-z]+    # one or more letters
        )           # end cap $1
        \d*         # optional digits
        $           # end of string
    /x;

then I'd like to make a function clean_regex such that

    say $rgx_plain;
    say clean_regex($rgx_fancy);

both print the same thing. Ideally clean_regex is a no-op on regexes defined without the /x modifier, such that

    say clean_regex($rgx_plain);

also prints the same thing. I have hacked together something gross and terrible but I was hoping for something better, presumably asking the Regexp compiler what it has once it's done throwing away comments and whitespace.

Thanks!

Replies are listed 'Best First'.
Re: How to strip comments and whitespace from a regex defined with /x?
by Laurent_R (Canon) on Jan 19, 2018 at 20:55 UTC
    Hi jh

    Let me first say that if you intend to do that with a regex or even several regexes, I am afraid this is going to be quite difficult.

    To quote from the documentation on the x modifier:

    A single /x tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a bracketed character class. You can use this to break up your regular expression into more readable parts. Also, the "#" character is treated as a metacharacter introducing a comment that runs up to the pattern's closing delimiter, or to the end of the current line if the pattern extends onto the next line. Hence, this is very much like an ordinary Perl code comment. (You can include the closing delimiter within the comment only if you precede it with a backslash, so be careful!)

    Use of /x means that if you want real whitespace or "#" characters in the pattern (outside a bracketed character class, which is unaffected by /x), then you'll either have to escape them (using backslashes or \Q...\E ) or encode them using octal, hex, or \N{} escapes. It is ineffective to try to continue a comment onto the next line by escaping the \n with a backslash or \Q .

    So it means, for example, that you can't just remove everything that comes after a # pound sign on a line, because the pound sign may be part of a bracketed character class, which in turn means that you need to detect character classes (and that, in itself, is far from trivial). Also, for any pound sign you find, you need to check that it is not escaped by a backslash.
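
To make the quoted rules concrete, here is a small sketch. The pattern and the test strings are invented for illustration and are not taken from the question. It shows that whitespace and "#" keep their meaning inside a bracketed class or after a backslash, while /x discards them everywhere else:

```perl
use strict;
use warnings;

# Under /x, unescaped whitespace and "# ..." comments are ignored,
# but whitespace and "#" inside a bracketed class, or escaped, still count.
my $re = qr/
    ^ a     # an "a" (this comment is discarded)
    [ #]    # a class matching a literal space or "#"
    \#      # an escaped, literal "#"
    \ b $   # an escaped space followed by "b"
/x;

for my $s ('a## b', 'a # b', 'a#b') {
    printf "%-7s %s\n", "'$s'", ($s =~ $re ? 'matches' : 'does not match');
}
```

The first two strings match and the third does not, so a stripper that simply deleted everything from each # to the end of the line would change what this pattern matches.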

    Assuming that you build a bunch of regexes dealing correctly with pound signs, you then need to deal with white space, which is also quite difficult.

    So, in brief, it is certainly possible to use regexes to do that, but it is likely to be complex and very difficult.

    FWIW, I can think of the following alternatives:

    • Roll your own automaton that reads each character one after the other and remembers the context at any time to decide: am I within a character class definition? Did I just meet a backslash? etc.
    • Use a parser and write your own grammar for it. There are a number of parsing modules on the CPAN, but I am not able to recommend one over the others. I would think this is probably the easiest solution.
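
The first alternative can be sketched roughly as follows. This is a minimal illustration, not a complete solution: strip_x is a hypothetical name, it assumes plain /x (not /xx) semantics and a pattern body passed as a string without delimiters, and it does not handle (?{ }) or (??{ }) code blocks, \Q...\E spans, or POSIX classes such as [[:alpha:]] inside a bracketed class. Removing whitespace can also fuse tokens that /x kept apart (for example, \1 0 would become \10), which this sketch does not guard against.

```perl
use strict;
use warnings;

# Hypothetical helper: strip /x whitespace and comments from a pattern body.
sub strip_x {
    my ($src) = @_;
    my $out      = '';
    my $in_class = 0;    # true while inside a bracketed character class
    my $i        = 0;
    my $len      = length $src;

    while ($i < $len) {
        my $c = substr $src, $i, 1;

        if ($c eq '\\') {                       # escaped char: copy the pair verbatim
            $out .= substr $src, $i, 2;
            $i += 2;
        }
        elsif ($in_class) {                     # inside [...]: everything is literal
            $in_class = 0 if $c eq ']';
            $out .= $c;
            $i++;
        }
        elsif ($c eq '[') {                     # opening a character class
            $in_class = 1;
            $out .= $c;
            $i++;
            # a leading "^" and a "]" in first position do not close the class
            if ($i < $len && substr($src, $i, 1) eq '^') { $out .= '^'; $i++ }
            if ($i < $len && substr($src, $i, 1) eq ']') { $out .= ']'; $i++ }
        }
        elsif (substr($src, $i, 3) eq '(?#') {  # inline comment: skip to ")"
            my $end = index $src, ')', $i;
            $i = $end < 0 ? $len : $end + 1;
        }
        elsif ($c eq '#') {                     # /x comment: skip to end of line
            my $end = index $src, "\n", $i;
            $i = $end < 0 ? $len : $end + 1;
        }
        elsif ($c =~ /\s/) {                    # insignificant whitespace: drop it
            $i++;
        }
        else {                                  # anything else is significant
            $out .= $c;
            $i++;
        }
    }
    return $out;
}

# Usage with a pattern body shaped like the one in the question.
my $fancy = q{
    ^           # beginning of string
    (           # begin capture
      [a-z]+    # one or more letters
    )           # end capture
    \d*         # optional digits
    $           # end of string
};
print strip_x($fancy), "\n";    # prints ^([a-z]+)\d*$
```

Checking the backslash case before the character-class case matters: it keeps an escaped "]" from closing the class and an escaped "#" from starting a comment.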

    Maybe some other monk(s) will be able to suggest a better solution, but that's what I can think of at the moment.

    Please also note that, starting with Perl 5.26, there is also an /xx modifier with different rules.

      it is certainly possible to use regexes to do that

      As long as (?{ }) and (??{ }) aren't supported.

      Maybe some other monk(s) will be able to suggest a better solution

      You could have Perl compile the pattern and recreate the pattern from the compiled form. This could require maintenance every time Perl is upgraded. Then again, same goes for writing your own parser.

      Hi Laurent,

      I am well aware that using regexes to strip comments and whitespace out of an arbitrary regex is terrible, as that's what I am doing now. Fortunately the regexes I am working with are tightly controlled, and most of them are actually auto-generated, so I can be sure they have no whitespace or hash symbols in them.

      But the nevertheless-gross nature of my solution made me wonder if there was something better, hence my thought about "asking the Regexp compiler what it has once it's done throwing away comments and whitespace"
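
One thing the compiled regex object can be asked directly is which modifiers it was compiled with: re::regexp_pattern returns the pattern source and the modifiers separately. A minimal sketch, reusing the variable names from the question (the /x pattern is abbreviated here). It does not strip anything, because the source comes back as written, comments included, but it does tell you whether /x is in effect, which is enough to make a cleaning function a no-op for patterns compiled without /x:

```perl
use strict;
use warnings;
use re qw(regexp_pattern);

our $rgx_plain = qr/^([a-z]+)\d*$/;
our $rgx_fancy = qr/
    ^ ( [a-z]+ )   # letters, captured
    \d* $          # optional digits
/x;

for my $rgx ($rgx_plain, $rgx_fancy) {
    # In list context regexp_pattern returns the pattern source and its flags.
    my ($pattern, $flags) = regexp_pattern($rgx);
    my $uses_x = index($flags, 'x') >= 0;
    print $uses_x ? "needs cleaning\n" : "already compact: $pattern\n";
}
```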

Re: How to strip comments and whitespace from a regex defined with /x?
by RonW (Parson) on Jan 19, 2018 at 22:08 UTC

    Maybe using use re qw(Debug DUMP); would be helpful. See re Debug for more info.

      I had previously checked out the re pragma's debug functions but I didn't find them very useful. For example, either of the regexes in the OP generates:

      synthetic stclass "ANYOF[a-z][]".
      Final program:
         1: BOL (2)
         2: OPEN1 (4)
         4:   PLUS (16)
         5:     ANYOF[a-z][] (0)
        16: CLOSE1 (18)
        18: STAR (20)
        19:   DIGIT (0)
        20: EOL (21)
        21: END (0)
      floating ""$ at 1..2147483647 (checking floating) stclass ANYOF[a-z][] anchored(BOL) minlen 1

      It's heartening that the output for both of them is identical, as I'd expect, but I don't think parsing this output in order to reconstruct the source regex will be significantly easier than modifying the source regex directly :-(

        Probably not easier, but possibly less ambiguous.

        I wonder if constant regexes are compiled once during the compile phase. If so, it would be useful if Deparse reconstructed the regex.

        (Currently reading and posting from my tablet, so will try to remember to try to test this, later.)

Re: How to strip comments and whitespace from a regex defined with /x?
by Anonymous Monk on Jan 20, 2018 at 10:07 UTC

      | It is trivial to do as the hard work has been done for you

      Care to elaborate?

      | what kind of "hell hole" are you working for that they want you to do this

      I have regexes in Perl that I want to export to JavaScript, PHP, and Ruby. At least JS doesn't support the /x modifier.

        Care to elaborate?

        Hehe, ok

        In this serialization the whitespace tokens show up as " ", but in the wx GUI you can see that they are simply PPIx::Regexp::Token::Whitespace=HASH(0x178d3f4) objects.

        So all you do is walk the tree and delete the nodes you don't want.

        When done, serialize the tree; what you're left with is the non-/x version.

        html explanation

        Here is the text version from the GUI; it also shows PPIx::Regexp::Token::Comment.

        So here is where you'd start: pluck the whitespace out of this tree.

        Maybe use Data::Diver ( data diver )

        Or maybe just copy http://search.cpan.org/perldoc/PPIx::Regexp::Node#find

        The GUI is only for interactive visualization.

        $ perl -MPPIx::Regexp -MData::Dump -le " $r=PPIx::Regexp->new(q{s{ \d+    \w+ }{}gx}); dd( $r ); print $r->content; "
        bless({
          children => [
            bless({ content => "s" }, "PPIx::Regexp::Token::Structure"),
            bless({
              children => [
                bless({ content => "\\d" }, "PPIx::Regexp::Token::CharClass::Simple"),
                bless({ content => "+" }, "PPIx::Regexp::Token::Quantifier"),
                bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
                bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
                bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
                bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
                bless({ content => "\\w" }, "PPIx::Regexp::Token::CharClass::Simple"),
                bless({ content => "+" }, "PPIx::Regexp::Token::Quantifier"),
                bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
              ],
              finish => [bless({ content => "}" }, "PPIx::Regexp::Token::Delimiter")],
              max_capture_number => 0,
              start => [
                bless({ content => "{" }, "PPIx::Regexp::Token::Delimiter"),
                bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
              ],
              type => [],
            }, "PPIx::Regexp::Structure::Regexp"),
            bless({
              children => [],
              finish => [bless({ content => "}" }, "PPIx::Regexp::Token::Delimiter")],
              start => [bless({ content => "{" }, "PPIx::Regexp::Token::Delimiter")],
              type => [],
            }, "PPIx::Regexp::Structure::Replacement"),
            bless({ content => "gx", modifiers => { g => 1, x => 1 } }, "PPIx::Regexp::Token::Modifier"),
          ],
          effective_modifiers => { g => 1, x => 1 },
          failures => 0,
          source => "s{ \\d+    \\w+ }{}gx",
        }, "PPIx::Regexp")
        s{ \d+    \w+ }{}gx

Re: How to strip comments and whitespace from a regex defined with /x?
by Anonymous Monk on Jan 20, 2018 at 16:31 UTC

    And don't forget the effects of use re '/x', which means you can't just look at the qualifiers specified on the regex to determine whether blanks are significant. Automating this by looking at the chunk of Perl containing the regex and figuring out which use re statements are in scope is also non-trivial.

    PPIx::Regexp is an automaton that tries to pull a regex apart and see what makes it tick. It does make an attempt to figure out what is significant and what not, but you have to tell its parser what default qualifiers are in effect. There is no machinery to deparse the regex with insignificant elements removed, but a roll-your-own solution should not be too bad. Disclaimer: I am the author, and have a lively appreciation for the possibility of bugs, especially in the plethora of off-the-beaten-path, edge, and corner cases.

Re: How to strip comments and whitespace from a regex defined with /x?
by Anonymous Monk on Jan 20, 2018 at 17:30 UTC

    ... and don't forget /(?# this is a comment )/.

Re: How to strip comments and whitespace from a regex defined with /x? (ppix_regexp_strip_comments)
by Anonymous Monk on Apr 24, 2018 at 02:28 UTC
    As a followup to Re^3: How to strip comments and whitespace from a regex defined with /x?
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use PPIx::Regexp;
    use Data::Dump;

    my $r = PPIx::Regexp->new(q{
    s{
    (?# probulary )
    \d+ # digits
    [ ] # space
    \s+ # spaces
    \w+ # words 
    (?# probulary )}{}gx
    });

    print "\n\n", $r->content, "\n";
    dd( $r );
    delete_commentsss( $r );
    print "\n\n";
    dd($r);
    print "\n\n", $r->content, "\n";

    sub delete_commentsss {
        my( $n ) = @_;
        for my $child ( eval { $n->children } ){
            if( eval{ $child->children } ){ #$haskids
                delete_commentsss( $child );
            } else {
                for my $il ( qw{ start children finish } ){
                    if( $n->{$il} ){
                        @{ $n->{$il} } = grep {
                            not(
                                $_->isa( "PPIx::Regexp::Token::Whitespace" )
                                or $_->isa( "PPIx::Regexp::Token::Comment" )
                            )
                        } @{ $n->{$il} };
                    }
                }
            }
        }
        return $n;
    }
    __END__

    s{
    (?# probulary )
    \d+ # digits
    [ ] # space
    \s+ # spaces
    \w+ # words 
    (?# probulary )}{}gx

    bless({
      children => [
        bless({ content => "\n", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
        bless({ content => "s" }, "PPIx::Regexp::Token::Structure"),
        bless({
          children => [
            bless({ content => "\\d" }, "PPIx::Regexp::Token::CharClass::Simple"),
            bless({ content => "+" }, "PPIx::Regexp::Token::Quantifier"),
            bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
            bless({ content => "# digits\n" }, "PPIx::Regexp::Token::Comment"),
            bless({
              children => [bless({ content => " " }, "PPIx::Regexp::Token::Literal")],
              finish => [bless({ content => "]" }, "PPIx::Regexp::Token::Structure")],
              start => [bless({ content => "[" }, "PPIx::Regexp::Token::Structure")],
              type => [],
            }, "PPIx::Regexp::Structure::CharClass"),
            bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
            bless({ content => "# space\n" }, "PPIx::Regexp::Token::Comment"),
            bless({ content => "\\s" }, "PPIx::Regexp::Token::CharClass::Simple"),
            bless({ content => "+" }, "PPIx::Regexp::Token::Quantifier"),
            bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
            bless({ content => "# spaces\n" }, "PPIx::Regexp::Token::Comment"),
            bless({ content => "\\w" }, "PPIx::Regexp::Token::CharClass::Simple"),
            bless({ content => "+" }, "PPIx::Regexp::Token::Quantifier"),
            bless({ content => " ", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
            bless({ content => "# words \n" }, "PPIx::Regexp::Token::Comment"),
            bless({ content => "(?# probulary )" }, "PPIx::Regexp::Token::Comment"),
          ],
          finish => [bless({ content => "}" }, "PPIx::Regexp::Token::Delimiter")],
          max_capture_number => 0,
          start => [
            bless({ content => "{" }, "PPIx::Regexp::Token::Delimiter"),
            bless({ content => "\n", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
            bless({ content => "(?# probulary )" }, "PPIx::Regexp::Token::Comment"),
            bless({ content => "\n", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
          ],
          type => [],
        }, "PPIx::Regexp::Structure::Regexp"),
        bless({ content => "{" }, "PPIx::Regexp::Token::Delimiter"),
        bless({ content => "}" }, "PPIx::Regexp::Token::Delimiter"),
        bless({ content => "gx", modifiers => { g => 1, x => 1 } }, "PPIx::Regexp::Token::Modifier"),
        bless({ content => "\n", perl_version_introduced => "5.000" }, "PPIx::Regexp::Token::Whitespace"),
      ],
      effective_modifiers => { g => 1, x => 1 },
      failures => 0,
      source => "\ns{\n(?# probulary )\n\\d+ # digits\n[ ] # space\n\\s+ # spaces\n\\w+ # words \n(?# probulary )}{}gx\n",
    }, "PPIx::Regexp")

    bless({
      children => [
        bless({ content => "s" }, "PPIx::Regexp::Token::Structure"),
        bless({
          children => [
            bless({ content => "\\d" }, "PPIx::Regexp::Token::CharClass::Simple"),
            bless({ content => "+" }, "PPIx::Regexp::Token::Quantifier"),
            bless({
              children => [bless({ content => " " }, "PPIx::Regexp::Token::Literal")],
              finish => [bless({ content => "]" }, "PPIx::Regexp::Token::Structure")],
              start => [bless({ content => "[" }, "PPIx::Regexp::Token::Structure")],
              type => [],
            }, "PPIx::Regexp::Structure::CharClass"),
            bless({ content => "\\s" }, "PPIx::Regexp::Token::CharClass::Simple"),
            bless({ content => "+" }, "PPIx::Regexp::Token::Quantifier"),
            bless({ content => "\\w" }, "PPIx::Regexp::Token::CharClass::Simple"),
            bless({ content => "+" }, "PPIx::Regexp::Token::Quantifier"),
          ],
          finish => [bless({ content => "}" }, "PPIx::Regexp::Token::Delimiter")],
          max_capture_number => 0,
          start => [bless({ content => "{" }, "PPIx::Regexp::Token::Delimiter")],
          type => [],
        }, "PPIx::Regexp::Structure::Regexp"),
        bless({ content => "{" }, "PPIx::Regexp::Token::Delimiter"),
        bless({ content => "}" }, "PPIx::Regexp::Token::Delimiter"),
        bless({ content => "gx", modifiers => { g => 1, x => 1 } }, "PPIx::Regexp::Token::Modifier"),
      ],
      effective_modifiers => { g => 1, x => 1 },
      failures => 0,
      source => "\ns{\n(?# probulary )\n\\d+ # digits\n[ ] # space\n\\s+ # spaces\n\\w+ # words \n(?# probulary )}{}gx\n",
    }, "PPIx::Regexp")

    s{\d+[ ]\s+\w+}{}gx