Re: How to strip comments and whitespace from a regex defined with /x?

Hi jh

Let me first say that if you intend to do that with a regex or even several regexes, I am afraid this is going to be quite difficult.

To quote from the documentation on the x modifier:

A single /x tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a bracketed character class. You can use this to break up your regular expression into more readable parts. Also, the "#" character is treated as a metacharacter introducing a comment that runs up to the pattern's closing delimiter, or to the end of the current line if the pattern extends onto the next line. Hence, this is very much like an ordinary Perl code comment. (You can include the closing delimiter within the comment only if you precede it with a backslash, so be careful!)
Use of /x means that if you want real whitespace or "#" characters in the pattern (outside a bracketed character class, which is unaffected by /x), then you'll either have to escape them (using backslashes or \Q...\E ) or encode them using octal, hex, or \N{} escapes. It is ineffective to try to continue a comment onto the next line by escaping the \n with a backslash or \Q .

So, for example, you can't just remove everything that follows a # pound sign on a line, because the pound sign may be part of a bracketed character class. That in turn means you need to detect character classes, which is far from trivial in itself. Also, for any pound sign you find, you need to check that it is not escaped by a backslash.
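To make the problem concrete, here is a small sketch of the naive approach going wrong. The pattern is made up for illustration (it is not from the original question): its first # is inside a character class, its second is escaped, and only the third starts a real comment.

```perl
use strict;
use warnings;

# Illustrative pattern (an assumption, not taken from the question).
my $pattern = '[#;] \s* \# \d+  # a comment';

# Under /x the pattern behaves as intended.
my $matches = '; #42' =~ /$pattern/x ? 1 : 0;

# Naive stripping: delete from the first '#' to end of line,
# then drop all remaining whitespace.
(my $naive = $pattern) =~ s/#.*//;
$naive =~ s/\s+//g;

# All that is left is '[', which no longer compiles as a regex.
my $compiles = eval { my $re = qr/$naive/; 1 } ? 1 : 0;

print "original matches: $matches\n";
print "naive result: '$naive', compiles: $compiles\n";
```

The naive substitution cuts the pattern off at the # inside the character class, so what remains is not even a valid regex.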

Assuming that you build a bunch of regexes dealing correctly with pound signs, you then need to deal with whitespace, which is also quite difficult, since backslashed whitespace and whitespace inside a character class must be kept.

So, in brief, it is certainly possible to use regexes to do that, but it is likely to be complex and very difficult.

FWIW, I can think of the following alternatives:

- Roll your own automaton that reads the pattern one character at a time and keeps track of the context at every step: am I within a character class definition? Did I just meet a backslash? etc.
- Use a parser and write your own grammar for it. There are a number of parsing modules on the CPAN, but I am not able to recommend one over the others. I would think this is probably the easiest solution.
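As a rough illustration of the first alternative, here is a minimal hand-rolled scanner. It is only a sketch under stated assumptions, not a complete solution: `strip_x` is a hypothetical name, it handles single /x only (not /xx), and it does not attempt \Q...\E, `(?#...)` comments, or `(?{ })` and `(??{ })` code blocks. Removing whitespace can also fuse adjacent tokens (for example `\1 2` would become `\12`), which this sketch does not guard against.

```perl
use strict;
use warnings;

# Hypothetical helper: remove /x whitespace and comments from a pattern string.
# Assumptions: single /x, no \Q...\E, no (?#...), no (?{ }) or (??{ }).
sub strip_x {
    my ($pat) = @_;
    my $out      = '';
    my $in_class = 0;              # true while inside [...]
    my $len      = length $pat;
    my $i        = 0;

    while ($i < $len) {
        my $c = substr $pat, $i, 1;

        if ($c eq '\\') {          # backslash: copy it and the next char as-is
            $out .= substr $pat, $i, 2;
            $i += 2;
            next;
        }
        if ($in_class) {
            # POSIX class such as [:space:] is copied whole
            if ($c eq '[' && substr($pat, $i) =~ /\A(\[:\^?\w+:\])/) {
                $out .= $1;
                $i += length $1;
                next;
            }
            $in_class = 0 if $c eq ']';
            $out .= $c;            # everything inside a class is kept
            $i++;
            next;
        }
        if ($c eq '[') {           # entering a bracketed character class
            $in_class = 1;
            $out .= $c;
            $i++;
            # a leading ^ and a leading ] belong to the class itself
            if (substr($pat, $i) =~ /\A(\^?\]?)/) {
                $out .= $1;
                $i += length $1;
            }
            next;
        }
        if ($c eq '#') {           # comment: skip to the end of the line
            my $nl = index $pat, "\n", $i;
            $i = $nl < 0 ? $len : $nl + 1;
            next;
        }
        if ($c =~ /\s/) {          # unescaped whitespace outside a class
            $i++;
            next;
        }
        $out .= $c;
        $i++;
    }
    return $out;
}

my $stripped = strip_x("\\d+   # digits\n [ #]+ \\# \\  x");
print "$stripped\n";
```

The scanner keeps only two pieces of state, "inside a character class" and "just saw a backslash" (the latter handled by copying two characters at once), which is exactly the context the questions above refer to.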

Maybe some other monk(s) will be able to suggest a better solution, but that's what I can think of at the moment.

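Please also note that, starting with Perl 5.26, there is also an /xx modifier with different rules: in addition to everything /x does, it causes tabs and spaces inside bracketed character classes to be ignored.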


Re^2: How to strip comments and whitespace from a regex defined with /x? by ikegami (Patriarch) on Jan 21, 2018 at 18:53 UTC
"it is certainly possible to use regexes to do that"
As long as `(?{ })` and `(??{ })` aren't supported.
"Maybe some other monk(s) will be able to suggest a better solution"
You could have Perl compile the pattern and recreate the pattern from the compiled form. This could require maintenance every time Perl is upgraded. Then again, same goes for writing your own parser.
Re^2: How to strip comments and whitespace from a regex defined with /x? by kcott (Archbishop) on Jan 20, 2018 at 22:03 UTC
G'day Laurent,
++
I pulled up some details about '`/xx`' before I noticed you'd already mentioned it. Anyway, purely for completeness, the link I was going to post is: "perl5260delta: New regular expression modifier /xx". Furthermore, the link into perlre that provides is bogus: there is no "`#/x-and-/xx`" fragment identifier; the closest would be "Details on some modifiers" ("`/x` and `/xx`" is the first item in that section).
-- Ken
Re^2: How to strip comments and whitespace from a regex defined with /x? by jh (Beadle) on Jan 29, 2018 at 17:36 UTC
Hi Laurent, I am well aware that using regexes to strip comments and whitespace out of an arbitrary regex is terrible, as that's what I am doing now. Fortunately the regexes I am working with are tightly controlled, and most of them are actually auto-generated, so I can be sure they have no whitespace or hash symbols in them. But the nevertheless-gross nature of my solution made me wonder if there was something better, hence my thought about "asking the Regexp compiler what it has once it's done throwing away comments and whitespace".

