TL;DR: The two code samples below are working pieces of code that can be copied into your Perl script and adapted for your purposes. A compact version is: my ($regex) = map {qr/$_/} join '|', map {quotemeta} sort { length $b <=> length $a or $a cmp $b } keys %map;
I thought it might be useful to explain the technique of building regular expressions dynamically from a set of strings. Let's say you have a list of strings, like ('abc.', 'd$ef', 'gh|i') (that's a $ character, not a variable), and you want to build a regex that matches any of them, like /(?:abc\.|d\$ef|gh\|i)/ - note how the special characters are escaped with backslashes so they lose their special meanings and will be matched literally (more on this below). This also works well with s/search/replacement/ if you have a hash where the keys are the search strings and the values are the replacements, as I'll show below. If you're uncertain on some of the regex concepts used here, like alternations a|b and non-capturing groups (?:...), I recommend perlretut.
First, the basic code, which I explain below - note the numbering on the lines of code.
my @values = qw/ a ab. d ef def g|h /;
my ($regex) = map { qr/$_/ } # 5.
join '|', # 4.
map {quotemeta} # 3.
sort { length $b <=> length $a } # 2.
@values; # 1.
print "$regex\n"; # 6.
- We begin with the list of strings stored in the array @values. This could be any list, such as a literal qw/.../, or return values from functions, including keys or values.
- We sort the list so that the longer strings appear first. This is necessary because if we didn't do this and our regular expression was /foo|foobar/, then applied to the string "foobarfoofoobar", it would only match "foo" three times, and never "foobar". But if the regex is /foobar|foo/, then it would correctly match "foobar", "foo", and again "foobar".
- Next, we apply the quotemeta function to each string, which escapes any metacharacters that might have special meaning in a regex, such as . (dot, matches anything), | (alternation operator), or $ (anchor to end of line/string). In our example, we want the string "g|h" to be matched literally, and not to mean "match g or h". Unescaped metacharacters can also break the syntax of the regex, like stray opening parentheses or similar. Note that quotemeta is the same as using \Q...\E in a regex. As discussed here, you should only drop \Q...\E or quotemeta in the case that you explicitly want metacharacters in your input strings to be special, they come from a trusted source, and you are certain that your strings don't contain any characters that would break your regular expression or expose security holes!
- Then, we join the strings into one long string with the regex alternation operator | in between each string. The string returned by join in the example looks like this: ab\.|def|g\|h|ef|a|d
- This step compiles the regular expression using qr//. If you want to add modifiers such as /i (case-insensitive matching), this would be the place to do it, as in qr/$_/i. This line of code needs a bit of explanation: join from the previous step will return a single string, and so the map will evaluate its code block { qr/$_/ } once, with $_ being the string returned by join. The parentheses in my ($regex) = are required so that map will return the value from its code block (map in "list context"), instead of a count of the values (map in "scalar context") (for a trick on how to avoid the parentheses, see here). Context in Perl is a topic for another tutorial. Please note that if you want to add extra things to match in this qr//, then you most likely will want to write (?:$_) - the reason for this will be explained below. For example, if you want to apply the "word boundary" \b, you need to write qr/\b(?:$_)\b/.
When we print the regular expression, we see that it has become this:
(?^:ab\.|def|g\|h|ef|a|d)
You can now use this precompiled regular expression anywhere, as explained in Compiling and saving regular expressions and perlop, such as:
if ($input=~$regex) {
print "It matches!\n";
}
# or
while ($input=~/($regex)/g) {
print "Found string: $1\n";
}
Note that the qr// operator has implicitly added a non-capturing group (?:...) around the regular expression. This is important when you want to use the regular expression we've just built as part of a larger expression. For example, if your input strings are qw/a b c/ and you write /^$regex$/, then what you probably meant is /^(?:a|b|c)$/. If the non-capturing group wasn't there, then the regex would look like this: /^a|b|c$/, which means "match a only at the beginning of the string, or b anywhere in the string, or c only at the end of the string", which is probably not what you meant! (In the previous step, the same problem can happen, but you're responsible for adding the (?:...) around the $_ yourself, because at that point, $_ is just a plain string, and not yet a precompiled regular expression.)
Search and Replace Using a Hash
my %map = ( a=>1, ab=>23, cd=>45 ); # 1.
my ($regex) = map { qr/$_/ } # 2.
join '|', map {quotemeta}
sort { length $b <=> length $a
or $a cmp $b } # 3.
keys %map;
print "$regex\n"; # 4.
# Now, use the regex
my @strings = qw/ abcd aacd abaab /; # 5.
for (@strings) {
my $before = $_;
s/($regex)/$map{$1}/g; # 6.
print "$before -> $_\n"; # 7.
}
- This is the hash in which the keys are the search strings, and the values are the replacements. As above, this can come from any source.
- This code to build the regex is mostly the same as the above, with the following difference:
- Instead of only sorting by length, this sort first sorts by length, and sorts values with the same length with a stringwise sort. While not strictly necessary, I would recommend this because hashes are unordered by default, meaning that your regex would be in a different order across different runs of the program. Sorting the hash keys like this causes the regex to be in the same order in every run of the program.
- We print the regex for debugging, and see that it looks like this: (?^:ab|cd|a)
- These @strings are the test strings we will apply the regular expression against.
- This is the search and replace operation that matches the keys of the hash, and as a replacement value gets the corresponding value from the hash. Note that the /g modifier is not strictly required (s///g will replace all matches in the string, not just the first), and you can adapt this regex any way you like. So for example, to only make one replacement anchored at the beginning of the string, you can say s/^($regex)/$map{$1}/;.
- The output of the code is:
abcd -> 2345
aacd -> 1145
abaab -> 23123
Thank you to all those who replied to this post as well as this one, in particular thanks to kcott, LanX, AnomalousMonk, and Haarg, whose suggestions ended up in the above!
Hope this helps, -- Hauke D
Updates: 2017-05-14: Merged in the draft text I previously had in this node, made several updates to the text, and removed the "RFC" tag from the title. 2019-05-01: Updated first section regarding $_ in qr// (points 5 and 6), and updated TL;DR with a bit of code.
Re: [RFC] Building Regex Alternations Dynamically
by kcott (Archbishop) on Jan 19, 2017 at 10:11 UTC
|
G'day Hauke,
++
This looks like a very good start;
seems reasonably complete;
and covers most of the points I might have made.
I have comments on two areas, as follows.
With any sort of tutorial, those reading it
— to learn about the subject, rather than for reviewing, proof-reading, etc. —
probably start with limited knowledge.
Accordingly, any terms used should be unambiguous;
unfortunately, you've used $regex to mean two different things:
$regex = join ...
$regex = qr/...
I'm familiar with both the subject matter and the technique, so this posed no problem for me;
however, for someone learning this, it may do.
While it's reasonably obvious in the short code example, half a page later, in the middle of descriptive text,
the appearance of $regex might not be as obvious to the student as it is to you or I.
Consider renaming those; purely as a suggestion:
$regex_base_str = join ...
$regex_compiled = qr/...
In points (4) & (5), in the first list, you show grouping.
To resolve the same issue in both,
you use explicit capture grouping in (4),
and implicit non-capture grouping in (5).
Regex pieces used for alternation often occur as part of a larger regex;
in fact, I suspect that's the more usual case.
This may be as simple as the anchor assertions you show in (4), or could be a lot more complex.
I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it)
as part of the normal technique.
To demonstrate:
# Simple case: OK - matches "a" or "b"
$ perl -E 'my $re = "a|b"; $re = qr{$re}; say $re'
(?^u:a|b)
# Complex case: NOT OK - matches "Xa" or "bY"
$ perl -E 'my $re = "a|b"; $re = qr{X${re}Y}; say $re'
(?^u:Xa|bY)
# Complex case: OK - matches "a" or "b" [fixed with "(?:...)"]
$ perl -E 'my $re = "(?:a|b)"; $re = qr{X${re}Y}; say $re'
(?^u:X(?:a|b)Y)
| [reply] [d/l] [select] |
|
Hi Ken,
Thank you very much for your thoughtful reply!
With any sort of tutorial, those reading it — to learn about the subject, rather than for reviewing, proof-reading, etc. — probably start with limited knowledge. Accordingly, any terms used should be unambiguous; unfortunately, you've used $regex to mean two different things
Excellent point. I've renamed the variables to disambiguate (I kept the names shorter though), and I've added a link to perlretut.
In points (4) & (5), in the first list, you show grouping. To resolve the same issue in both, you use explicit capture grouping in (4), and implicit non-capture grouping in (5).
Another excellent point, I've switched to using the non-capturing groups in all the examples.
I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it) as part of the normal technique.
That is a very good point, but I haven't made the change yet because I need to think on it a bit more. On the one hand, I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be (qr/(?:a|b)/ eq "(?^:(?:a|b))" and on older Perls qr/(?:a|b)/ eq "(?-xism:(?:a|b))"), and also it makes the code to generate the string less elegant (my $regex_str = '(?:'.join('|', map ... ).')'). But those are just stylistic concerns and you're right that it would eliminate the pitfall that I have to discuss at length in points 4 and 5.
The other potential solution, which I'm currently leaning towards, is to skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... I like this latter approach better because it's more robust (no string for the user to potentially misuse), but it does add one more "trick" that has to be explained to the beginner. Currently I feel that the advantages of that outweigh the disadvantages...
Update 2, 2017-05-14: This node used to contain a draft, which I've now incorporated into the root node. I wanted to preserve the original text here:
Thanks, -- Hauke D
| [reply] [d/l] [select] |
|
(The following is perhaps a bit OT to the main thread, or else already touched upon in an update. Oh, well...)
... I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be ... those are just stylistic concerns ...
Please see Re: Recognizing 3 and 4 digit number and thereunder for a long discussion between myself and kcott on these "stylistic concerns." Personally, I still don't see the need for the extra explicit (?:...) wrap step. The implicit wrap becomes explicit quick enough if you print the stringized Regexp object, and this feature of a Regexp object should be deeply understood from the moment one begins to use them.
... skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... ... it's more robust ...
To continue the previous point, I feel it's important to get a regex into a Regexp form as quickly as possible: no dilly-dallying. Once objectified, it can be used atomically when composing more complex regex expressions:
my $rx = qr{ ... }xms;
...
$string =~ m{ $rx* $rx+? $rx{3,4} }xms and do_something();
(Of course, this compositional capability is also addressed by the (DEFINE) predicate of the (?(condition)yes-pattern) conditional expression of Perl 5.10+.)
The only situation which I'm aware of in which this compositional atomicity breaks down is for something like
my ($n, $m) = (3, 4);
...
$string =~ m{ $rx{$n,$m} }xms and do_something();
where $rx{$n} $rx{$n,} $rx{$n,$m} are all taken as hash element accesses. This can be fixed simply by an explicit layer of non-capturing group wrapping (entirely necessary here!):
(?:$rx){$n} (?:$rx){$n,} (?:$rx){$n,$m}
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: [RFC] Building Regex Alternations Dynamically
by duelafn (Parson) on Jan 18, 2017 at 16:14 UTC
|
| [reply] |
|
| [reply] |
Re: [RFC] Building Regex Alternations Dynamically
by Discipulus (Canon) on Jan 19, 2017 at 08:33 UTC
|
double (if possible) ++haukex and immediatlely bookmarked!
In my opinion is not too much to read. Even if there are two arguments discussed they are similar and explained very well in such a plain way that everyone (including me!) can understand. The order of comments let's even a beginner to follow the flow of regex building.
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
| [reply] [d/l] |
Re: [RFC] Building Regex Alternations Dynamically
by LanX (Saint) on Jan 18, 2017 at 13:56 UTC
|
it's almost TL;DR, so two separate posts my have been better. :)
Just three remarks:
- IIRC there is a hard limit for the length of a regex-string ... (or was it only the repetition counter .{x} limited to x aprox 32000 ?)
- "newer" Perls come with something called trie-optimization which will considerably speed up such or-clauses.
- Trie-optimization has definitely a hard limit, where it falls back to (much slower) old behaviour.
| [reply] [d/l] |
|
Hi LanX,
Thanks for your comments, I'll think about whether I can provide a TL;DR version - my thought was that impatient users could just copy-n-paste my code :-)
IIRC there is a hard limit for the length of a regex-string
I'm not sure, I'll have to look into it, but I do know this works :-)
my $word_regex = do {
open my $wfh, '<:utf8', '/usr/share/dict/words' or die $!;
chomp( my @words = <$wfh> );
close $wfh;
my $re = join '|',
map {quotemeta}
sort { length $b <=> length $a or $a cmp $b }
@words;
qr/$re/i };
print length($word_regex), "\n";
__END__
964836
Thanks, -- Hauke D | [reply] [d/l] |
|
> > IIRC there is a hard limit for the length of a regex-string
Maybe not in the string length but for sure in the processing.
something like /x{33000}/ will result in
Quantifier in {,} bigger than 32766 in regex; ...
But that's not exactly relevant for your case :)
| [reply] [d/l] [select] |
|
Hi Rolf,
IIRC there is a hard limit for the length of a regex-string
Other than RAM, I haven't found one yet, I just built a qr// regex from a 1GB text file without alternations and a second from 40MB of text with over 10M alternations.
As you may have seen I added a TL;DR, thanks for the idea!
Regards, -- Hauke D
| [reply] [d/l] |
|
Yeah sorry I had
> > (or was it only the repetition counter .{x} limited to x aprox 32000 ?)
in mind. Since regexes are compiled to op codes hard limits are unlikely.
> As you may have seen I added a TL;DR, thanks for the idea!
Hmm probably it's time to include support for TOC = table of content in my wiki add on :)
| [reply] |
Re: [RFC] Building Regex Alternations Dynamically
by AnomalousMonk (Archbishop) on Jan 19, 2017 at 19:08 UTC
|
Let's say you have a list of strings, like ("abc", "def", "ghi") ...
I think it's very important to highlight the fact that these strings are assumed to be literal strings or their equivalent, and cannot contain regex "operators" per se. Perhaps "Let's say you have a list of literal strings or their moral equivalent, like ..." You make this point further on in point 3 in the first group of bullet points, but I think it should be emphasized early and often.
sort { length $b <=> length $a } # 2.
sort { length $b <=> length $a
or $a cmp $b } # 3.
Your discussion of sorting by length in an alternation of this kind is critical. I feel the desired end can be achieved in a more uniform way, although it's a bit more tricky to explain.
Say we have the set qw(ab abc abcd wx wxy wxyz) of literal strings we wish to build into an alternation. To build a "longest match" ordered alternation, it's essential to have abcd appear before abc and abc before ab in the alternation as you've explained, but it's irrelevant where wx wxy wxyz appear relative to the former subgroup. The reason is that nothing in the literal wx wxy wxyz subgroup can possibly match anything that the literal ab abc abcd subgroup matches and vice versa. So a standard
reverse sort
expression serves in all cases.
c:\@Work\Perl\monks>perl -wMstrict -le
"my ($alt) =
map qr{$_}xms,
join '|',
map quotemeta,
reverse sort
qw(wx ab wxy abc wxyz abcd)
;
print $alt;
"
(?^msx:wxyz|wxy|wx|abcd|abc|ab)
Does it ever matter that members of the wx wxy wxyz subgroup fall before, after, or within the ab abc abcd subgroup? Never. The regex
(?^msx:abcd|wxyz|wxy|abc|ab|wx)
is functionally equivalent (in terms of longest matching) to
(?^msx:wxyz|wxy|wx|abcd|abc|ab)
In addition, there's the whole $b/$a versus $a/$b issue in controlling descending versus ascending sorting. This is very easy to screw up and can be very hard to see to fix; reverse sort eliminates this ambiguity.
A situation in which a performance difference might emerge is if it were known that, e.g., ab was overwhelmingly more likely to occur than wx and so should be tested first (along with all its cohort) to avoid needlessly burning computrons in a heavy-duty matching situation:
(?^msx:abcd|abc|ab|wxyz|wxy|wx)
But this is a situation that requires carefully hand-crafting a regex, and may be outside the scope of your article.
I tend to use a reverse sort step in all cases when I'm building this type of alternation, even when all substrings to be matched are known to be mutually exclusive or when the alternation will be anchored in such a way that alternation match length is not a consideration. (Of course, if the programer wants a shortest match alternation, that's an entirely different question, and also a point upon which you've touched.)
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
So a standard reverse sort expression serves in all cases.
As I /msg'ed you at the time you posted this, I had a doubt about this - and now I've finally found the time to confirm that I wasn't being entirely paranoid ;-) Under locale, it's possible for longer strings to be sorted before shorter ones. Here are the instructions to reproduce (on a Debian-based system):
$ mkdir -v /tmp/localetest; cd /tmp/localetest
# copy "en_US" and the files it refers to via its "copy" statements
$ cp -v /usr/share/i18n/locales/{en_US,en_GB,i18n,iso14651_t1,iso14651
+_t1_common} .
$ mv -v en_US en_TESTING
Now edit the file en_TESTING and insert the following into the LC_COLLATE section after the copy "iso14651_t1" statement:
collating-element <bc> from "<U0062><U0063>"
script <TESTING>
order_start <TESTING>;forward;forward;forward;forward,position
<bc> "<a><c>";"<BAS><BAS>";"<MIN><MIN>";IGNORE
order_end
Now compile and test the locale:
$ localedef -i en_TESTING -f UTF-8 -c ./en_TESTING.UTF-8
$ LOCPATH=/tmp/localetest LC_ALL=en_TESTING.UTF-8 perl -Mlocale -le 'p
+rint for sort qw/ab a b bc/'
a
ab
bc
b
$ LOCPATH=/tmp/localetest LC_ALL=en_TESTING.UTF-8 perl -wMstrict -lMlo
+cale
print "1 $_" for "abcabbc"=~/${\join "|", reverse sort qw(b bc) }/g;
print "2 $_" for "abcabbc"=~/${\join "|", sort {length$b<=>length$a} q
+w(b bc) }/g;
__END__
1 b
1 b
1 b
2 bc
2 b
2 bc
Now of course the real locale definitions are long and I don't plan on reading all of them, and so it's entirely possible that currently no locales exist that define this kind of sorting order. But I'll just be a bit paranoid and stick with sorting by length explicitly :-)
Update: I did incorporate your suggestion about highlighting the fact that strings are matched literally, thank you! | [reply] [d/l] [select] |
|
In all my CS/IS career, I've been able get away with taking no more than a couple of quick steps into locale-land and then turning and scurrying back to safety. I will have to do a lot more reading and experimenting to be able to respond meaningfully to what you've posted, but it looks... interesting.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Building Regex Alternations Dynamically
by jo37 (Deacon) on May 17, 2020 at 09:46 UTC
|
It took some time until I discovered this jewel.
It's great. Thanks a lot!
One thought about the usage of map: Here it needs to be in list context, which is a bit odd when declaring a scalar.
In a version that has two characters more to type this can be avoided:
my $regex = sub { qr/@_/ }->(
join '|',
map {quotemeta}
sort { length $b <=> length $a }
@values);
print "$regex\n";
Greetings, -jo
$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
| [reply] [d/l] [select] |
|
| [reply] [d/l] |
Re: Building Regex Alternations Dynamically (fusselkerl)
by shmem (Chancellor) on Jun 16, 2020 at 17:05 UTC
|
#!/usr/bin/perl
# file fusselkerl
use strict;
die "usage: $0 pattern [file]\n" unless @ARGV;
my $p;
{
my (%s, %i);
my $d = my $c = 1; # our regexp will be inside parens, so first back
+ref is 2
$p = join (
"",
map {
if($s{$_}++){
"\\".$i{$_}
}
else{
$i{$_}=++$c;
$c>$d+1 ? '(?!'.join('|',map{"\\".abs}-$c+1..-$d-1).")(\\w)" :
+ "(\\w)";
}
} split //, shift
);
}
print '(',$p,")\n";
local $" = ', ';
my %s;
while (<>) {
my @l = ();
while (/\b($p)\b/g) {
push @l, $1 unless $s{$1}++;
}
print "@l\n" if @l;
}
try
perl fusselkerl fusselkerl /usr/share/dict/words
edit: oh, I posted that already years ago... Re: How to use a negative backreference in regex?, sorry ;)
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
| [reply] [d/l] [select] |
|
|