Re: [RFC] Building Regex Alternations Dynamically
by kcott (Archbishop) on Jan 19, 2017 at 10:11 UTC
|
G'day Hauke,
++
This looks like a very good start;
seems reasonably complete;
and covers most of the points I might have made.
I have comments on two areas, as follows.
With any sort of tutorial, those reading it
— to learn about the subject, rather than for reviewing, proof-reading, etc. —
probably start with limited knowledge.
Accordingly, any terms used should be unambiguous;
unfortunately, you've used $regex to mean two different things:
$regex = join ...
$regex = qr/...
I'm familiar with both the subject matter and the technique, so this posed no problem for me;
however, for someone learning this, it may do.
While it's reasonably obvious in the short code example, half a page later, in the middle of descriptive text,
the appearance of $regex might not be as obvious to the student as it is to you or I.
Consider renaming those; purely as a suggestion:
$regex_base_str = join ...
$regex_compiled = qr/...
In points (4) & (5), in the first list, you show grouping.
To resolve the same issue in both,
you use explicit capture grouping in (4),
and implicit non-capture grouping in (5).
Regex pieces used for alternation often occur as part of a larger regex;
in fact, I suspect that's the more usual case.
This may be as simple as the anchor assertions you show in (4), or could be a lot more complex.
I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it)
as part of the normal technique.
To demonstrate:
# Simple case: OK - matches "a" or "b"
$ perl -E 'my $re = "a|b"; $re = qr{$re}; say $re'
(?^u:a|b)
# Complex case: NOT OK - matches "Xa" or "bY"
$ perl -E 'my $re = "a|b"; $re = qr{X${re}Y}; say $re'
(?^u:Xa|bY)
# Complex case: OK - matches "a" or "b" [fixed with "(?:...)"]
$ perl -E 'my $re = "(?:a|b)"; $re = qr{X${re}Y}; say $re'
(?^u:X(?:a|b)Y)
| [reply] [d/l] [select] |
|
Hi Ken,
Thank you very much for your thoughtful reply!
With any sort of tutorial, those reading it — to learn about the subject, rather than for reviewing, proof-reading, etc. — probably start with limited knowledge. Accordingly, any terms used should be unambiguous; unfortunately, you've used $regex to mean two different things
Excellent point. I've renamed the variables to disambiguate (I kept the names shorter though), and I've added a link to perlretut.
In points (4) & (5), in the first list, you show grouping. To resolve the same issue in both, you use explicit capture grouping in (4), and implicit non-capture grouping in (5).
Another excellent point, I've switched to using the non-capturing groups in all the examples.
I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it) as part of the normal technique.
That is a very good point, but I haven't made the change yet because I need to think on it a bit more. On the one hand, I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be (qr/(?:a|b)/ eq "(?^:(?:a|b))" and on older Perls qr/(?:a|b)/ eq "(?-xism:(?:a|b))"), and also it makes the code to generate the string less elegant (my $regex_str = '(?:'.join('|', map ... ).')'). But those are just stylistic concerns and you're right that it would eliminate the pitfall that I have to discuss at length in points 4 and 5.
The other potential solution, which I'm currently leaning towards, is to skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... I like this latter approach better because it's more robust (no string for the user to potentially misuse), but it does add one more "trick" that has to be explained to the beginner. Currently I feel that the advantages of that outweigh the disadvantages...
Update 2, 2017-05-14: This node used to contain a draft, which I've now incorporated into the root node. I wanted to preserve the original text here:
Thanks, -- Hauke D
| [reply] [d/l] [select] |
|
(The following is perhaps a bit OT to the main thread, or else already touched upon in an update. Oh, well...)
... I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be ... those are just stylistic concerns ...
Please see Re: Recognizing 3 and 4 digit number and thereunder for a long discussion between myself and kcott on these "stylistic concerns." Personally, I still don't see the need for the extra explicit (?:...) wrap step. The implicit wrap becomes explicit quick enough if you print the stringized Regexp object, and this feature of a Regexp object should be deeply understood from the moment one begins to use them.
... skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... ... it's more robust ...
To continue the previous point, I feel it's important to get a regex into a Regexp form as quickly as possible: no dilly-dallying. Once objectified, it can be used atomically when composing more complex regex expressions:
my $rx = qr{ ... }xms;
...
$string =~ m{ $rx* $rx+? $rx{3,4} }xms and do_something();
(Of course, this compositional capability is also addressed by the (DEFINE) predicate of the (?(condition)yes-pattern) conditional expression of Perl 5.10+.)
The only situation which I'm aware of in which this compositional atomicity breaks down is for something like
my ($n, $m) = (3, 4);
...
$string =~ m{ $rx{$n,$m} }xms and do_something();
where $rx{$n} $rx{$n,} $rx{$n,$m} are all taken as hash element accesses. This can be fixed simply by an explicit layer of non-capturing group wrapping (entirely necessary here!):
(?:$rx){$n} (?:$rx){$n,} (?:$rx){$n,$m}
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: [RFC] Building Regex Alternations Dynamically
by duelafn (Parson) on Jan 18, 2017 at 16:14 UTC
|
| [reply] |
|
| [reply] |
Re: [RFC] Building Regex Alternations Dynamically
by Discipulus (Canon) on Jan 19, 2017 at 08:33 UTC
|
double (if possible) ++haukex and immediatlely bookmarked!
In my opinion is not too much to read. Even if there are two arguments discussed they are similar and explained very well in such a plain way that everyone (including me!) can understand. The order of comments let's even a beginner to follow the flow of regex building.
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
| [reply] [d/l] |
Re: [RFC] Building Regex Alternations Dynamically
by AnomalousMonk (Archbishop) on Jan 19, 2017 at 19:08 UTC
|
Let's say you have a list of strings, like ("abc", "def", "ghi") ...
I think it's very important to highlight the fact that these strings are assumed to be literal strings or their equivalent, and cannot contain regex "operators" per se. Perhaps "Let's say you have a list of literal strings or their moral equivalent, like ..." You make this point further on in point 3 in the first group of bullet points, but I think it should be emphasized early and often.
sort { length $b <=> length $a } # 2.
sort { length $b <=> length $a
or $a cmp $b } # 3.
Your discussion of sorting by length in an alternation of this kind is critical. I feel the desired end can be achieved in a more uniform way, although it's a bit more tricky to explain.
Say we have the set qw(ab abc abcd wx wxy wxyz) of literal strings we wish to build into an alternation. To build a "longest match" ordered alternation, it's essential to have abcd appear before abc and abc before ab in the alternation as you've explained, but it's irrelevant where wx wxy wxyz appear relative to the former subgroup. The reason is that nothing in the literal wx wxy wxyz subgroup can possibly match anything that the literal ab abc abcd subgroup matches and vice versa. So a standard
reverse sort
expression serves in all cases.
c:\@Work\Perl\monks>perl -wMstrict -le
"my ($alt) =
map qr{$_}xms,
join '|',
map quotemeta,
reverse sort
qw(wx ab wxy abc wxyz abcd)
;
print $alt;
"
(?^msx:wxyz|wxy|wx|abcd|abc|ab)
Does it ever matter that members of the wx wxy wxyz subgroup fall before, after, or within the ab abc abcd subgroup? Never. The regex
(?^msx:abcd|wxyz|wxy|abc|ab|wx)
is functionally equivalent (in terms of longest matching) to
(?^msx:wxyz|wxy|wx|abcd|abc|ab)
In addition, there's the whole $b/$a versus $a/$b issue in controlling descending versus ascending sorting. This is very easy to screw up and can be very hard to see to fix; reverse sort eliminates this ambiguity.
A situation in which a performance difference might emerge is if it were known that, e.g., ab was overwhelmingly more likely to occur than wx and so should be tested first (along with all its cohort) to avoid needlessly burning computrons in a heavy-duty matching situation:
(?^msx:abcd|abc|ab|wxyz|wxy|wx)
But this is a situation that requires carefully hand-crafting a regex, and may be outside the scope of your article.
I tend to use a reverse sort step in all cases when I'm building this type of alternation, even when all substrings to be matched are known to be mutually exclusive or when the alternation will be anchored in such a way that alternation match length is not a consideration. (Of course, if the programer wants a shortest match alternation, that's an entirely different question, and also a point upon which you've touched.)
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
So a standard reverse sort expression serves in all cases.
As I /msg'ed you at the time you posted this, I had a doubt about this - and now I've finally found the time to confirm that I wasn't being entirely paranoid ;-) Under locale, it's possible for longer strings to be sorted before shorter ones. Here are the instructions to reproduce (on a Debian-based system):
$ mkdir -v /tmp/localetest; cd /tmp/localetest
# copy "en_US" and the files it refers to via its "copy" statements
$ cp -v /usr/share/i18n/locales/{en_US,en_GB,i18n,iso14651_t1,iso14651
+_t1_common} .
$ mv -v en_US en_TESTING
Now edit the file en_TESTING and insert the following into the LC_COLLATE section after the copy "iso14651_t1" statement:
collating-element <bc> from "<U0062><U0063>"
script <TESTING>
order_start <TESTING>;forward;forward;forward;forward,position
<bc> "<a><c>";"<BAS><BAS>";"<MIN><MIN>";IGNORE
order_end
Now compile and test the locale:
$ localedef -i en_TESTING -f UTF-8 -c ./en_TESTING.UTF-8
$ LOCPATH=/tmp/localetest LC_ALL=en_TESTING.UTF-8 perl -Mlocale -le 'p
+rint for sort qw/ab a b bc/'
a
ab
bc
b
$ LOCPATH=/tmp/localetest LC_ALL=en_TESTING.UTF-8 perl -wMstrict -lMlo
+cale
print "1 $_" for "abcabbc"=~/${\join "|", reverse sort qw(b bc) }/g;
print "2 $_" for "abcabbc"=~/${\join "|", sort {length$b<=>length$a} q
+w(b bc) }/g;
__END__
1 b
1 b
1 b
2 bc
2 b
2 bc
Now of course the real locale definitions are long and I don't plan on reading all of them, and so it's entirely possible that currently no locales exist that define this kind of sorting order. But I'll just be a bit paranoid and stick with sorting by length explicitly :-)
Update: I did incorporate your suggestion about highlighting the fact that strings are matched literally, thank you! | [reply] [d/l] [select] |
|
In all my CS/IS career, I've been able get away with taking no more than a couple of quick steps into locale-land and then turning and scurrying back to safety. I will have to do a lot more reading and experimenting to be able to respond meaningfully to what you've posted, but it looks... interesting.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: [RFC] Building Regex Alternations Dynamically
by LanX (Saint) on Jan 18, 2017 at 13:56 UTC
|
it's almost TL;DR, so two separate posts my have been better. :)
Just three remarks:
- IIRC there is a hard limit for the length of a regex-string ... (or was it only the repetition counter .{x} limited to x aprox 32000 ?)
- "newer" Perls come with something called trie-optimization which will considerably speed up such or-clauses.
- Trie-optimization has definitely a hard limit, where it falls back to (much slower) old behaviour.
| [reply] [d/l] |
|
Hi LanX,
Thanks for your comments, I'll think about whether I can provide a TL;DR version - my thought was that impatient users could just copy-n-paste my code :-)
IIRC there is a hard limit for the length of a regex-string
I'm not sure, I'll have to look into it, but I do know this works :-)
my $word_regex = do {
open my $wfh, '<:utf8', '/usr/share/dict/words' or die $!;
chomp( my @words = <$wfh> );
close $wfh;
my $re = join '|',
map {quotemeta}
sort { length $b <=> length $a or $a cmp $b }
@words;
qr/$re/i };
print length($word_regex), "\n";
__END__
964836
Thanks, -- Hauke D | [reply] [d/l] |
|
> > IIRC there is a hard limit for the length of a regex-string
Maybe not in the string length but for sure in the processing.
something like /x{33000}/ will result in
Quantifier in {,} bigger than 32766 in regex; ...
But that's not exactly relevant for your case :)
| [reply] [d/l] [select] |
|
Hi Rolf,
IIRC there is a hard limit for the length of a regex-string
Other than RAM, I haven't found one yet, I just built a qr// regex from a 1GB text file without alternations and a second from 40MB of text with over 10M alternations.
As you may have seen I added a TL;DR, thanks for the idea!
Regards, -- Hauke D
| [reply] [d/l] |
|
Yeah sorry I had
> > (or was it only the repetition counter .{x} limited to x aprox 32000 ?)
in mind. Since regexes are compiled to op codes hard limits are unlikely.
> As you may have seen I added a TL;DR, thanks for the idea!
Hmm probably it's time to include support for TOC = table of content in my wiki add on :)
| [reply] |
Re: Building Regex Alternations Dynamically
by jo37 (Deacon) on May 17, 2020 at 09:46 UTC
|
It took some time until I discovered this jewel.
It's great. Thanks a lot!
One thought about the usage of map: Here it needs to be in list context, which is a bit odd when declaring a scalar.
In a version that has two characters more to type this can be avoided:
my $regex = sub { qr/@_/ }->(
join '|',
map {quotemeta}
sort { length $b <=> length $a }
@values);
print "$regex\n";
Greetings, -jo
$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
| [reply] [d/l] [select] |
|
| [reply] [d/l] |
Re: Building Regex Alternations Dynamically (fusselkerl)
by shmem (Chancellor) on Jun 16, 2020 at 17:05 UTC
|
#!/usr/bin/perl
# file fusselkerl
use strict;
die "usage: $0 pattern [file]\n" unless @ARGV;
my $p;
{
my (%s, %i);
my $d = my $c = 1; # our regexp will be inside parens, so first back
+ref is 2
$p = join (
"",
map {
if($s{$_}++){
"\\".$i{$_}
}
else{
$i{$_}=++$c;
$c>$d+1 ? '(?!'.join('|',map{"\\".abs}-$c+1..-$d-1).")(\\w)" :
+ "(\\w)";
}
} split //, shift
);
}
print '(',$p,")\n";
local $" = ', ';
my %s;
while (<>) {
my @l = ();
while (/\b($p)\b/g) {
push @l, $1 unless $s{$1}++;
}
print "@l\n" if @l;
}
try
perl fusselkerl fusselkerl /usr/share/dict/words
edit: oh, I posted that already years ago... Re: How to use a negative backreference in regex?, sorry ;)
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
| [reply] [d/l] [select] |