Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Building Regex Alternations Dynamically

by haukex (Abbot)
on Jan 18, 2017 at 12:57 UTC ( #1179840=perlmeditation: print w/replies, xml ) Need Help??

TL;DR: The two code samples below are working pieces of code that can be copied into your Perl script and adapted for your purposes.

I thought it might be useful to explain the technique of building regular expressions dynamically from a set of strings. Let's say you have a list of strings, like ('abc.', 'd$ef', 'gh|i') (that's a $ character, not a variable), and you want to build a regex that matches any of them, like /(?:abc\.|d\$ef|gh\|i)/ - note how the special characters are escaped with backslashes so they lose their special meanings and will be matched literally (more on this below). This also works well with s/search/replacement/ if you have a hash where the keys are the search strings and the values are the replacements, as I'll show below. If you're uncertain on some of the regex concepts used here, like alternations a|b and non-capturing groups (?:...), I recommend perlretut.

First, the basic code, which I explain below - note the numbering on the lines of code.

my @values = qw/ a ab. d ef def g|h /; my ($regex) = map { qr/$_/ } # 5. join '|', # 4. map {quotemeta} # 3. sort { length $b <=> length $a } # 2. @values; # 1. print "$regex\n"; # 6.
  1. We begin with the list of strings stored in the array @values. This could be any list, such as a literal qw/.../, or return values from functions, including keys or values.
  2. We sort the list so that the longer strings appear first. This is necessary because if we didn't do this and our regular expression was /foo|foobar/, then applied to the string "foobarfoofoobar", it would only match "foo" three times, and never "foobar". But if the regex is /foobar|foo/, then it would correctly match "foobar", "foo", and again "foobar".
  3. Next, we apply the quotemeta function to each string, which escapes any metacharacters that might have special meaning in a regex, such as . (dot, matches anything), | (alternation operator), or $ (anchor to end of line/string). In our example, we want the string "g|h" to be matched literally, and not to mean "match g or h". Unescaped metacharacters can also break the syntax of the regex, like stray opening parentheses or similar. Note that quotemeta is the same as using \Q...\E in a regex. As discussed here, you should only drop \Q...\E or quotemeta in the case that you explicitly want metacharacters in your input strings to be special, they come from a trusted source, and you are certain that your strings don't contain any characters that would break your regular expression or expose security holes!
  4. Then, we join the strings into one long string with the regex alternation operator | in between each string. The string returned by join in the example looks like this: ab\.|def|g\|h|ef|a|d
  5. This step compiles the regular expression using qr//. If you want to add modifiers such as /i (case-insensitive matching), this would be the place to do it, as in qr/$_/i. This line of code needs a bit of explanation: join from the previous step will return a single string, and so the map will evaluate its code block { qr/$_/ } once, with $_ being the string returned by join. The parentheses in my ($regex) = are required so that map will return the value from its code block (map in "list context"), instead of a count of the values (map in "scalar context"). Context in Perl is a topic for another tutorial.
  6. When we print the regular expression, we see that it has become this:

    (?^:ab\.|def|g\|h|ef|a|d)

    You can now use this precompiled regular expression anywhere, as explained in Compiling and saving regular expressions and perlop, such as:

    if ($input=~$regex) { print "It matches!\n"; } # or while ($input=~/($regex)/g) { print "Found string: $1\n"; }

    Note that the qr// operator has implicitly added a non-capturing group (?:...) around the regular expression. This is important when you want to use the regular expression we've just built as part of a larger expression. For example, if your input strings are qw/a b c/ and you write /^$regex$/, then what you probably meant is /^(?:a|b|c)$/. If the non-capturing group wasn't there, then the regex would look like this: /^a|b|c$/, which means "match a only at the beginning of the string, or b anywhere in the string, or c only at the end of the string", which is probably not what you meant!

Search and Replace Using a Hash

my %map = ( a=>1, ab=>23, cd=>45 ); # 1. my ($regex) = map { qr/$_/ } # 2. join '|', map {quotemeta} sort { length $b <=> length $a or $a cmp $b } # 3. keys %map; print "$regex\n"; # 4. # Now, use the regex my @strings = qw/ abcd aacd abaab /; # 5. for (@strings) { my $before = $_; s/($regex)/$map{$1}/g; # 6. print "$before -> $_\n"; # 7. }
  1. This is the hash in which the keys are the search strings, and the values are the replacements. As above, this can come from any source.
  2. This code to build the regex is mostly the same as the above, with the following difference:
  3. Instead of only sorting by length, this sort first sorts by length, and sorts values with the same length with a stringwise sort. While not strictly necessary, I would recommend this because hashes are unordered by default, meaning that your regex would be in a different order across different runs of the program. Sorting the hash keys like this causes the regex to be in the same order in every run of the program.
  4. We print the regex for debugging, and see that it looks like this: (?^:ab|cd|a)
  5. These @strings are the test strings we will apply the regular expression against.
  6. This is the search and replace operation that matches the keys of the hash, and as a replacement value gets the corresponding value from the hash. Note that the /g modifier is not strictly required (s///g will replace all matches in the string, not just the first), and you can adapt this regex any way you like. So for example, to only make one replacement anchored at the beginning of the string, you can say s/^($regex)/$map{$1}/;.
  7. The output of the code is:
    abcd -> 2345 aacd -> 1145 abaab -> 23123

Thank you to all those who replied to this post as well as this one, in particular thanks to kcott, LanX, AnomalousMonk, and Haarg, whose suggestions ended up in the above!

Hope this helps,
-- Hauke D

Updates: 2017-05-14: Merged in the draft text I previously had in this node, made several updates to the text, and removed the "RFC" tag from the title.

Replies are listed 'Best First'.
Re: [RFC] Building Regex Alternations Dynamically
by kcott (Chancellor) on Jan 19, 2017 at 10:11 UTC

    G'day Hauke,

    ++ This looks like a very good start; seems reasonably complete; and covers most of the points I might have made. I have comments on two areas, as follows.

    With any sort of tutorial, those reading it — to learn about the subject, rather than for reviewing, proof-reading, etc. — probably start with limited knowledge. Accordingly, any terms used should be unambiguous; unfortunately, you've used $regex to mean two different things:

    $regex = join ... $regex = qr/...

    I'm familiar with both the subject matter and the technique, so this posed no problem for me; however, for someone learning this, it may do. While it's reasonably obvious in the short code example, half a page later, in the middle of descriptive text, the appearance of $regex might not be as obvious to the student as it is to you or I. Consider renaming those; purely as a suggestion:

    $regex_base_str = join ... $regex_compiled = qr/...

    In points (4) & (5), in the first list, you show grouping. To resolve the same issue in both, you use explicit capture grouping in (4), and implicit non-capture grouping in (5).

    Regex pieces used for alternation often occur as part of a larger regex; in fact, I suspect that's the more usual case. This may be as simple as the anchor assertions you show in (4), or could be a lot more complex. I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it) as part of the normal technique. To demonstrate:

    # Simple case: OK - matches "a" or "b" $ perl -E 'my $re = "a|b"; $re = qr{$re}; say $re' (?^u:a|b) # Complex case: NOT OK - matches "Xa" or "bY" $ perl -E 'my $re = "a|b"; $re = qr{X${re}Y}; say $re' (?^u:Xa|bY) # Complex case: OK - matches "a" or "b" [fixed with "(?:...)"] $ perl -E 'my $re = "(?:a|b)"; $re = qr{X${re}Y}; say $re' (?^u:X(?:a|b)Y)

    — Ken

      Hi Ken,

      Thank you very much for your thoughtful reply!

      With any sort of tutorial, those reading it — to learn about the subject, rather than for reviewing, proof-reading, etc. — probably start with limited knowledge. Accordingly, any terms used should be unambiguous; unfortunately, you've used $regex to mean two different things

      Excellent point. I've renamed the variables to disambiguate (I kept the names shorter though), and I've added a link to perlretut.

      In points (4) & (5), in the first list, you show grouping. To resolve the same issue in both, you use explicit capture grouping in (4), and implicit non-capture grouping in (5).

      Another excellent point, I've switched to using the non-capturing groups in all the examples.

      I'd suggest adding explicit non-capturing grouping to $regex_base_str (or whatever you call it) as part of the normal technique.

      That is a very good point, but I haven't made the change yet because I need to think on it a bit more. On the one hand, I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be (qr/(?:a|b)/ eq "(?^:(?:a|b))" and on older Perls qr/(?:a|b)/ eq "(?-xism:(?:a|b))"), and also it makes the code to generate the string less elegant (my $regex_str = '(?:'.join('|', map ... ).')'). But those are just stylistic concerns and you're right that it would eliminate the pitfall that I have to discuss at length in points 4 and 5.

      The other potential solution, which I'm currently leaning towards, is to skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... I like this latter approach better because it's more robust (no string for the user to potentially misuse), but it does add one more "trick" that has to be explained to the beginner. Currently I feel that the advantages of that outweigh the disadvantages...

      Update 2, 2017-05-14: This node used to contain a draft, which I've now incorporated into the root node. I wanted to preserve the original text here:

      Thanks,
      -- Hauke D

        (The following is perhaps a bit OT to the main thread, or else already touched upon in an update. Oh, well...)

        ... I think that adding an extra (?:...) makes the generated regex look a little more complex than it needs to be ... those are just stylistic concerns ...

        Please see Re: Recognizing 3 and 4 digit number and thereunder for a long discussion between myself and kcott on these "stylistic concerns." Personally, I still don't see the need for the extra explicit  (?:...) wrap step. The implicit wrap becomes explicit quick enough if you print the stringized Regexp object, and this feature of a Regexp object should be deeply understood from the moment one begins to use them.

        ... skip the intermediate string variable, like in Haarg's post here: my ($regex) = map {qr/$_/} join '|', map .... ... it's more robust ...

        To continue the previous point, I feel it's important to get a regex into a Regexp form as quickly as possible: no dilly-dallying. Once objectified, it can be used atomically when composing more complex regex expressions:

        my $rx = qr{ ... }xms; ... $string =~ m{ $rx* $rx+? $rx{3,4} }xms and do_something();
        (Of course, this compositional capability is also addressed by the  (DEFINE) predicate of the  (?(condition)yes-pattern) conditional expression of Perl 5.10+.)

        The only situation which I'm aware of in which this compositional atomicity breaks down is for something like

        my ($n, $m) = (3, 4); ... $string =~ m{ $rx{$n,$m} }xms and do_something();
        where  $rx{$n} $rx{$n,} $rx{$n,$m} are all taken as hash element accesses. This can be fixed simply by an explicit layer of non-capturing group wrapping (entirely necessary here!):
            (?:$rx){$n} (?:$rx){$n,} (?:$rx){$n,$m}


        Give a man a fish:  <%-{-{-{-<

Re: [RFC] Building Regex Alternations Dynamically
by LanX (Bishop) on Jan 18, 2017 at 13:56 UTC
    it's almost TL;DR, so two separate posts my have been better. :)

    Just three remarks:

    • IIRC there is a hard limit for the length of a regex-string ... (or was it only the repetition counter .{x} limited to x aprox 32000 ?)
    • "newer" Perls come with something called trie-optimization which will considerably speed up such or-clauses.
    • Trie-optimization has definitely a hard limit, where it falls back to (much slower) old behaviour.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Hi LanX,

      Thanks for your comments, I'll think about whether I can provide a TL;DR version - my thought was that impatient users could just copy-n-paste my code :-)

      IIRC there is a hard limit for the length of a regex-string

      I'm not sure, I'll have to look into it, but I do know this works :-)

      my $word_regex = do { open my $wfh, '<:utf8', '/usr/share/dict/words' or die $!; chomp( my @words = <$wfh> ); close $wfh; my $re = join '|', map {quotemeta} sort { length $b <=> length $a or $a cmp $b } @words; qr/$re/i }; print length($word_regex), "\n"; __END__ 964836

      Thanks,
      -- Hauke D

        > > IIRC there is a hard limit for the length of a regex-string

        Maybe not in the string length but for sure in the processing.

        something like /x{33000}/ will result in

        Quantifier in {,} bigger than 32766 in regex; ...

        But that's not exactly relevant for your case :)

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

      Hi Rolf,

      IIRC there is a hard limit for the length of a regex-string

      Other than RAM, I haven't found one yet, I just built a qr// regex from a 1GB text file without alternations and a second from 40MB of text with over 10M alternations.

      As you may have seen I added a TL;DR, thanks for the idea!

      Regards,
      -- Hauke D

        Yeah sorry I had

        > > (or was it only the repetition counter .{x} limited to x aprox 32000 ?)

        in mind. Since regexes are compiled to op codes hard limits are unlikely.

        > As you may have seen I added a TL;DR, thanks for the idea!

        Hmm probably it's time to include support for TOC = table of content in my wiki add on :)

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: [RFC] Building Regex Alternations Dynamically
by duelafn (Vicar) on Jan 18, 2017 at 16:14 UTC
Re: [RFC] Building Regex Alternations Dynamically
by Discipulus (Monsignor) on Jan 19, 2017 at 08:33 UTC
    double (if possible) ++haukex and immediatlely bookmarked!

    In my opinion is not too much to read. Even if there are two arguments discussed they are similar and explained very well in such a plain way that everyone (including me!) can understand. The order of comments let's even a beginner to follow the flow of regex building.



    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: [RFC] Building Regex Alternations Dynamically
by AnomalousMonk (Chancellor) on Jan 19, 2017 at 19:08 UTC
    Let's say you have a list of strings, like ("abc", "def", "ghi") ...

    I think it's very important to highlight the fact that these strings are assumed to be literal strings or their equivalent, and cannot contain regex "operators" per se. Perhaps "Let's say you have a list of literal strings or their moral equivalent, like ..." You make this point further on in point 3 in the first group of bullet points, but I think it should be emphasized early and often.

    sort { length $b <=> length $a }  # 2.
    sort { length $b <=> length $a
           or $a cmp $b }             # 3.

    Your discussion of sorting by length in an alternation of this kind is critical. I feel the desired end can be achieved in a more uniform way, although it's a bit more tricky to explain.

    Say we have the set  qw(ab abc abcd wx wxy wxyz) of literal strings we wish to build into an alternation. To build a "longest match" ordered alternation, it's essential to have  abcd appear before  abc and  abc before  ab in the alternation as you've explained, but it's irrelevant where  wx wxy wxyz appear relative to the former subgroup. The reason is that nothing in the literal  wx wxy wxyz subgroup can possibly match anything that the literal  ab abc abcd subgroup matches and vice versa. So a standard
        reverse sort
    expression serves in all cases.

    c:\@Work\Perl\monks>perl -wMstrict -le "my ($alt) = map qr{$_}xms, join '|', map quotemeta, reverse sort qw(wx ab wxy abc wxyz abcd) ; print $alt; " (?^msx:wxyz|wxy|wx|abcd|abc|ab)
    Does it ever matter that members of the  wx wxy wxyz subgroup fall before, after, or within the  ab abc abcd subgroup? Never. The regex
        (?^msx:abcd|wxyz|wxy|abc|ab|wx)
    is functionally equivalent (in terms of longest matching) to
        (?^msx:wxyz|wxy|wx|abcd|abc|ab)

    In addition, there's the whole  $b/$a versus  $a/$b issue in controlling descending versus ascending sorting. This is very easy to screw up and can be very hard to see to fix;  reverse sort eliminates this ambiguity.

    A situation in which a performance difference might emerge is if it were known that, e.g.,  ab was overwhelmingly more likely to occur than  wx and so should be tested first (along with all its cohort) to avoid needlessly burning computrons in a heavy-duty matching situation:
        (?^msx:abcd|abc|ab|wxyz|wxy|wx)
    But this is a situation that requires carefully hand-crafting a regex, and may be outside the scope of your article.

    I tend to use a  reverse sort step in all cases when I'm building this type of alternation, even when all substrings to be matched are known to be mutually exclusive or when the alternation will be anchored in such a way that alternation match length is not a consideration. (Of course, if the programer wants a shortest match alternation, that's an entirely different question, and also a point upon which you've touched.)


    Give a man a fish:  <%-{-{-{-<

      So a standard reverse sort expression serves in all cases.

      As I /msg'ed you at the time you posted this, I had a doubt about this - and now I've finally found the time to confirm that I wasn't being entirely paranoid ;-) Under locale, it's possible for longer strings to be sorted before shorter ones. Here are the instructions to reproduce (on a Debian-based system):

      $ mkdir -v /tmp/localetest; cd /tmp/localetest # copy "en_US" and the files it refers to via its "copy" statements $ cp -v /usr/share/i18n/locales/{en_US,en_GB,i18n,iso14651_t1,iso14651 +_t1_common} . $ mv -v en_US en_TESTING

      Now edit the file en_TESTING and insert the following into the LC_COLLATE section after the copy "iso14651_t1" statement:

      collating-element <bc> from "<U0062><U0063>" script <TESTING> order_start <TESTING>;forward;forward;forward;forward,position <bc> "<a><c>";"<BAS><BAS>";"<MIN><MIN>";IGNORE order_end

      Now compile and test the locale:

      $ localedef -i en_TESTING -f UTF-8 -c ./en_TESTING.UTF-8 $ LOCPATH=/tmp/localetest LC_ALL=en_TESTING.UTF-8 perl -Mlocale -le 'p +rint for sort qw/ab a b bc/' a ab bc b $ LOCPATH=/tmp/localetest LC_ALL=en_TESTING.UTF-8 perl -wMstrict -lMlo +cale print "1 $_" for "abcabbc"=~/${\join "|", reverse sort qw(b bc) }/g; print "2 $_" for "abcabbc"=~/${\join "|", sort {length$b<=>length$a} q +w(b bc) }/g; __END__ 1 b 1 b 1 b 2 bc 2 b 2 bc

      Now of course the real locale definitions are long and I don't plan on reading all of them, and so it's entirely possible that currently no locales exist that define this kind of sorting order. But I'll just be a bit paranoid and stick with sorting by length explicitly :-)

      Update: I did incorporate your suggestion about highlighting the fact that strings are matched literally, thank you!

        In all my CS/IS career, I've been able get away with taking no more than a couple of quick steps into locale-land and then turning and scurrying back to safety. I will have to do a lot more reading and experimenting to be able to respond meaningfully to what you've posted, but it looks... interesting.


        Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://1179840]
Front-paged by Athanasius
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (5)
As of 2017-12-18 18:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What programming language do you hate the most?




















    Results (495 votes). Check out past polls.

    Notices?