PerlMonks  

Re: Pattern Identification

by tybalt89 (Monsignor)
on Oct 01, 2017 at 00:05 UTC ( [id://1200435]=note )


in reply to Pattern Identification

"fast and efficient" means, of course, Benchmark, which I'll leave up to you. Also, I'm not quite sure exactly how many is a "whole load". At some point this (or anything else, for that matter, everything has a breaking point) will fail.

#!/usr/bin/perl

# http://perlmonks.org/?node_id=1200434

use strict;
use warnings;
use re 'eval';

my $patterns = <<'END';
^\d{2}.\d{2}.\d{2}$        date
^\d{2}.\d{2}.\d{4}$        date
^[A-Z]{2}\d{9}[A-Z]{2}$    Royal Mail Track & Trace code
^\d{16}$                   visa card
^\d{13}$                   EAN-13 barcode
END

my $regex;

sub patternidentification
  {
  if( not defined $regex )
    { ##################### build a single regex just once
    my $all = join '|', map
      {
      /^(\S+)\s++(.+)/ ? "(?:$1(?{'$2'}))" : die "bad pattern $_"
      } split /\n/, $patterns;
    $regex = qr/$all/;
    }
  return /$regex/ ? $^R : "unknown";
  }

##################### then try all matches

while(<DATA>)
  {
  chomp;
  my $answer = patternidentification($_);
  print "$_ is a $answer\n";
  }

__DATA__
12 12 17
09 30 2O17
09 30 2017
09 30 12017
123123123123123
1231231231231231
12312312312312312
456456456456
4564564564567
45645645645678
QW123456789WQ

Outputs:

12 12 17 is a date
09 30 2O17 is a unknown
09 30 2017 is a date
09 30 12017 is a unknown
123123123123123 is a unknown
1231231231231231 is a visa card
12312312312312312 is a unknown
456456456456 is a unknown
4564564564567 is a EAN-13 barcode
45645645645678 is a unknown
QW123456789WQ is a Royal Mail Track & Trace code
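For the Benchmark part that was left open, here is one way it might be set up. This is only a sketch under assumptions: the names by_alternation and by_loop, the shortened pattern table, the sample strings and the iteration count are all made up for illustration, and no timings are claimed. It compares the single combined alternation against trying each precompiled pattern in turn.

```perl
use strict;
use warnings;
use re 'eval';
use Benchmark qw(cmpthese);

# Shortened pattern table (a subset of the one in the post).
my @table = (
    [ '^\d{2}.\d{2}.\d{4}$', 'date' ],
    [ '^\d{16}$',            'visa card' ],
    [ '^\d{13}$',            'EAN-13 barcode' ],
);

# Variant A (hypothetical name): one combined alternation, names via $^R.
my $all = join '|', map {
    my ($re, $name) = @$_;
    "(?:$re(?{'$name'}))"
} @table;
my $combined = qr/$all/;

sub by_alternation {
    my ($str) = @_;
    return $str =~ $combined ? $^R : 'unknown';
}

# Variant B (hypothetical name): try each precompiled pattern in turn.
my @compiled = map {
    my ($re, $name) = @$_;
    [ qr/$re/, $name ]
} @table;

sub by_loop {
    my ($str) = @_;
    for my $entry (@compiled) {
        return $entry->[1] if $str =~ $entry->[0];
    }
    return 'unknown';
}

# Made-up sample inputs; use data that resembles your real input.
my @samples = ('09 30 2017', '1231231231231231', '4564564564567', 'QW123');

# Fixed iteration count, chosen arbitrarily; prints a comparison table.
cmpthese(200_000, {
    alternation => sub { by_alternation($_) for @samples },
    loop        => sub { by_loop($_)        for @samples },
});
```

Which variant wins will depend on the Perl version, the number of patterns and the mix of inputs, so it is worth running with your own data rather than trusting any general rule.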

Replies are listed 'Best First'.
Re^2: Pattern Identification
by WhiteTraveller (Novice) on Oct 01, 2017 at 12:24 UTC

    Hi Tybalt89

    Thank you. I am going to have to go away and consider this, as I am not sure that I understand exactly how this is working. You've concatenated all the different regex expressions into one string, whilst including the type string. The key points are "use re eval" and "map" -- neither of which I am familiar with. The latter appears to create a hash, which makes perfect sense, but I am going to have to understand what $all is all about before the penny drops.

      In general, map blocks can be rewritten like the following (I'm simplifying a little bit), although one would usually use a lexical loop variable like for my $elem (@in) ... instead of $_.

          my @out = map { ... } @in;
          # - becomes -
          my @out;
          for $_ (@in) {
              my @result = ...;
              push @out, @result;
          }
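To make that rewrite concrete, here is a small runnable sketch. The input list and the "times ten" transformation are made up for illustration; the point is only that the map block and the explicit loop build the same list, including the case where the block returns more than one value per element.

```perl
use strict;
use warnings;

my @in = (1, 2, 3);

# map form: the block may return any number of values per element.
my @out_map = map { ($_, $_ * 10) } @in;

# The same thing as an explicit loop with a lexical loop variable.
my @out_loop;
for my $elem (@in) {
    my @result = ($elem, $elem * 10);
    push @out_loop, @result;
}

print "@out_map\n";    # 1 10 2 20 3 30
print "@out_loop\n";   # 1 10 2 20 3 30
```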

      The regex /^(\S+)\s++(.+)/ splits the input string at the first run of whitespace (it is roughly equivalent to my ($left,$right) = split /\s+/, $str, 2;, see split). Using the ternary ?: operator, if the regex matches, the block returns the string "(?:$1(?{'$2'}))", and if it doesn't match, die is called. So in this case the map operation is not returning a hash (or a list of key-value pairs), but just one output string per input string, each input string being one line of the pattern table.
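A quick way to convince yourself of the regex/split correspondence is to run both on one line of the pattern table. This is a sketch; the variable names are made up, and the two forms only agree for well-formed lines (no leading whitespace, something after the separator).

```perl
use strict;
use warnings;

# One line from the pattern table in the original post.
my $str = '^\d{13}$    EAN-13 barcode';

# Capture-group form, as used inside the map block.
my ($re_left, $re_right) = $str =~ /^(\S+)\s++(.+)/
    or die "bad pattern $str";

# split form: at most two fields, separated at the first whitespace run.
my ($left, $right) = split /\s+/, $str, 2;

print "regex: [$re_left] [$re_right]\n";
print "split: [$left] [$right]\n";
```

The regex form has the advantage that a malformed line (for example one with no description) fails to match and can be reported with die, whereas split would quietly return an undefined second field.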

      So with the join '|', ..., the code is, as you said, constructing a single regex. The general process of doing so is something I discussed in my tutorial Building Regex Alternations Dynamically, but this one is a bit more specialized. For the names, tybalt89 is using a neat trick with (?{...}), which allows you to insert arbitrary code into a regular expression; the return value of the most recently executed code block is then stored in the special variable $^R. The use re 'eval'; is necessary because these (?{...}) blocks are being interpolated from strings into the regex at run time, which Perl refuses to do by default as a security feature.

      Consider this regex (I'm using the /x modifier for readability): m{ ^[a-zA-Z]\w+$ (?{'one'}) | ^[0-9]\w+$ (?{'two'}) }x. When matching against the string "3abc", it will match the second alternative, that is, ^[0-9]\w+$, and then it will execute (?{'two'}). Since the last value in that piece of code is 'two', that is what it returns and what $^R gets set to. After the regex has executed and matched successfully, you can simply look at the value of $^R to see which of the two patterns contained in the regex was matched.
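Here is that two-alternative regex as a runnable sketch. The wrapper sub classify and the 'none' fallback are made up for illustration. Because the (?{...}) blocks are written literally in the pattern rather than interpolated from a string, no use re 'eval'; is needed here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original post.
sub classify {
    my ($str) = @_;
    # $^R holds the value of the last (?{...}) block that was executed
    # on the successful path through the regex.
    return $str =~ m{ ^[a-zA-Z]\w+$ (?{'one'}) | ^[0-9]\w+$ (?{'two'}) }x
        ? $^R
        : 'none';
}

print "$_ => ", classify($_), "\n" for '3abc', 'abc3', '!?';
```

"3abc" is reported as two, "abc3" as one, and "!?" falls through to the fallback because neither alternative matches.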

      Minor edits for clarity.

      ... key points ... "map" [which] appears to create a hash ...

      No "hash" (in the sense of an associative array) is created at any point. The critical effect of the map expression is to extract the format regex ($1) and descriptive text ($2) substrings from each data type specifier record and use them to build a sub-regex for each data type. After interpolation, the (?{'$2'}) construct becomes a code block that evaluates to the descriptive text and returns it via the $^R regex special variable (see perlvar). All these sub-regexes are then joined into one big alternation.

      You can just
          print $regex, "\n";
      and pick your way through the result to see the alternation of all the sub-regexes in all their glory.
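If you would rather experiment outside the full script, a cut-down sketch like this builds the alternation from just two lines of the pattern table and prints it. The @lines array is made up for illustration; the map block is the one from the original code.

```perl
use strict;
use warnings;
use re 'eval';   # needed: the (?{...}) blocks arrive via string interpolation

# Two lines taken from the pattern table in the original post.
my @lines = ('^\d{16}$    visa card', '^\d{13}$    EAN-13 barcode');

my $all = join '|', map {
    /^(\S+)\s++(.+)/ ? "(?:$1(?{'$2'}))" : die "bad pattern $_"
} @lines;
my $regex = qr/$all/;

# Shows each sub-regex with its (?{'...'}) block, joined by '|'.
print $regex, "\n";
```

The printed form is wrapped in whatever grouping your Perl version adds when it stringifies a qr// object, so expect some extra parentheses and flags around the alternation itself.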


      Give a man a fish:  <%-{-{-{-<
