Re: Pattern Identification

"Fast and efficient" means, of course, Benchmark, which I'll leave up to you. Also, I'm not sure exactly how many patterns a "whole load" is. At some point this approach will fail (as will anything else, for that matter; everything has a breaking point).

#!/usr/bin/perl

# http://perlmonks.org/?node_id=1200434

use strict;
use warnings;
use re 'eval';

my $patterns = <<'END';
^\d{2}.\d{2}.\d{2}$      date
^\d{2}.\d{2}.\d{4}$      date
^[A-Z]{2}\d{9}[A-Z]{2}$  Royal Mail Track & Trace code
^\d{16}$                 visa card
^\d{13}$                 EAN-13 barcode
END

my $regex;

sub patternidentification
  {
  if( not defined $regex )
    {
    ##################### build a single regex just once

    my $all = join '|',
      map { /^(\S+)\s++(.+)/ ? "(?:$1(?{'$2'}))" : die "bad pattern $_" }
      split /\n/, $patterns;

    $regex = qr/$all/;
    }
  return /$regex/ ? $^R : "unknown";
  }


##################### then try all matches

while(<DATA>)
  {
  chomp;
  my $answer = patternidentification($_);
  print "$_ is a $answer\n";
  }

__DATA__
12 12 17
09 30 2O17
09 30 2017
09 30 12017
123123123123123
1231231231231231
12312312312312312
456456456456
4564564564567
45645645645678
QW123456789WQ

Outputs:

12 12 17 is a date
09 30 2O17 is a unknown
09 30 2017 is a date
09 30 12017 is a unknown
123123123123123 is a unknown
1231231231231231 is a visa card
12312312312312312 is a unknown
456456456456 is a unknown
4564564564567 is a EAN-13 barcode
45645645645678 is a unknown
QW123456789WQ is a Royal Mail Track & Trace code
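For anyone who wants to try the Benchmark step, here is a minimal sketch. It assumes the core Benchmark module and compares the single-alternation approach with a plain loop over separate regexes. The names `identify_combined` and `identify_loop`, the sample list, and the iteration count are mine, not from the post, and the timings will depend on your perl and your data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same patterns as above, kept as separate (regex, label) pairs.
my @table = (
    [ qr/^\d{2}.\d{2}.\d{2}$/,     'date' ],
    [ qr/^\d{2}.\d{2}.\d{4}$/,     'date' ],
    [ qr/^[A-Z]{2}\d{9}[A-Z]{2}$/, 'Royal Mail Track & Trace code' ],
    [ qr/^\d{16}$/,                'visa card' ],
    [ qr/^\d{13}$/,                'EAN-13 barcode' ],
);

# One literal alternation; each branch leaves its label in $^R.
# The code blocks are literal here, so "use re 'eval'" is not needed.
my $combined = qr{
      ^\d{2}.\d{2}.\d{2}$       (?{'date'})
    | ^\d{2}.\d{2}.\d{4}$       (?{'date'})
    | ^[A-Z]{2}\d{9}[A-Z]{2}$   (?{'Royal Mail Track & Trace code'})
    | ^\d{16}$                  (?{'visa card'})
    | ^\d{13}$                  (?{'EAN-13 barcode'})
}x;

sub identify_combined {
    my ($str) = @_;
    return $str =~ $combined ? $^R : 'unknown';
}

sub identify_loop {
    my ($str) = @_;
    for my $entry (@table) {
        my ($re, $label) = @$entry;
        return $label if $str =~ $re;
    }
    return 'unknown';
}

# Hypothetical sample inputs, taken from the __DATA__ section above.
my @samples = ('12 12 17', '09 30 2O17', '09 30 2017', '1231231231231231',
               '4564564564567', '45645645645678', 'QW123456789WQ');

# Run each variant a fixed number of times and print a comparison table.
cmpthese(20_000, {
    combined => sub { identify_combined($_) for @samples },
    loop     => sub { identify_loop($_)     for @samples },
});
```

With only five patterns the difference may well be small either way; the comparison becomes more interesting as the pattern list grows.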

Re^2: Pattern Identification by WhiteTraveller (Novice) on Oct 01, 2017 at 12:24 UTC
Hi Tybalt89, thank you. I am going to have to go away and consider this, as I am not sure that I understand exactly how it works. You've concatenated all the different regexes into one string, while including the type string. The key points are "use re eval" and "map", neither of which I am familiar with. The latter appears to create a hash, which makes perfect sense, but I am going to have to understand what $all is all about before the penny drops.
Re^3: Pattern Identification by haukex (Archbishop) on Oct 01, 2017 at 15:47 UTC
In general, map blocks can be rewritten like the following (I'm simplifying a little bit), although one would usually use a lexical loop variable like `for my $elem (@in) ...` instead of `$_`: `my @out = map { ... } @in;` becomes `my @out; for $_ (@in) { my @result = ...; push @out, @result; }`.

The regex `/^(\S+)\s++(.+)/` is splitting the input string on the first whitespace (it is equivalent to `my ($left,$right) = split /\s+/, $str, 2;`, see split). Using the ternary `?:` operator, if the regex matches, the block of code will return the string `"(?:$1(?{'$2'}))"`, and if it doesn't match, die is called. So in this case the map operation is not returning a hash (or a list of key-value pairs), but just one output string for each input string, each input string being one line of the pattern list.

So with the `join '|', ...`, as you said, the code is constructing a single regex. The general process of doing so is something I discussed in my tutorial Building Regex Alternations Dynamically, but this one is a bit more specialized. For the names, tybalt89 is using a neat trick with `(?{...})`, which allows you to insert arbitrary code into a regular expression; the return value of the most recently executed code block is then stored in the special variable `$^R`. The `use re 'eval';` is necessary because these `(?{})` blocks are being interpolated from strings into the regex; requiring the pragma for that is a security feature of Perl.

Consider this regex (I'm using the `/x` modifier for readability): `m{ ^[a-zA-Z]\w+$ (?{'one'}) | ^[0-9]\w+$ (?{'two'}) }x`. When matching against the string `"3abc"`, it will match the second alternative, that is, `^[0-9]\w+$`, and then it will execute `(?{'two'})`, and since the last value in that piece of code is `'two'`, that is what it returns and what `$^R` gets set to. After the regex has executed and matched successfully, you can simply look at the value of `$^R` to see which of the two patterns contained in the regex was matched.

Minor edits for clarity.
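The two-alternative regex described above can be tried on its own. This is only a sketch: the sub name `which_branch` and the test strings other than `"3abc"` are made up for illustration. Because the code blocks are written literally in the pattern rather than interpolated from a string, `use re 'eval';` should not be needed here.

```perl
use strict;
use warnings;

# Literal regex with one code block per alternative.
my $re = qr{ ^[a-zA-Z]\w+$ (?{'one'}) | ^[0-9]\w+$ (?{'two'}) }x;

# Return the label left in $^R by whichever alternative matched,
# or undef if nothing matched (hypothetical helper).
sub which_branch {
    my ($str) = @_;
    return $str =~ $re ? $^R : undef;
}

for my $str ('abc1', '3abc', '!?') {
    my $branch = which_branch($str) // 'no match';
    print "$str: $branch\n";
}
```

Note that `$^R` is only read after a successful match; after a failed match it may still hold a value from an earlier one.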
Re^3: Pattern Identification by AnomalousMonk (Archbishop) on Oct 01, 2017 at 15:27 UTC
... key points ... "map" [which] appears to create a hash ...

No "hash" (in the sense of an associative array) is created at any point. The critical effect of the map expression is to extract the format regex (`$1`) and descriptive text (`$2`) substrings from each data type specifier record and use them to build a sub-regex for each data type. The `(?{'$2'})` sub-sub-regex generates code that evaluates the descriptive text substring and returns it via the `$^R` regex special variable (see perlvar). All these sub-regexes are then concatenated together into one big alternation. You can just `print $regex, "\n";` and pick your way through the result to see the alternation of all the sub-regexes in all their glory.

Give a man a fish: `<%-{-{-{-<`
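To see the construction step by step, here is a small sketch that builds the alternation with an explicit loop instead of map and prints the intermediate string. It uses only two of the patterns from the original post; `@specs`, `build_alternation`, and `identify` are names I made up for illustration. Like the original, it assumes the descriptive text contains no single quote.

```perl
use strict;
use warnings;
use re 'eval';   # required: the (?{...}) blocks arrive via interpolation

# Two "pattern  label" records, in the same layout as the original list.
my @specs = (
    '^\d{13}$   EAN-13 barcode',
    '^\d{16}$   visa card',
);

# Turn each record into "(?:PATTERN(?{'LABEL'}))" and join with '|'.
sub build_alternation {
    my @lines = @_;
    my @parts;
    for my $line (@lines) {
        my ($pattern, $label) = split /\s+/, $line, 2;
        die "bad spec line: $line" unless defined $label && length $label;
        push @parts, "(?:$pattern(?{'$label'}))";
    }
    return join '|', @parts;
}

my $all = build_alternation(@specs);
print "$all\n";             # the string that gets compiled

my $regex = qr/$all/;       # compile the combined regex once

sub identify {
    my ($str) = @_;
    return $str =~ $regex ? $^R : 'unknown';
}

print "$_ is: ", identify($_), "\n" for '4564564564567', '45645645645678';
```

Printing `$all` before compiling it is usually easier to read than printing the compiled `$regex`, since the latter is wrapped in an extra `(?^:...)` group.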

