WhiteTraveller has asked for the wisdom of the Perl Monks concerning the following question:
Hi all. I've got a problem here, and I am not at all sure how I should be approaching it.
What I have is a whole load of defined formats for specific data types. For example:
^\d{2}.\d{2}.\d{2}$ date
^\d{2}.\d{2}.\d{4}$ date
^[A-Z]{2}\d{9}[A-Z]{2}$ Royal Mail Track & Trace code
^\d{16}$ visa card
^\d{13}$ EAN-13 barcode
...
What I want to do is to pass a string into a subroutine, where the string is compared to the definition list, and to return a value indicating what the string is, or represents. Obviously, the list above is very simplistic, as not every 13 digit number will be a valid barcode -- so there will be additional checks performed if it matches, so that it will also recognise and differentiate between subtypes (ISBN numbers, for example).
What I would like to know is how I should be doing this whilst ensuring that it is reasonably fast and efficient. Help would be very much appreciated.
Re: Pattern Identification
by tybalt89 (Monsignor) on Oct 01, 2017 at 00:05 UTC
|
"fast and efficient" means, of course, Benchmark, which I'll leave up to you. Also, I'm not quite sure
exactly how many is a "whole load". At some point this (or anything else, for that matter, everything has a breaking point) will fail.
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1200434
use strict;
use warnings;
use re 'eval';
my $patterns = <<'END';
^\d{2}.\d{2}.\d{2}$ date
^\d{2}.\d{2}.\d{4}$ date
^[A-Z]{2}\d{9}[A-Z]{2}$ Royal Mail Track & Trace code
^\d{16}$ visa card
^\d{13}$ EAN-13 barcode
END
my $regex;
sub patternidentification
{
if( not defined $regex )
{
##################### build a single regex just once
my $all = join '|',
map { /^(\S+)\s++(.+)/ ? "(?:$1(?{'$2'}))" : die "bad pattern $_
+" }
split /\n/, $patterns;
$regex = qr/$all/;
}
return /$regex/ ? $^R : "unknown";
}
##################### then try all matches
while(<DATA>)
{
chomp;
my $answer = patternidentification($_);
print "$_ is a $answer\n";
}
__DATA__
12 12 17
09 30 2O17
09 30 2017
09 30 12017
123123123123123
1231231231231231
12312312312312312
456456456456
4564564564567
45645645645678
QW123456789WQ
Outputs:
12 12 17 is a date
09 30 2O17 is a unknown
09 30 2017 is a date
09 30 12017 is a unknown
123123123123123 is a unknown
1231231231231231 is a visa card
12312312312312312 is a unknown
456456456456 is a unknown
4564564564567 is a EAN-13 barcode
45645645645678 is a unknown
QW123456789WQ is a Royal Mail Track & Trace code
| [reply] [d/l] [select] |
|
Hi Tybalt89
Thank you. I am going to have to go away and consider this, as I am not sure that I understand exactly how this is working. You've concatenated all the different regex expressions into one string, whilst including the type string. The key points are "use re eval" and "map" -- neither of which I am familiar with. The latter appears to create a hash, which makes perfect sense, but I am going to have to understand what $all is all about before the penny drops.
| [reply] |
|
my @out = map { ... } @in;
# - becomes -
my @out;
for $_ (@in) {
my @result = ...;
push @out, @result;
}
The regex /^(\S+)\s++(.+)/ is splitting the input string on the first whitespace (it is equivalent to my ($left,$right) = split /\s+/, $str, 2;, see split). Using the ternary ?: operator, if the regex matches, the block of code will return the string "(?:$1(?{'$2'}))", and if it doesn't match, die is called. So in this case the map operation is not returning a hash (or a list of key-value pairs), but just one output string for each input string, the input strings being one line of the regexes each.
So with the join '|', ..., as you said the code is constructing a single regex. The general process of doing so is something I discussed in my tutorial Building Regex Alternations Dynamically, but this one is a bit more specialized. For the names, tybalt89 is using a neat trick using (?{...}), which allows you to insert arbitrary code into a regular expression, the return value of the most recent code is then stored in the special variable $^R. The use re 'eval'; is necessary because these (?{}) blocks are being interpolated from strings into the regex, so this is a security feature of Perl.
Consider this regex (I'm using the /x modifier for readability): m{ ^[a-zA-Z]\w+$ (?{'one'}) | ^[0-9]\w+$ (?{'two'}) }x. When matching against the string "3abc", it will match the second alternation, that is, ^[0-9]\w+$, and then it will execute (?{'two'}), and since the last value in that piece of code is 'two', that is what it returns and what $^R gets set to. After the regex has executed and matched successfully, you can simply look at the value of $^R to see which of the two patterns contained in the regex were matched.
Minor edits for clarity. | [reply] [d/l] [select] |
|
... key points ... "map" [which] appears to create a hash ...
No "hash" (in the sense of an associative array) is created at any point. The critical effect of the map expression is to extract the format regex ($1) and descriptive text ($2) substrings from each data type specifier record and use them to build a sub-regex for each data type. The (?{'$2'}) sub-sub-regex generates code that evaluates the descriptive text substring and returns it via the $^R regex special variable (see perlvar). All these sub-regexes are then concatenated together into one big alternation.
You can just
print $regex, "\n";
and pick your way through the result to see the alternation of all the sub-regexes in all their glory.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Pattern Identification
by afoken (Chancellor) on Oct 01, 2017 at 09:44 UTC
|
^[A-Z]{2}\d{9}[A-Z]{2}$ Royal Mail Track & Trace code
^\d{16}$ visa card
^\d{13}$ EAN-13 barcode
It seems you are working with a barcode scanner. All barcode scanners I've seen so far can be configured to add prefixes or suffixes to indicate the type of barcode scanned. This could be used here. By enabling a barcode type prefix, you could better check for what has been scanned: A 13-digit number is just junk unless it is prefixed with the "this is a EAN barcode" prefix, in that case, it is really an EAN 13 barcode. If you see instead e.g. a Codabar prefix and a 13 digit number, whatever you scanned was not a EAN-13.
This works extremely well, I know this feature is used in blood banks to prevent misreadings of wrong labels.
The first manual I found (https://www.prosoft.ru/cms/f/426190.pdf) contains a long list of barcodes and identifier prefixes on page 25 (PDF page 26), one "AIM standard" ("]" + one upper case letter + one digit) and one "Datalogic Standard" (singe upper or lower case letter), and it suggests that you can reconfigure the codes.
The list shows several different identifiers for EAN, as you can have EANs in 8 or 13 digits, and they can have two different, mutual exlusive add-ons, resulting in six identifiers.
Another thing to consider is to configure the (USB) barcode scanner not to emulate a keyboard, but instead to present itself as a serial device. This prevents manual "scanning" by typing in data via the keyboard. The application that uses the scanner needs to be changed a little bit, but it avoids the largest source of errors: The human using the computer. As a nice side effect, it also avoids problems with Caps Lock, Num Lock, and non-US keyboard layouts.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
| [reply] [d/l] |
|
| [reply] |
Re: Pattern Identification
by AnomalousMonk (Archbishop) on Oct 01, 2017 at 02:00 UTC
|
Note that . (dot) is a metacharacter | regex metacharacter (matching anything except a newline unless the /s regex modifier is asserted, in which case it matches anything), so tybalt89's solution also identifies '12x12y17' and '09_30^2017' as dates.
You need to decide how dot is to be interpreted in your specifiers because another step seems to be needed (unless this is one of the downstream checks you envision).
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Pattern Identification
by BillKSmith (Monsignor) on Oct 01, 2017 at 13:31 UTC
|
The following form, suggested by the book "Perl Best Practices" (ISBN 0596001738), is easy to understand and maintain. If speed is truly important, it should be a candidate in your benchmark.
use strict;
use warnings;
sub string_type {
$_ = shift;
my $type
# format ? type
= /^\d{2}.\d{2}.\d{2}$/ ? 'date'
: /^\d{2}.\d{2}.\d{4}$/ ? 'date'
: /^[A-Z]{2}\d{9}[A-Z]{2}$/ ? 'Royal Mail Track & Trace code
+'
: /^\d{16}$/ ? 'visa card'
: /^\d{13}$/ ? 'EAN-13 barcode'
: 'unrecoginized string'
;
if ($type eq 'EAN-13 barcode') {
... # Handle sub-types
}
return $type;
}
| [reply] [d/l] |
Re: Pattern Identification
by TheDamian (Vicar) on Oct 01, 2017 at 20:44 UTC
|
Here's another approach, which I find clean and easy to maintain.
It uses named captures to label each alternative.
If the regex matches, the named capture hash (%+)
will have only a single key, which will be the label
of the alternative that matched.
This approach should be reasonably fast but, as others have said,
you'll definitely need to benchmark to find the most efficient
solution (which might even vary depending on which version
of Perl you're running).
use 5.010;
use strict;
use warnings;
my $CLASSIFIER = qr{
(?<date> \d{2}.\d{2}.\d{2} )
| (?<date> \d{2}.\d{2}.\d{4} )
| (?<RMTTC> [A-Z]{2}\d{9}[A-Z]{2} )
| (?<Visa> \d{16} )
| (?<EAN13> \d{13} )
}x;
while (my $input = <DATA>) {
chomp $input;
my $answer = $input =~ /^$CLASSIFIER$/ ? (keys %+)[0] : 'unknown';
print "$input is $answer\n";
}
__DATA__
12 12 17
09 30 2O17
09 30 2017
09 30 12017
123123123123123
1231231231231231
12312312312312312
456456456456
4564564564567
45645645645678
QW123456789WQ
| [reply] [d/l] [select] |
|
|