Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re^7: Regex Modification

by AnomalousMonk (Archbishop)
on Apr 13, 2013 at 20:07 UTC ( [id://1028549]=note: print w/replies, xml ) Need Help??


in reply to Re^6: Regex Modification
in thread Regex Modification

Below is an updated version of the regex. It is simplified a little, and an error is corrected. (Update: And it now matches something like 'x123-456789123-456x'.) I am still less than happy with it: it is over-complicated (Update: and it uses package variables), and it is not standalone because of its use of embedded capture groups that make it sensitive to the presence of other capture groups if it is used in combination with other regexes.

In any event, it works. Please see the embedded comments for a brief explanation of how the regex works, and see perlre and perlretut for more detailed info. The  m1() test function returns the number of matches in a string if called in scalar context, and a list of all the matching sub-strings if called in list context. If you have more questions, please let me know. As before, HTH.

Code:

use warnings FATAL => 'all' ; use strict; use Test::More # tests => ?? + 1 # Test::NoWarnings adds 1 test 'no_plan' ; use Test::NoWarnings; my @matches = ( '123 456 789', qw(123.456.789 123-456-789), qw(123-456789 123456-789 123-456789123456 123456789123-456 123-4-5-6-7-8-9-1-2-3-456 123-4-5-6-789 1234-56789 123456789-123456 123456789 123456789123456), qw(x123-456-789x x123456-789x x123456789x), qw(x123-456789123-456x x123456-789123456x x123456789123456x), ); my @no_matches = ( # none shall pass! '', qw(a abcdefgh abcdefghi abcdefghijklmno abcdefghijklmnop - -------- --------- --------------- ---------------- 12345678 1234567891234567 123-45-678 123-4567891234-567 -123456789 123456789- 1-23456789 12345678-9 1-2345678-9 12-3456789 1234567-89 12-34567-89 -1234567891234567 1234567891234567- 1-234567891234567 123456789123456-7 12345678-91234567 12-34567891234567 12345678912345-67 1--23456789 12--3456789 123--456789 1234--56789 123456--789 1234567--89 12345678--9 1--234567891234567 12--34567891234567 123--4567891234567 1234567--891234567 1234567891234--567 12345678912345--67 123456789123456--7 1--23456789123456 12--3456789123456 123--456789123456 1234567--89123456 1234567891234--56 12345678912345--6 ), ); # one big string from which to extract embedded sub-strings. my $xtr_string = q{ x100 456 789x101.456.789 x102-456-789 x103-456789xx104456-789 xx105-456789123456 x106456789123-456 x107-4-5-6-7-8-9-1-2-3-456 x108-4-5-6-789 x1094-56789 (110456789-123456), (111456789) (112456789123456,) no sub-string after this point should be matched/extracted a abcdefgh abcdefghi abcdefghijklmno abcdefghijklmnop - -------- --------- --------------- ---------------- x99945678x x9994567891234567x x999-45-678x x999-4567891234-567x x-999456789x x999456789-x x9-99456789x x99945678-9x x99-9456789x x9994567-89x x-9994567891234567x x9994567891234567-x x9-994567891234567x x999456789123456-7x x99-94567891234567x x99945678912345-67x }; # sub-strings will be extracted in order from string above. my $ar_xtr_list = [ '100 456 789', qw(101.456.789 102-456-789 103-456789 104456-789 105-456789123456 106456789123-456 107-4-5-6-7-8-9-1-2-3-456 108-4-5-6-789 1094-56789 110456789-123456 111456789 112456789123456), ]; MATCHER: for my $matcher (qw(m1)) { note "\n-------- matching with $matcher() --------\n\n"; *match = do { no strict 'refs'; *$matcher; }; note "ALL the following should match"; for my $n (@matches) { my $m = match($n); ok $m, qq{match: '$n'}; } note "NONE of the following should match"; for my $n (@no_matches) { my $no_m = ! match($n); ok $no_m, qq{NO match: '$n'}; } } # end for MATCHER is_deeply [ m1($xtr_string) ], $ar_xtr_list, qq{list extraction}; # subroutines ###################################################### sub m1 { my ($string, ) = @_; my $diff = qr{ [-. ] }xms; # differentiator chars my $d_min = qr{ \d{3,} }xms; # minimum group of digits local our ($digits, $diffs); use re 'eval'; my $ndn = qr{ # cannot begin after digit or any differentiator char (?<! \d) (?<! $diff) ($d_min # begin group 1 (main pattern) with minimum digits ($diff?) # group 2: 1 or 0 differentiator chars. # begin counting differentiator chars if one was present (?{ $diffs = length $^N; }) # (?{ print qq{A: \$^N '$^N' diffs $diffs \n} }) # match digit(s)/single-diff groups, count each diff (?: \d+ \g{-1} (?= \d) (?{ $diffs && ++$diffs }) # (?{ print qq{B: n_diffs $diffs \n} }) )* # end group 1 (the main pattern match) $d_min) # main pattern cannot be followed by a digit... (?! \d) # ...or by the diff char, or by any diff char if none present (?(?{ $diffs }) (?! \g{-1}) | (?! $diff)) # (?{ print qq{C: \$-[1] $-[1] \$+[1] $+[1] diffs $diffs \n} + }) # qualify potential main pattern for digits/diffs (?(?{ $digits = $+[1] - $-[1] - $diffs; # print qq{D: \$-[1] $-[1] \$+[1] $+[1] \$1 '$1' }, # qq{\$^N '$^N' \$2 '$2' digits $digits }, # qq{diffs $diffs \n}; $diffs > 10 || $digits > 15 || $digits < 9 }) (*FAIL)) }xms; # return only group 1 captures. return do { my $i; grep ++$i % 2, $string =~ m{ $ndn }xmsg; }; } # end sub m1()

Output:

c:\@Work\Perl\monks\Anonymous Monk\1027898>perl extract_9-15_digits_4. +pl # # -------- matching with m1() -------- # # ALL the following should match ok 1 - match: '123 456 789' ok 2 - match: '123.456.789' ok 3 - match: '123-456-789' ok 4 - match: '123-456789' ok 5 - match: '123456-789' ok 6 - match: '123-456789123456' ok 7 - match: '123456789123-456' ok 8 - match: '123-4-5-6-7-8-9-1-2-3-456' ok 9 - match: '123-4-5-6-789' ok 10 - match: '1234-56789' ok 11 - match: '123456789-123456' ok 12 - match: '123456789' ok 13 - match: '123456789123456' ok 14 - match: 'x123-456-789x' ok 15 - match: 'x123456-789x' ok 16 - match: 'x123456789x' ok 17 - match: 'x123-456789123-456x' ok 18 - match: 'x123456-789123456x' ok 19 - match: 'x123456789123456x' # NONE of the following should match ok 20 - NO match: '' ok 21 - NO match: 'a' ok 22 - NO match: 'abcdefgh' ok 23 - NO match: 'abcdefghi' ok 24 - NO match: 'abcdefghijklmno' ok 25 - NO match: 'abcdefghijklmnop' ok 26 - NO match: '-' ok 27 - NO match: '--------' ok 28 - NO match: '---------' ok 29 - NO match: '---------------' ok 30 - NO match: '----------------' ok 31 - NO match: '12345678' ok 32 - NO match: '1234567891234567' ok 33 - NO match: '123-45-678' ok 34 - NO match: '123-4567891234-567' ok 35 - NO match: '-123456789' ok 36 - NO match: '123456789-' ok 37 - NO match: '1-23456789' ok 38 - NO match: '12345678-9' ok 39 - NO match: '1-2345678-9' ok 40 - NO match: '12-3456789' ok 41 - NO match: '1234567-89' ok 42 - NO match: '12-34567-89' ok 43 - NO match: '-1234567891234567' ok 44 - NO match: '1234567891234567-' ok 45 - NO match: '1-234567891234567' ok 46 - NO match: '123456789123456-7' ok 47 - NO match: '12345678-91234567' ok 48 - NO match: '12-34567891234567' ok 49 - NO match: '12345678912345-67' ok 50 - NO match: '1--23456789' ok 51 - NO match: '12--3456789' ok 52 - NO match: '123--456789' ok 53 - NO match: '1234--56789' ok 54 - NO match: '123456--789' ok 55 - NO match: '1234567--89' ok 56 - NO match: '12345678--9' ok 57 - NO match: '1--234567891234567' ok 58 - NO match: '12--34567891234567' ok 59 - NO match: '123--4567891234567' ok 60 - NO match: '1234567--891234567' ok 61 - NO match: '1234567891234--567' ok 62 - NO match: '12345678912345--67' ok 63 - NO match: '123456789123456--7' ok 64 - NO match: '1--23456789123456' ok 65 - NO match: '12--3456789123456' ok 66 - NO match: '123--456789123456' ok 67 - NO match: '1234567--89123456' ok 68 - NO match: '1234567891234--56' ok 69 - NO match: '12345678912345--6' ok 70 - list extraction ok 71 - no warnings 1..71

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1028549]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (3)
As of 2024-03-29 06:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found