PerlMonks  

Recognizing 3 and 4 digit number

by htmanning (Friar)
on Jan 02, 2017 at 00:58 UTC ( [id://1178784]=perlquestion )

htmanning has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I'm using the following to recognize 3 and 4 digit numbers.
my $digits_4 = qr{ \b \d{4} \b }xms;
$text =~ s{ ($digits_4) }
          {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;
my $digits_3 = qr{ \b \d{3} \b }xms;
$text =~ s{ ($digits_3) }
          {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;
It works, but it also tags numbers within a phone number 555-555-5555. How can I make it work only if there is a space before the 4 digits? That would preclude it from being recognized in a string of numbers such as dates and phone numbers. Thanks.

Replies are listed 'Best First'.
Re: Recognizing 3 and 4 digit number
by kcott (Archbishop) on Jan 02, 2017 at 08:06 UTC

    G'day htmanning,

    Rather than drip-feeding us additional requirement changes, it would be much better if you started with something like this:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    use Test::More;

    my @tests = (
        ['12', '12'],
        ['123', '>123<>123<'],
        ['1234', '>1234<>1234<'],
        ['12345', '12345'],
        ['123 4567 890', '>123<>123< >4567<>4567< >890<>890<'],
        ['123 4567 89', '>123<>123< >4567<>4567< 89'],
        ['123-4567-890', '123-4567-890'],
        ['01/02/2017', '01/02/2017'],
        ['2017-01-02T17:01:34', '2017-01-02T17:01:34'],
        ["12\n345\n6789\n0", "12\n>345<>345<\n>6789<>6789<\n0"],
    );

    plan tests => scalar @tests;

    my $re = qr{(?x: (?<![/-]) \b ( [0-9]{3,4} ) \b (?![/-]) )};

    for my $test (@tests) {
        my ($string, $exp) = @$test;

        (my $got = $string) =~ s/$re/>$1<>$1</g;

        is($got, $exp, "Testing: $string");
    }

    All of those tests were successful.

    This helps both you and us. You can add examples of representative input and the wanted output. There's a clear indication of the test data used along with expected and actual results. You can add new tests if necessary; tweak the regex if required; and ensure previous tests still pass. If you run into difficulties, we have all the information we need to provide immediate help. You get a faster, useful response and we don't have the frustration of an ever changing specification.

    As I said above, all of those tests were successful. If my test data is fully representative of your data, and my expectations match yours, then you may have a solution. However, if you have other use cases (the more likely scenario), modify the code above, change the regex if need be, and get back to us if you have further problems.

    Here are some notes on your code and what I did differently.

    Modifiers
    You've used a lot of modifiers, most in three places, and most are unnecessary.
    • x: you can specify this once, as I did, with qr{(?x: ... )}. You could have done the same with m & s if they were needed (see the next two points).
    • m: you haven't used any assertions regarding the start/end of line/string - this one is unnecessary. My last test shows this: it has four lines and substitutions occur correctly on lines 2 and 3.
    • s: you haven't used a '.' in the regex; this modifier allows '.' to (also) match newlines - this one is unnecessary.
    • g: this one is fine (although see Source Data below regarding using it twice).
    • See also: "perlre: Modifiers".
    Captures
    Instead of wrapping your regex in a capture as part of the substitution, add it to the regex when created, cf. qr{... ( [0-9]{3,4} ) ...} in my code. This would have removed the problem discussed elsewhere in this thread.
    Source Data
    You probably don't want two lots of substitutions on the same string ($text). In my code, [0-9]{3,4} handles all the use cases; of course, you may have other use cases.

    See also: "perlre: Lookaround Assertions" and "perlrecharclass: Bracketed Character Classes".
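The points about modifiers can be checked with a short sketch. It compiles the same pattern with and without /m and /s and compares the results on multi-line input. The helper name tag() and the sample input are assumptions made for illustration and are not part of the original post.

```perl
use strict;
use warnings;

# Same pattern compiled with and without the /m and /s modifiers.
# Neither version uses ^, $ or '.', so both should behave identically.
my $plain = qr{(?x:   (?<![/-]) \b ( [0-9]{3,4} ) \b (?![/-]) )};
my $xms   = qr{(?xms: (?<![/-]) \b ( [0-9]{3,4} ) \b (?![/-]) )};

# tag() is a hypothetical helper: it wraps every match in > and <.
sub tag {
    my ($re, $string) = @_;
    (my $copy = $string) =~ s/$re/>$1</g;
    return $copy;
}

my $input = "12\n345\n6789\n0 555-555-5555";

print tag($plain, $input) eq tag($xms, $input) ? "same\n" : "different\n";
print tag($plain, $input), "\n";
```

Because the capture group lives inside the qr{} object, the substitution can use $1 directly without wrapping $re in another pair of parentheses.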

    — Ken

Re: Recognizing 3 and 4 digit number
by BrowserUk (Patriarch) on Jan 02, 2017 at 01:10 UTC

    I can't help but think there is more to this requirement than you've specified, but based on what you've asked for, plus a little bit more, try:

    /\s\d{3,4}\s/ and print for 'abd 123 fred', '555-5555-6666', 'ab 12345 xd';;
    abd 123 fred

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I put a backslash s in front of the backslash d and it works, but puts a percent 20 in the url. It still works but there must be another way. Thank you.

        If you don't want the space in the substitution, don't include it in the capture group!

        $ perl -E'
            my $re = qr/ \s ( \d{3,4} ) /x;
            say ">$1<" if " 5678" =~ /$re/;
        '
        >5678<


        The way forward always starts with a minimal test.
        Impossible :) nothing in that snippet does "value" encoding/escaping
Re: Recognizing 3 and 4 digit number
by davido (Cardinal) on Jan 02, 2017 at 16:44 UTC

    Just use negative look-arounds to ensure that you do not have a digit on either side of a 3 or 4 digit number:

    my @strings = (
        'foo1234bar',
        '1234 5678',
        'abcd9012f123ab',
        '123',
        ' 123',
        '123 ',
    );

    foreach my $string (@strings) {
        my(@nums) = $string =~ m/(?<!\d)(\d{3,4})(?!\d)/g;
        local $" = ',';
        print "<$string>: (@nums)\n";
    }

    The output:

    <foo1234bar>: (1234)
    <1234 5678>: (1234,5678)
    <abcd9012f123ab>: (9012,123)
    <123>: (123)
    < 123>: (123)
    <123 >: (123)

    An advantage of using negative lookarounds is that you don't have to explicitly accommodate conditions such as the start or end of string or line. The negative lookarounds are just saying "a digit cannot come immediately before or after a sequence of 3 or 4 digits". With positive lookarounds you would have to say "either a non-digit or the start of a line must come before, and either a non-digit or the end of a line must come after, a sequence of 3 or 4 digits." That would look something like this (untested):

    m/(?:^|(?<=\D))(\d{3,4})(?=$|\D)/mg

    So rather than asserting what must come before and after the digits, the regexp becomes simpler if we just assert what cannot come before or after.
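A minimal sketch comparing the two formulations on the sample strings from this reply, plus one five-digit string. The positive form is written here as (?:^|(?<=\D)) so that the lookbehind stays fixed-width, which is an assumption of this sketch. The helper name numbers_in() is also hypothetical.

```perl
use strict;
use warnings;

# Negative form: no digit may sit on either side of the number.
my $negative = qr/(?<!\d)(\d{3,4})(?!\d)/;

# Positive form: a line start or a non-digit must come before,
# and a line end or a non-digit must come after.
my $positive = qr/(?:^|(?<=\D))(\d{3,4})(?=$|\D)/m;

# numbers_in() is a hypothetical helper returning all captures, comma-joined.
sub numbers_in {
    my ($re, $string) = @_;
    my @nums = $string =~ /$re/g;
    return join ',', @nums;
}

for my $string ('foo1234bar', '1234 5678', 'abcd9012f123ab',
                '123', ' 123', '123 ', '12345') {
    printf "<%s>: negative=(%s) positive=(%s)\n",
        $string, numbers_in($negative, $string), numbers_in($positive, $string);
}
```

Both patterns should report the same numbers for every string. The negative form simply needs fewer moving parts to get there.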


    Dave

      G'day Dave,

      At first glance, I thought your regex was better than mine and so I decided to try it. I plugged it into my code but it failed on the phone number and date tests. The OP requirements are not the best but excluding phone numbers and dates seems to be definitely wanted.

      — Ken

        Wah! I guess I got excited and answered before noticing that we wanted to disqualify things that look like phone numbers. Sorry.

        This isn't tested:

        m/(?<![\d-])(\d{3,4})(?![\d-])/

        But it would run afoul of phone numbers using commas to separate, or wrapping parens around area codes.

        It might be useful to take a first pass and keep a list of offsets for "numbers" that should be ignored. It's probably easier to match a phone number with existing libraries than to match a 3 or 4 digit number that is not part of a phone number. In other words, on first pass, identify phone numbers, IP addresses, and other problematic numbers, and push their offsets and lengths into an array. Then on second pass disqualify any number that falls within one of the offset/length sets.
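The two-pass idea could be sketched roughly as follows. The $ignore pattern is a deliberately naive stand-in for "phone numbers and dates" and is purely an assumption. As the reply suggests, an existing library would probably do that part better. The helper name find_unit_numbers() is hypothetical as well.

```perl
use strict;
use warnings;

# Assumption: a naive pattern for spans to ignore (US-style phone numbers
# and slash- or dash-separated dates). Real code would likely use a library.
my $ignore = qr{ \(? \d{3} \)? [-.,\s] \d{3} [-.,] \d{4}    # phone number
               | \d{1,4} [/-] \d{1,2} [/-] \d{1,4}          # date
               }x;

# find_unit_numbers() is a hypothetical helper name.
sub find_unit_numbers {
    my ($text) = @_;

    # First pass: record the start and end offsets of every span to ignore.
    my @skip;
    while ($text =~ /$ignore/g) {
        push @skip, [ $-[0], $+[0] ];
    }

    # Second pass: keep 3 or 4 digit numbers that lie outside every span.
    my @found;
    while ($text =~ /(?<!\d)(\d{3,4})(?!\d)/g) {
        my ($start, $end) = ($-[1], $+[1]);
        next if grep { $start >= $_->[0] && $end <= $_->[1] } @skip;
        push @found, $1;
    }
    return @found;
}

my $sample = 'Call (555) 555-5555 about unit 1204 by 01/02/2017, or 310.';
print join(',', find_unit_numbers($sample)), "\n";
```

The offsets come from the built-in @- and @+ arrays, so nothing outside core Perl is needed. Only the quality of the "ignore" pattern limits how well this works.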


        Dave

Re: Recognizing 3 and 4 digit number
by AnomalousMonk (Archbishop) on Jan 02, 2017 at 18:49 UTC

    Update: On second thought, this post is really more like a reply to kcott's Re: Recognizing 3 and 4 digit number and probably should have been posted as such originally. Oh, well...

    htmanning: My remarks are further to the careful and detailed remarks of kcott here and, I hope, are in the same spirit.

    I certainly agree with the recommendation (and its rationale) of doing development and posing questions to your fellow monks in a Test::More framework.

    I tend to differ with kcott in the area of regex best practice. All the following are certainly personal best practices in this area, and are based largely on the regex recommendations in Perl Best Practices (PBP) by TheDamian.

    kcott implies that one should avoid using the  /x /m /s modifiers where they are not necessary. I think they are (almost) always necessary: They clarify intent and make it easier to think about what a regex, that most slippery and counterintuitive of things, is doing. When dealing with regexes, the less you have to think about the better. The result is that almost without exception, every  qr// m// s/// operator I write ends up with an  /xms tail.

    The  /x modifier allows comments, saviours of sanity, in regexes. kcott suggests the embedded
        qr{(?x: pattern with whitespace )}
    usage where comments are needed. This is undesirable IMHO for two reasons: two opportunities for inadvertent literal spaces before and after the  (?x: ... ) expression, giving you, e.g.,
        qr{ (?x: pattern with whitespace ) }
    and potential brain-hurt. The alternate form
        qr{(?x) pattern with whitespace }
    is better, but still leaves room for a leading literal space to creep in:
        qr{ (?x) pattern with whitespace }
    Oops. Just write  qr{ ... }xms and be done with it.

    What if you want literal space characters in your regex when using the  /x modifier? I prefer the  [ ] usage over the  \  usage (which is hard to see and has to be explained: that's a backslash before a space, i.e. an escaped literal space). A string containing literal spaces can be represented as
        qr{ \Qstring with some literal spaces\E }xms
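As a small illustration of the options for literal spaces under /x, the following sketch tries four spellings against one string. The sample text and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $text = 'unit 1204 is vacant';

# Under /x a bare space in the pattern is ignored, so this one
# effectively looks for 'unit1204'.
my $ignored   = qr{ unit 1204 }xms;

# Three ways to ask for a real space under /x.
my $bracketed = qr{ unit [ ] 1204 }xms;    # space in a bracketed class
my $escaped   = qr{ unit \  1204 }xms;     # backslash-escaped space
my $quoted    = qr{ \Qunit 1204\E }xms;    # \Q...\E quotes the space

for my $pair ( [ ignored => $ignored ], [ bracketed => $bracketed ],
               [ escaped => $escaped ], [ quoted    => $quoted ] ) {
    my ($name, $re) = @$pair;
    printf "%-9s %s\n", $name, $text =~ $re ? 'match' : 'no match';
}
```

Only the first pattern should fail to match, since its space is treated as layout rather than as part of the pattern.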

    The justification for always using the  /m /s modifiers is a bit different: They reduce the "degrees of freedom" of regex behavior.

    What does  . (dot) match? "Dot matches everything except a newline except where modified by the  /s modifier, in which case it matches everything." That's too much to think about. "Dot matches all" is a lot simpler, and that's what you get with the  /s modifier, even if you never use a dot operator. What if you actually want to match "everything but a newline"? Use  [^\n] in that case; it does the job and perfectly conveys your intention. I have sometimes seen  (?-s:.) and  (?s:.) used to invoke the different behaviors of dot. Don't. It's just more potential brain-hurt.
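A brief sketch of the three behaviours of dot described above, using a two-line string invented for the example:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

my ($default) = $text =~ m{ \A (.+) }x;         # without /s, dot stops at the newline
my ($dot_all) = $text =~ m{ \A (.+) }xms;       # with /s, dot matches everything
my ($no_nl)   = $text =~ m{ \A ([^\n]+) }xms;   # explicit "anything but a newline"

print "default: '$default'\n";
print "dot-all: '$dot_all'\n";
print "no-nl:   '$no_nl'\n";
```

The first and third captures should both be 'first line', while the second spans the whole string. With /s always on, [^\n] is the one place where the "not a newline" intent is written down.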

    Similarly, the behaviors of the  ^ $ operators are constrained | expanded by the  /m modifier. What if you want only their commonly used end-of-string behaviors? The  \A \z \Z operators were invented for this purpose.
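The difference between the line-oriented and string-oriented anchors can be seen in a short sketch. The input string, which ends in a newline, is an assumption made for the example.

```perl
use strict;
use warnings;

my $text = "unit 12\nunit 345\n";

# With /m, ^ and $ match at the start and end of every line.
my @per_line    = $text =~ m{ ^ unit \s (\d+) $ }xmsg;

# \A matches only at the very start of the string.
my ($at_start)  = $text =~ m{ \A unit \s (\d+) }xms;

# \Z matches at the end of the string or just before a final newline.
my ($before_nl) = $text =~ m{ (\d+) \Z }xms;

# \z matches only at the very end, so the trailing newline blocks it here.
my ($very_end)  = $text =~ m{ (\d+) \z }xms;

print "per line: @per_line\n";
print "at start: $at_start\n";
print "before final newline: $before_nl\n";
print "very end: ", defined $very_end ? $very_end : 'no match', "\n";
```

Each anchor keeps one fixed meaning regardless of the modifiers, which is the consistency argued for above.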

    With regard to the use of capture groups in  qr// operators: This is something else I try assiduously to avoid.

    Say you have two Regexp objects  $rx $ry with an embedded capture group in each. They might be used in a substitution:
        $string =~ s{ foo $rx bar $ry baz }{$1$2}xmsg;
    If you change the pattern match to
        $string =~ s{ foo $ry bar $rx baz }{$1$2}xmsg;
    do you also have to change the order of the capture variables  $1 $2 in the replacement string? The problem, of course, is that capture variables correspond in an absolute way to the order of capture groups in the  s/// match. The question is highlighted more sharply if the captures appear explicitly in the  s/// match:
        $string =~ s{ foo ($rx) bar ($ry) baz }{$1$2}xmsg;
    to
        $string =~ s{ foo ($ry) bar ($rx) baz }{$1$2}xmsg;  # switch $1 $2 also?
    The \gn relative back-reference extension of Perl release 5.10 eases the problem of capture group numbering somewhat, but capture group variables are still staunchly absolutist! (The  (?|alternation|pattern) construct of 5.10 also eases the capture group numbering problem a bit.)
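The numbering question can be made concrete with a sketch. The Regexp objects and sample strings below are hypothetical. The last part uses named captures (available since Perl 5.10), which is an alternative not raised in the post itself and is shown only as one possible way to stop depending on capture order.

```perl
use strict;
use warnings;

# Hypothetical Regexp objects, each carrying its own capture group.
my $rx = qr{ (\d+) }xms;
my $ry = qr{ ([a-z]+) }xms;

# $rx comes first, so $1 is the number and $2 is the word.
(my $first = 'foo 42 bar abc baz')
    =~ s{ foo \s $rx \s bar \s $ry \s baz }{$1-$2}xms;

# Swapping the objects silently swaps what $1 and $2 mean.
(my $second = 'foo abc bar 42 baz')
    =~ s{ foo \s $ry \s bar \s $rx \s baz }{$1-$2}xms;

# Named captures keep their meaning whatever the order.
my $nx = qr{ (?<num>  \d+    ) }xms;
my $ny = qr{ (?<word> [a-z]+ ) }xms;
(my $third = 'foo abc bar 42 baz')
    =~ s{ foo \s $ny \s bar \s $nx \s baz }{$+{num}-$+{word}}xms;

print "$first\n$second\n$third\n";
```

The first and third results should agree, while the second comes out reversed. That is the hazard of embedding numbered captures in reusable qr// objects.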


    Give a man a fish:  <%-{-{-{-<

      G'day AnomalousMonk,

      [Your Update just appeared as I hit [reply]. I think your post is fine where it is: htmanning gets a notification of your response with an alternative point of view and you had sent me a /msg anyway, so I was aware of it (thanks for that).]

      "I tend to differ with kcott in the area of regex best practice."

      While we certainly differ in some areas, I don't think the gulf is as wide as you suggest. I had originally intended to mention PBP in my post: I had a very long (over an hour) interruption in the middle of typing it and, when I finally returned to it, forgot to include the PBP part. My response below covers the points I wanted to make.

      I was very impressed with PBP when I first read it over a decade ago — in fact, I read it cover-to-cover twice — and started using most (if not all) of its recommendations in my code. I suspect that, 10 years ago, our views on "regex best practice" may have been perfectly aligned. I still use much of PBP; although, these days, it's just become part of my standard practices and I don't really think of it in terms of following those specific recommendations. One area that I have departed from is adding /msx to the end of every regex.

      "kcott implies that one should avoid using the  /x /m /s modifiers where they are not necessary. I think they are (almost) always necessary: ..."

      I wasn't trying to imply anything as strong as "should avoid"; rather, my comments were intended to convey something closer to "could avoid".

      Many organisations have Perl coding standards based on PBP. These are often quite inflexible: "You must write your matches like this: m{...}msx!". On the odd occasion that I've been faced with this, especially for short-term contracts, I just take the pragmatic approach and do it. Unfortunately, many of the programmers have no idea why they're doing this: I consider this to be a real problem. So, use all of those modifiers if your pay packet relies on it, but understand what they do and which are really required for the code being written.

      I think we're pretty much on the same page with /x, so I'll say no more about that.

      We definitely seem to be at odds with /m and /s. Perhaps it's a function of the type of data we normally process but I rarely need those: sometimes I need one of them; I need both far less often. There's not a lot more I can say about that: "(almost) always necessary" is not my experience.

      Using the qr{(?mods:...)} form over the qr{...}mods form is something of a personal preference. I've only been using it for a year or two. The latter form makes the modifiers global: you can't get finer control such as qr{(?mo:...)(?ds:...)} or qr{(?mo:...(?ds:...)...)}. Having said that, my requirements for such fine control are exceptionally limited. I really have no strong feelings regarding which form people choose to use. I don't think your arguments against using qr{(?mods:...)} because of potential typos are particularly compelling: I'm far more likely to not release the Shift key quickly enough and terminate a statement with a colon (and that can be a much harder bug to track down).

      Whether or not it's a good idea to include captures in qr// is a matter of context: hardly something to be "assiduously" avoided. Where it's used like I did (s/$re/.../), there's no problem. The issue with the OP code was capturing the entire match (s/($re)/.../) when only part of the match was wanted in $1.

      — Ken

        We definitely seem to be at odds with /m and /s. ... I rarely need those: sometimes I need one of them; I need both far less often.

        My motive for always using the  /ms modifier cluster (in addition to /x, of course) is to foster clarity, and clarity is always a necessity :) Clarity is improved because the  . ^ $ operators have unvarying behaviors. Sometimes one is forced to be devious and must sacrifice clarity of expression, but that's what comments are for!

        ... the qr{(?mods:...)} form over the qr{...}mods form ... The latter form makes the modifiers global: you can't get finer control such as qr{(?mo:...)(?ds:...)} or qr{(?mo:...(?ds:...)...)}.

        The docs say this finer control is possible:  (?mo-ds) and  (?mo-ds:pattern) are rigorously scoped:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = qq{aa \n bb \n cc};
         ;;
         print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]}
           if $s =~ m{ \A ((?-s: .+)) .+ ((?-s: .+)) .+ ((?-s: .+)) \z }xms;
         ;;
         print qq{B: match, \$1 '$1' @ $-[1]}
           if $s =~ m{ \A ((?-s: .+)) \z }xms;
         ;;
         print qq{C: match, \$1 '$1' @ $-[1]}
           if $s =~ m{ \A ((?-s: (?s: .+))) \z }xms;
        "
        A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11
        C: match, $1 'aa
         bb
         cc' @ 0
        (Tricky to put together a meaningful example for this!)

        That said, I would never write regex A as above, but rather as:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = qq{aa \n bb \n cc};
         ;;
         print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]}
           if $s =~ m{ \A ([^\n]+) .+ ([^\n]+) .+ ([^\n]+) \z }xms;
        "
        A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11
        Don't mess with dot (or  ^ $ either): much less potential for brain-hurt.

        Update: Another version of regex A:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = qq{aa \n bb \n cc};
         ;;
         print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]}
           if $s =~ m{ \A (?-s) (.+) (?s) .+ (?-s) (.+) (?s) .+ (?-s) (.+) \z }xms;
        "
        A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11
        In the context of global dot-matches-newline behavior, successive  (?-s) and  (?s) turn newline matching off and on, respectively. Again, I wouldn't actually write a regex this way unless my feet were being held to the fire.


        Give a man a fish:  <%-{-{-{-<

Re: Recognizing 3 and 4 digit number
by tybalt89 (Monsignor) on Jan 02, 2017 at 02:06 UTC
    my $digits_4 = qr{ (?<=\ ) \d{4} \b }xms;
    my $digits_3 = qr{ (?<=\ ) \d{3} \b }xms;
      Okay, this worked BUT I just realized I cannot rely on a space to signal a valid number. Sometimes the number starts the text field so there is no space. I'm trying to recognize only those numbers that aren't followed or preceded by a slash, dash, etc., that would indicate a phone number or date.
        my $digits_4 = qr{ (?<![-\/\ \w]) \d{4} (?![-\/\ \w]) }xms;

        untested...

Node Type: perlquestion [id://1178784]
Approved by Athanasius
Front-paged by kcott