http://www.perlmonks.org?node_id=1025247

gautamparimoo has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I have a specification to build the following numeric string: 981890 or 981891 followed by 10 digits, or the same digits with at most 4 separators (., -, :) in between. Examples that should match:

1. 9818902365894598
2. 9818 9021 2454 2170
3. 9818-9145-6896-2146

The regex I am trying is as follows:

9818[\D]?9[0|1][\D]?\d{2}[\D]?\d{4}[\D]?\d{4}

But this looks inefficient. Please suggest improvements.

Replies are listed 'Best First'.
Re: Regex Help
by davido (Cardinal) on Mar 25, 2013 at 08:28 UTC

    More than looking inefficient, it looks unnecessarily cluttered, which can contribute to bugs:

    • [\D]? is the same as \D?. This is repeated four times.
    • [0|1] is probably a mistake; character classes don't use alternation, and I don't see any | characters in your sample input. You probably mean [01].
    • \D?\d{4} is repeated twice in a row. How about (?:\D?\d{4}){2} ?
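The second bullet can be checked with a small sketch. The variable names are made up for illustration, and the patterns are anchored only to keep the comparison clean:

```perl
use strict;
use warnings;

# Inside a character class, '|' is a literal character, not alternation.
my $with_pipe    = qr/\A9[0|1]\z/;
my $without_pipe = qr/\A9[01]\z/;

# Compare both classes against two valid inputs and one containing '|'.
for my $s ('90', '91', '9|') {
    printf "%-3s [0|1]: %-3s [01]: %s\n", $s,
        ($s =~ $with_pipe    ? 'yes' : 'no'),
        ($s =~ $without_pipe ? 'yes' : 'no');
}
```

Only the [0|1] version accepts the string "9|", which is almost certainly not intended.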

    Making those changes would yield:

    9818\D?9[01]\D?\d{2}(?:\D?\d{4}){2}

    Now with the /x modifier, you can further clarify things like this:

    m/
        9818        # A literal.
        \D?         # Optional non-digit.
        9           # A literal 9.
        [01]        # Require a zero or a one.
        \D?         # Another optional non-digit.
        \d{2}       # Require two digits.
        (?:         # Group but don't capture.
            \D?     # Another optional non-digit.
            \d{4}   # Followed by four digits.
        ){2}        # Repeated twice.
    /x
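As a quick sanity check, the simplified pattern can be run against the three sample inputs from the question. This is only a sketch; the variable names are illustrative:

```perl
use strict;
use warnings;

# The simplified pattern, compiled once with qr//.
my $re = qr/9818\D?9[01]\D?\d{2}(?:\D?\d{4}){2}/;

# The three sample inputs from the original question.
my @samples = (
    '9818902365894598',
    '9818 9021 2454 2170',
    '9818-9145-6896-2146',
);

# Report whether each sample matches.
for my $s (@samples) {
    printf "%-22s %s\n", $s, ($s =~ $re ? 'match' : 'no match');
}
```

All three samples should report a match. The pattern is unanchored, so it will also find a qualifying run inside a longer string.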

    As for efficiency, what problems are you encountering? If you're dealing with huge input you're probably IO bound anyway.


    Dave

      Thanks davido, your regex cleaned it up nicely. But what modifiers or assertions should I use so that it does not match across different lines in a text file, i.e. it only matches when the whole pattern appears on one line? Please tell.

        Replace \D? with a more explicit character class. \D will match anything that is not a numeric digit. Newlines (\n) are included in "anything that is not a numeric digit".
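A minimal sketch of that suggestion follows. It assumes the only allowed separators are a space, '.', '-' and ':' (the space is inferred from the sample input, not from the stated specification), and the names $sep, $re and $text are illustrative:

```perl
use strict;
use warnings;

# Assumption: a separator is one optional space, '.', ':' or '-'.
# A newline is not in the class, so it can never act as a separator.
my $sep = qr/[ .:-]?/;
my $re  = qr/9818${sep}9[01]${sep}\d{2}(?:${sep}\d{4}){2}/;

# The first number is split across two lines; the second is on one line.
my $text = "9818\n9021 2454 2170\n9818-9145-6896-2146\n";

# Collect every match in the text; none can span a line break.
my @found = $text =~ /($re)/g;
print "$_\n" for @found;
```

With \D? in place of the explicit class, the number split across the first two lines would also match, because \D accepts "\n". Reading the file line by line and matching each line separately would have the same effect.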


        Dave

Re: Regex Help
by hdb (Monsignor) on Mar 25, 2013 at 08:10 UTC

    If you had no separators, the regex would look like

    /98189(0|1)\d{10}/

    Do you have the option to remove the separators first?

    s/[\s.:,-]//g;

    Probably not worth it if your initial regex works satisfactorily.
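The strip-then-match idea could be sketched as below. The helper name is_valid_number is hypothetical, and the \A and \z anchors assume each input string holds exactly one candidate number:

```perl
use strict;
use warnings;

# Hypothetical helper: strip separators from a copy, then test the plain form.
sub is_valid_number {
    my ($input) = @_;

    # Remove whitespace, '.', ':', ',' and '-' from a copy of the input.
    (my $digits = $input) =~ s/[\s.:,-]//g;

    # 98189, then 0 or 1, then exactly ten more digits.
    return $digits =~ /\A98189[01]\d{10}\z/ ? 1 : 0;
}

print is_valid_number('9818-9145-6896-2146') ? "ok\n" : "rejected\n";
```

This approach does not check where the separators sit or how many there are, and because \s also removes newlines it should be applied one line at a time if numbers must not span lines.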

Re: Regex Help
by Anonymous Monk on Mar 25, 2013 at 07:55 UTC

    But this looks inefficient

    Does it work for your purposes?

Re: Regex Help
by AnomalousMonk (Archbishop) on Mar 25, 2013 at 19:19 UTC