http://www.perlmonks.org?node_id=200595

jmcnamara has asked for the wisdom of the Perl Monks concerning the following question:


In response to a CB request for a regex to match floating point numbers, I suggested the regex from perlfaq4*:     /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/

At that point jarich asked if the look-ahead ?= should be a clustering ?:.

In fact the regex seems to contain other unnecessary capturing as well. The following seems to be functionally equivalent but without the capturing:      /^[+-]?(?:\d|\.\d)\d*(?:\.\d*)?(?:[Ee][+-]?\d+)?$/

Is there any reason for this capturing in a FAQ about matching? Should it be changed?

Also, the decimal part of the floating point regex is different from the decimal regex on the previous line. Is there any reason for this inconsistency?

Here is the test frame I used:

    #!/usr/bin/perl -wl
    use strict;

    my @nums = qw(
        0e0 0 +0 -0 1. 0.14 .14 1.24e5 24e5 -24e-5
        2.3. 2.3.4 1..2
    );

    for (@nums) {
        # Print only if the match fails

        # perlfaq4 regex
        print "1: ", $_ if ! /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/;

        # perlfaq4 regex modified
        print "2: ", $_ if ! /^[+-]?(?:\d|\.\d)\d*(?:\.\d*)?(?:[Ee][+-]?\d+)?$/;

        # perlfaq4 decimal regex extended to match floating point
        print "3: ", $_ if ! /^[+-]?(?:\d+(?:\.\d*)?|\.\d+)([Ee][+-]?\d+)?$/;
    }

--
John.

* Regexp::Common was also suggested.

Replies are listed 'Best First'.
Re: Matching floats according to perlfaq4
by demerphq (Chancellor) on Sep 25, 2002 at 12:43 UTC
    (Quotes are out of order deliberately)

    Also, the decimal part of the floating point regex is different from the decimal regex on the previous line. Is there any reason for this inconsistency?

    I believe that the reason for the difference is that the two were written by different people with different intentions. I think the author of the floating point matcher originally wanted to be able to parse the value out into its components (and didn't convert it properly for general FAQ use, that or failed to mention the added bonus of the pattern).

    Note that $1 is the sign, $2 is the fractional part, $3 is the exponent part and $4 is the numeric part of the exponent. By adding an extra pair of parens we would have $1=sign, $2=integer, $3=fraction, $4=exponent, $5=exponent_number. This simplicity of extracting the component parts wouldn't be possible with the decimal matcher pattern (where would you put the capturing parens?)
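
    A minimal sketch of that extraction, assuming one extra pair of parens around the integer digits. The five-group variant and the helper name split_float are illustrative only and are not part of perlfaq4. Note that the fraction group keeps its leading dot, as in the FAQ pattern.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper: the perlfaq4 pattern with one extra capture
    # around the integer digits, so that all five parts are available.
    sub split_float {
        my ($str) = @_;
        return unless $str =~ /^([+-]?)(?=\d|\.\d)(\d*)(\.\d*)?([Ee]([+-]?\d+))?$/;
        # Groups that did not take part in the match are undef; map them to ''.
        return map { defined $_ ? $_ : '' } ($1, $2, $3, $4, $5);
    }

    for my $n (qw( -1.24e5 .14 24e-5 1. )) {
        my ($sign, $int, $frac, $exp, $exp_num) = split_float($n);
        print "$n: sign='$sign' int='$int' frac='$frac' exp='$exp' exp_num='$exp_num'\n";
    }
    ```

    Strings that are not floats (for example .1.1) make the helper return an empty list.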

    At that point jarich asked if the look-ahead ?= should be a clustering ?:.

    Absolutely not (as pope has pointed out, it will match non-numbers). ?= is a zero-width look-ahead assertion (similar to a \b). Thus it neither captures nor consumes any of the string (except that the matcher won't get to the stuff following it if it fails to match). Its utility in this situation is to ensure that the \d*(\.\d*)? part matches something (which it otherwise need not). I'm almost certain that this was to simplify the regex so that the above point (about parsing the number into its parts) would be possible. Neither the ?: variation nor the decimal number matcher has this splitting ability.

    Anyway that's my theory, but the more I look at the FAQ the more I think I'm right.
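
    A small sketch of the difference, using the two non-capturing patterns from this thread (the variable names and the choice of test strings are only for illustration). With ?: the group consumes the leading ".1", which leaves (?:\.\d*)? free to swallow a second ".1"; with ?= nothing is consumed, so only one dot can ever be matched.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Same pattern twice; only the construct after the sign differs.
    my $lookahead = qr/^[+-]?(?=\d|\.\d)\d*(?:\.\d*)?(?:[Ee][+-]?\d+)?$/;
    my $cluster   = qr/^[+-]?(?:\d|\.\d)\d*(?:\.\d*)?(?:[Ee][+-]?\d+)?$/;

    for my $s (qw( .14 1. .1.1 . )) {
        printf "%-5s lookahead=%-5s cluster=%s\n", $s,
            ($s =~ $lookahead ? 'match' : 'no'),
            ($s =~ $cluster   ? 'match' : 'no');
    }
    ```

    Only .1.1 separates the two: the clustering version accepts it, the look-ahead version rejects it.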

    --- demerphq
    my friends call me, usually because I'm late....

Re: Matching floats according to perlfaq4
by pope (Friar) on Sep 25, 2002 at 11:47 UTC
    The second regex matches .1.1 (add this to the test case), which is wrong.

    -- pope who is not a pope, or the pope


      You are right. A look ahead is required in the second regex:     /^[+-]?(?=\d|\.\d)\d*(?:\.\d*)?(?:[Ee][+-]?\d+)?$/

      --
      John.

      /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/

      ==

      /^[+-]?(\d+\.\d*|\d*\.\d+)(e[+-]?\d+)?$/i

      The look-ahead is to ensure that a '.' on its own isn't matched, but that (e.g.) '.04' and '4.' are. I'm not sure how correct the latter form is. The e+3, E-2 etc. bit may or may not match; I don't think (for example) '1.2e' on its own is meaningful.

        ...and of course I meant

        /^ [+-]? ( \d+ (\.\d*)? | \.\d+ ) (e[+-]?\d+)? $/ix

        ... really (whitespace added for readability)

        Of course it won't match 1.2e, so that isn't a worry. And 4. certainly is meaningful (to perl anyway). Try it...
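
        One way to try it is to ask perl itself. A minimal sketch using looks_like_number from the core Scalar::Util module; its idea of a number is broader than the regexes in this thread (it also accepts things like leading whitespace and "Inf"), so treat it as a cross-check rather than an equivalent test.

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Scalar::Util qw(looks_like_number);

        # Ask perl's own numeric parser about the forms discussed above.
        for my $s ('4.', '.04', '1.2e', '.') {
            printf "%-6s %s\n", "'$s'",
                looks_like_number($s) ? 'numeric' : 'not numeric';
        }
        ```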

        Also as I pointed out in my earlier reply the two regexes are _not_ == to each other. They are similar in that they match or reject similar data, however the parts that they match and the way they get captured for later use are radically different.

        For instance

        /([A-Za-z_0-9])(\w)(\w)/
        matches/rejects the same data as
        /(\w{3})/
        But the utility of the two is totally different....
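
        A short sketch of that point, assuming plain ASCII input (the sample string and variable names are illustrative). Both patterns accept the string, but what ends up in the capture variables is quite different.

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        my $str = 'abc';

        # In list context a match returns its captures.
        my @three = $str =~ /([A-Za-z_0-9])(\w)(\w)/;   # three one-character captures
        my @one   = $str =~ /(\w{3})/;                   # one three-character capture

        print "three groups: @three\n";   # a b c
        print "one group:    @one\n";     # abc
        ```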

        :-)

        --- demerphq
        my friends call me, usually because I'm late....

Re: Matching floats according to perlfaq4
by BrowserUk (Patriarch) on Sep 26, 2002 at 00:25 UTC

    At the risk of once again showing my lack of aptitude for regexes, below is what I have currently settled on. It handles all the cases you list, accepting or rejecting them as appropriate.

    This seems like a good opportunity to get another set of eyes to look it over.

    my $num = qr! [+-]? (?: \d*? )? (?: (?<=\d)\.? | \.?(?=\d) ) (?: \d*?)? (?: \d[Ee] (?: [+-]?\d+ ) )? !ox;

    I'm sure there is something wrong with it, but so far it has passed everything I have thrown at it.

    It was pointed out to me (thank you sauoq) that this only works if used in a context where it is tightly anchored on both ends.

    Minimally, m/^$num$/o.

    Note: \b won't work.
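
    To illustrate the anchoring caveat: every piece of $num is optional except the look-around alternation, and that alternation consumes nothing when there is no dot, so an unanchored match can succeed on an empty string in front of any digit. A minimal sketch, with the /o flag dropped because the pattern interpolates no variables, and with test strings chosen only for illustration:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $num = qr! [+-]? (?: \d*? )? (?: (?<=\d)\.? | \.?(?=\d) ) (?: \d*?)? (?: \d[Ee] (?: [+-]?\d+ ) )? !x;

    # Compare the bare pattern with the tightly anchored form.
    for my $s (qw( 1.24e5 2.3.4 1..2 )) {
        printf "%-7s unanchored=%-5s anchored=%s\n", $s,
            ($s =~ /$num/   ? 'match' : 'no'),
            ($s =~ /^$num$/ ? 'match' : 'no');
    }
    ```

    Unanchored, all three strings match; anchored, only 1.24e5 does.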

    For my current learning exercise, that is fine as it will always be used as a part of a larger regex that will so constrain it.

    If anyone wants to point out how to modify it to make it work without anchoring, I'd like to see and note the changes for my education.

    For 'real work' I would use Regexp::Common of course, especially given its heritage and guardianship. Now if only there was a Regexp::Uncommon that did everything else, I wouldn't have to learn regexes at all :)


    Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
      I don't like this regex. It seems to me to be more complicated (and thus less efficient) than the FAQ version, with no added benefit. For instance, why replace (?=\d|\.\d)\d*(\.\d*)? with all of this:
      (?: \d*? )? (?: (?<=\d)\.? | \.?(?=\d) ) (?: \d*?)? (?: \d
      (the last paren is deliberately left unterminated)

      --- demerphq
      my friends call me, usually because I'm late....

Re: Matching floats according to perlfaq4
by rbi (Monk) on Sep 26, 2002 at 10:54 UTC
    I use this routine, slightly changed from one posted here long ago (I think by vroom) that I can no longer find on the site. It seems to work on those tests.
    #!/usr/bin/perl -wl
    use strict;

    my @nums = qw(
        0e0 0 +0 -0 1. 0.14 .14 1.24e5 24e5 -24e-5
        2.3. 2.3.4 1..2
    );

    for (@nums) {
        print "ok: ", $_ if is_a_number($_);
    }

    ###############
    sub is_a_number {
    ###############
        my $var = $_[0];

        # Trim leading and trailing whitespace before testing
        $var =~ s/^\s+//;
        $var =~ s/\s+$//;

        if ($var =~ /^([+-]?)(\d+\.|\.\d+|\d+)\d*([Ee]([+-]?\d+))?$/) {
            return(1)
        } else {
            return(0)
        }
    }
    Regards, Roberto
(Regex Golf) Re: Matching floats according to perlfaq4
by sauoq (Abbot) on Sep 26, 2002 at 21:39 UTC
    /^[+-]?(?=\d|\.)\d*\.?\d*(?:[eE][+-]?\d+)?$/

    Update: Fixed to avoid matching "+."

    /^[+-]?(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?$/
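
    A quick sketch of what the update changes (variable names and test strings are illustrative). The first look-ahead, (?=\d|\.), is satisfied by a bare dot, and since everything after it is optional, "." and "+." slip through. The second, (?=\.?\d), insists on at least one digit.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $before = qr/^[+-]?(?=\d|\.)\d*\.?\d*(?:[eE][+-]?\d+)?$/;
    my $after  = qr/^[+-]?(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?$/;

    # Only the dot-without-digit cases should differ between the two.
    for my $s ('+.', '.', '.14', '1.') {
        printf "%-6s before=%-5s after=%s\n", "'$s'",
            ($s =~ $before ? 'match' : 'no'),
            ($s =~ $after  ? 'match' : 'no');
    }
    ```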

    Update: My test code with additional cases.

    #!/usr/bin/perl -w
    use strict;

    my @good = qw( 0e0 0 +0 -0 1. 0.14 .14 1.24e5 24e5 -24e-5);
    my @bad  = ('', qw(. +. 1e e1 2.3. 2.3.4 1..2 .1.1 e.1 e1.1 .1a 1.a 1.1a 1a +.1 1.a1));

    my $pat = qr/^[+-]?(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?$/;

    print "GOOD Tests (Should match.)\n";
    print /$pat/ ? ' ': 'no ', "match: '$_'\n" for @good;

    print "\nBAD Tests (Should not match.)\n";
    print /$pat/ ? ' ': 'no ', "match: '$_'\n" for @bad;

    Update: This can be made even shorter than the solution posted by an Anonymous Monk if you change it to use capturing parens and the /i modifier instead of the [Ee] character class as he does in his.

    /^[+-]?(?=\.?\d)\d*\.?\d*(e[+-]?\d+)?$/i

    In response to and agreement with demerphq's reply, I don't actually recommend this last one at all. I just wanted to show that it could be golfed further.

    -sauoq
    "My two cents aren't worth a dime.";
    
      I like your modification of the FAQ version, but I'd prefer to write it as
      /^[+-]?(?=\.?\d)\d*(?:\.\d*)?(?:[Ee][+-]?\d+)?$/
      as it makes more sense to me and I don't like the /i modifier. (Also I don't see any reason to capture the last block if you aren't going to capture the others... :-)

      Although I'm aware you were trying to make the regex shorter... Personally I wouldn't.

      --- demerphq
      my friends call me, usually because I'm late....

        Thanks. To be clear, I prefer that one too. I wasn't even trying to make a shorter version, just a simpler version.

        The only reason I put that one up was because I had already prefixed "(Regex Golf)" to the node after I noticed mine was shorter than the ones in the OP but then later realized that the Anonymonk's was shorter than mine due to his ugly little tricks. So, in a moment of poor judgement, I showed that I could use those ugly little tricks too. :-)

        -sauoq
        "My two cents aren't worth a dime.";