http://www.perlmonks.org?node_id=1054871


in reply to Looking for a flexible regex...

There'll be more to it, but

$re = qr[^\s*(?:\d+-\d+)(?:\s*,\s*\d+-\d+)*\s*$];;

print "$_ : ", $_ =~ $re ? 'ok' : 'bad' for
    '1-5, 20-250, 37000-41000', '1-25', '1-25,67-324',
    '1-15, 76-102, 56-98', '1-25,28-43.5', '1-2a, 45-98';;
1-5, 20-250, 37000-41000 : ok
1-25 : ok
1-25,67-324 : ok
1-15, 76-102, 56-98 : ok
1-25,28-43.5 : bad
1-2a, 45-98 : bad

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Looking for a flexible regex...
by Anonymous Monk on Sep 19, 2013 at 15:46 UTC
    beautiful! Thanks -- exactly what I was looking for.

      Does it matter to your requirements if ranges are backwards like "5-1,10-8", or if they overlap like "1-6,3-8"? If it does, you probably won't be able to do that with a regex (though if it can be done, someone here will know how). You may need to split the string on commas and then check each section for correctness.

      Aaron B.
      Available for small or large Perl jobs; see my home node.

Re^2: Looking for a flexible regex...
by Anonymous Monk on Sep 22, 2013 at 03:45 UTC
    Sorry -- I found some cases where I need to capture a single number as well.

    Example: 1-6,27,105-170,512,670-675

    How would I modify your regex?

    qr[^\s*(?:\d+-\d+)(?:\s*,\s*\d+-\d+)*\s*$]

    Also, I'm curious what is going on in your regex -- I've heard about clustering, but never really made use of it before. What's the benefit of using '?:' here? Also, how does the regex know how to process a string of indeterminate length?

    And to answer the other questions -- I do have areas of the script that will error if the ranges are in the wrong order.

      See, I said there'd be more to it :)

      How would I modify your regex? ... I do have areas of the script that will error if the ranges are in the wrong order

      Given that you now are seeking not to just validate a string, but will need to break that string up into its components, I wouldn't use a regex. I'd probably do this:

      $s = "1-6,27,105-170,512,670-675";;

      @ranges = map{
          my( $lo, $hi ) = split '-', $_;
          $hi //= $lo;
          die 'Bad input' if $lo > $hi;
          [ $lo, $hi ]
      } split ',', $s;;

      print "@$_" for @ranges;;
      1 6
      27 27
      105 170
      512 512
      670 675

      Note that converting single positions to ranges of length 1 avoids the need for special-casing later in the code.
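To illustrate why the uniform representation helps, here is a minimal sketch that counts the positions covered by the list. The `@ranges` data is copied by hand from the output above, and `$total` is just an illustrative name.

```perl
use strict;
use warnings;

# Ranges as produced by the snippet above: single positions are stored as [ n, n ].
my @ranges = ( [ 1, 6 ], [ 27, 27 ], [ 105, 170 ], [ 512, 512 ], [ 670, 675 ] );

# One expression handles both cases; no "is this a single number?" branch is needed.
my $total = 0;
$total += $_->[1] - $_->[0] + 1 for @ranges;

print "positions covered: $total\n";    # 6 + 1 + 66 + 1 + 6 = 80
```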

      Now, whether that alone is sufficient validation will depend upon where the string is coming from and what you are doing with the ranges later in the script.

      For example, in genomic work, these types of range lists are often (usually) the output from some previous process (Blast or similar), and are thus pretty much guaranteed to be correct; i.e. properly ordered, sorted, non-overlapping etc.

      But, if this was manual input from a user, you might need to be more stringent. Then you have to decide what to do if the user enters:

      1. malformed ranges;
      2. overlapping ranges;
      3. correctly formed but disordered ranges;

      Some of those you could correct automatically -- e.g. by sorting the list -- for others you'd have to report the errors and either die or prompt for corrections.
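A rough sketch of how those three cases might be handled, assuming the policy is "sort silently, die on everything else". The helper name `parse_ranges` and the error messages are made up for illustration, and treating ranges that share an endpoint (such as 1-6,6-8) as overlapping is an assumption you may want to change.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a range list, die on malformed items,
# sort the ranges, and die if any two of them overlap.
sub parse_ranges {
    my ($s) = @_;
    my @ranges;
    for my $item ( split /\s*,\s*/, $s ) {
        # 1. Malformed items cannot be fixed automatically, so report them.
        $item =~ /^\s*(\d+)(?:-(\d+))?\s*$/
            or die "Malformed range: '$item'\n";
        my ( $lo, $hi ) = ( $1, $2 // $1 );    # a single number becomes [ n, n ]
        die "Backwards range: '$item'\n" if $lo > $hi;
        push @ranges, [ $lo, $hi ];
    }

    # 3. Disordered ranges can be fixed by sorting on the start position.
    @ranges = sort { $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] } @ranges;

    # 2. After sorting, an overlap shows up as a start that is not
    #    past the end of the previous range.
    for my $i ( 1 .. $#ranges ) {
        my ( $prev, $cur ) = @ranges[ $i - 1, $i ];
        die "Overlapping ranges: $prev->[0]-$prev->[1] and $cur->[0]-$cur->[1]\n"
            if $cur->[0] <= $prev->[1];
    }
    return @ranges;
}

print join( ' ', map { "$_->[0]-$_->[1]" } parse_ranges('105-170, 1-6, 27') ), "\n";
# prints: 1-6 27-27 105-170

print "error: $@" unless eval { parse_ranges('1-6,3-8'); 1 };
# prints: error: Overlapping ranges: 1-6 and 3-8
```

Whether to die, warn, or merge overlapping ranges is the same value judgement as above; the sketch only shows where each check would go.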

      Of course, some people would apply the stringent tests even for input coming from a program that is "pretty much guaranteed not to make those mistakes"; and that's a value judgement you'll have to make yourself.

      Also, I'm curious what is going on in your regex -- I've heard about clustering, but never really made use of it before.

      It groups sub-elements of the regex so that one can apply quantifiers that affect that sub-group collectively, rather than individual elements.
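A small sketch of the difference, using simplified patterns (no whitespace handling) and made-up variable names. Plain `(...)` groups and also captures what it matched; `(?:...)` only groups, which is all a validation regex needs.

```perl
use strict;
use warnings;

my $s = '1-25,67-324';

# Capturing groups: each (...) saves its match, so a list-context match returns them.
# A repeated group keeps only its last iteration.
my @captured = $s =~ /^(\d+-\d+)(,\d+-\d+)*$/;

# Non-capturing groups: (?:...) only groups, so nothing is saved.
# A successful match with no capture groups returns the list (1).
my @grouped = $s =~ /^(?:\d+-\d+)(?:,\d+-\d+)*$/;

print "capturing:     @captured\n";    # 1-25 ,67-324
print "non-capturing: @grouped\n";     # 1

# Quantifier scope: without a group, * applies only to the element just before it.
my $ungrouped_ok = 'ababab' =~ /^ab*$/     ? 1 : 0;    # 0: means 'a' followed by any number of 'b'
my $grouped_ok   = 'ababab' =~ /^(?:ab)*$/ ? 1 : 0;    # 1: means 'ab' repeated
print "ungrouped: $ungrouped_ok, grouped: $grouped_ok\n";
```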

      how does the regex know how to process a string of indeterminate length?

      This is an expanded explanation of my original regex:

      qr[
          ^\s*                ## from the start of the string, skip whitespace (if any)
          (?: \d+ - \d+ )     ## then grab (at least one) pair of numbers separated by '-'
          (?:                 ## a group
              \s* , \s*       ### a comma, optionally preceded or followed by whitespace
              \d+ - \d+       ### and another pair of numbers separated by '-'
          )*                  ## zero or more times
          \s*$                ## to the end of string, optionally skipping whitespace if any
      ]x
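If you do still want a regex purely to validate a list that mixes ranges and single numbers, one possible variation (a sketch, not something tested against your real data) is to make the '-' and the second number optional with a nested non-capturing group. The `%result` hash is only there to collect the outcomes.

```perl
use strict;
use warnings;

my $re = qr[
    ^\s*
    \d+ (?: - \d+ )?        # a number, optionally extended to a range
    (?:
        \s* , \s*           # a comma, optionally surrounded by whitespace
        \d+ (?: - \d+ )?    # another number or range
    )*                      # zero or more times
    \s*$
]x;

my %result;
for my $s ( '1-6,27,105-170,512,670-675', '27', '1-25,28-43.5', '1-,5' ) {
    $result{$s} = $s =~ $re ? 'ok' : 'bad';
    print "$s : $result{$s}\n";
}
# 1-6,27,105-170,512,670-675 : ok
# 27 : ok
# 1-25,28-43.5 : bad
# 1-,5 : bad
```

This only says whether the string is well formed; it does not check ordering or overlaps, and it does not pull the numbers out.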

        Thanks for the great explanation. I truly appreciate it. It's been several years since I worked in perl, and regex was never my strong point. ;) I've always loved how supporting/patient the community is here.