http://www.perlmonks.org?node_id=326871

ChrisR has asked for the wisdom of the Perl Monks concerning the following question:

I have been struggling with a simple (or so I thought) regex for a day now and have only succeeded in confusing myself. Here's the scenario:
A user supplied string may contain 1 or more of the characters smtwhfa in any order. That's as easy as
$string =~ /^[smtwhfa]*$/
But this allows for duplicate characters and I need to be sure that no character is in the string more than once. For example:
swma is OK
smsa is NOT OK

I tried many variations of what I thought could work but to no avail. So, of course man perlre was my next stop. Then searches on Perl Monks and Google and then the camel book. The one thing that seemed to be the answer was a negative lookahead assertion but I don't really understand it. I'm not even sure if this the road I need to take.
Here is some of what I tried:
$string =~ /^[smtwhfa]*(?![smtwhfa])$/ $string =~ /^[s{1}m{1}t{1}w{1}h{1}f{1}a{1}]*$/ $string =~ /^([smtwhfa])(?!\1)*$/
And numerous other more embarrassing attempts.

I know there are other ways of doing this like this very ugly and inefficient example:
my $string = "smasm"; my @array = split(//, $string); my %hash; foreach(@array) { $hash{$_} ++; } foreach(keys %hash) { if($hash{$_} > 1) { print "ERROR\n"; last; } }
But I would really like to do this with a single regex so that it easily portable to other languages.

Any hints or examples that might lead me in the right direction will be greatly appreciated.

Replies are listed 'Best First'.
Re: Regex (lookahead) Confusion
by bart (Canon) on Feb 05, 2004 at 20:57 UTC
    I think you're thinking of something like this:
    /^(?:([smtwhfa])(?!.*\1))*$/

      You are *so* the monk!

      The regex works because if the variable ength of the space (0-full line) before matching the backreference. And lookaheads can be variable length.

      This is essentially the converse approach to what I was thinking trying to solve the problem. Someday...

      while (<DATA>) { chomp; if ( /^(?:([smtwhfa])(?!.*\1))*$/ ) { print "$_ : OK\n"; } else { print "$_ : Not OK\n"; } } __DATA__ swma smqa smsa fhtm ttma t2ms __END__ swma : OK smqa : Not OK smsa : Not OK fhtm : OK ttma : Not OK t2ms : Not OK

      --
      Allolex

        So let's see if I understand this and please correct me if I'm wrong.

        The ?: means that the parens are just for grouping and will load any matches into $1, $2, etc.
        Then the character class in parens
        and then the lookahead...
        ?! means that it is a negative lookahead
        .* means any character 0 or more times (very greedy)
        \1 is related to the character matched from the character class
        It's the .* that is throwing me off. To me, that looks like it would match a single character repeated any number of times but not separated duplicate characters. I just don't understand it ... yet. I will keep looking.
      Bart, you're a genius!! That works perfectly. Now I'll just have to figure out how and why it works.
        perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(q/^(?: +([smtwhfa])(?!.*\1))*$/)->explain' The regular expression: (?-imsx:^(?:([smtwhfa])(?!.*\1))*$) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- ^ the beginning of the string ---------------------------------------------------------------------- (?: group, but do not capture (0 or more times (matching the most amount possible)): ---------------------------------------------------------------------- ( group and capture to \1: ---------------------------------------------------------------------- [smtwhfa] any character of: 's', 'm', 't', 'w', 'h', 'f', 'a' ---------------------------------------------------------------------- ) end of \1 ---------------------------------------------------------------------- (?! look ahead to see if there is not: ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \1 what was matched by capture \1 ---------------------------------------------------------------------- ) end of look-ahead ---------------------------------------------------------------------- )* end of grouping ---------------------------------------------------------------------- $ before an optional \n, and the end of the string ---------------------------------------------------------------------- ) end of grouping ----------------------------------------------------------------------
      I think that's less efficient than: /^(?=[smtwhfa]*$)(?!.*(.).*\1)/ but I'm too busy right now to find out.
      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
        Again, another genius here. The regex you supplied also works perfectly. As far as efficiency goes, I'm not sure it will matter much in this situation. I am interested though in how you would go about testing speed and efficiency for something like this. I'm sure it's not as simple ( and unreliable) as loading up 1000's of strings and timing execution with a watch.

        YAPE::Regex is cool man!
      Bart's regex does the trick. I was thinking of a lazy approach using sort - slower, but easier for the regex neophytes to get a handle on:
      #!/your/perl/here use warnings; use strict; while (<DATA>) { chomp; my $sort = join '', sort split ''; ( ( $sort =~ /^[afhmstw]+$/ ) # only these chars and ( $sort !~ /([afhmstw])\1/ ) ) # no repeats ? print "<$_> OK\n" : print "<$_> Not OK\n"; } __DATA__ smsa smta stmwhas BADsmtaEXAMPLE __END__ <smsa> Not OK <smta> OK <stmwhas> Not OK <> Not OK <BADsmtaEXAMPLE> Not OK

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Regex (lookahead) Confusion
by Roger (Parson) on Feb 05, 2004 at 20:53 UTC
    while (<DATA>) { chomp; if (! /([smtwhfa])(?=.*?\1)/) { print "$_ : OK\n"; } else { print "$_ : Not OK\n"; } } __DATA__ smsa smta stmwhas
    And the output -
    smsa : Not OK smta : OK stmwhas : Not OK

    Update: Ah! Thanks people for pointing out that my regex didn't check for non-allowed characters. bart's solution is perfect.

      This is a really good effort, but it fails if characters not matching the given character set are included in the string.

      Unfortunately, I don't have a solution, either because in my mind, there would have to be a lookbehind on a string of variable length, which (of course) will not work.

      #!/usr/bin/perl use strict; use warnings; while(<DATA>) { chomp; if (! /([smtwhfa])(?=.*?\1)/) { print "$_ : OK\n"; } else { print "$_ : Not OK\n"; } } __DATA__ swma smqa smsa fhtm ttma t2ms __END__ swma : OK smqa : OK smsa : Not OK fhtm : OK ttma : Not OK t2ms : OK

      --
      Allolex

      Not quite what the OP wanted. It was required that the word was completely made of allowed characters only. Try this:

      while (<DATA>) { chomp; if (! /([smtwhfa])(?=.*?\1)/) { print "$_ : OK\n"; } else { print "$_ : Not OK\n"; } } __DATA__ smsa smta stmwhas BADsmtaEXAMPLE

      And the output

      smsa : Not OK smta : OK stmwhas : Not OK : OK BADsmtaEXAMPLE : OK

        Except that this appears to be exactly the same code as Roger used.

        --
        Allolex

      Roger, can you please explain this a bit to us mere regex mortals? Thanks!
        if (! /([smtwhfa])(?=.*?\1)/) { | | | | | +----- followed by this character | +--------------- find any of these characters +-------------------- but take the negative of the match

        bart's regex below is much better, see his reply and follow-up's.

Re: Regex (lookahead) Confusion
by Zaxo (Archbishop) on Feb 05, 2004 at 21:06 UTC
    ...may contain 1 or more of the characters smtwhfa in any order.

    That means you want '+' instead of '*' for the quantifier.

    You can make sure there are no repetitions with a hash over the characters:

    if ( $string =~ /^[smtwhfa]+$/ and length $string == do { my %hsh; @hsh{split //, $string} = (); keys %hsh; } ) { # everything's ok }
    That uses the uniqueness of hash keys to enforce your requirement.

    You can do that with regex lookahead by a negated match, $string !~ /([smtwhfa])(?=.*\1)/ after the match for the character class. That returns true if there is no match for a repeated character (untested).

    After Compline,
    Zaxo

      Thanks. You're right, using the * instead of the + allows the regex to return true if the string is blank. I did want to be sure that string contained at least one of the allowed characters.
      $string =~ /^(?:([smtwhfa])(?!.*\1))+$/
Re: Regex (lookahead) Confusion
by Fletch (Bishop) on Feb 05, 2004 at 21:00 UTC
    sub no_dups { my $in = join( '', sort split '', shift ); ( my $out = $in ) =~ y/smtwhfa/smtwhfa/s; $out =~ y/smtwhfa//cd; print "no dupes $in\n" if length $in == length $out; } no_dups( 'afhwtms' ); no_dups( 'smtthfa' ); no_dups( 'smatwhfa' ); no_dups( 'afhwtmsz' );

    Update: Ooop, didn't consider characters not in the allowed alphabet. Added second y/// to correct.

Re: Regex (lookahead) Confusion
by halley (Prior) on Feb 06, 2004 at 14:49 UTC

    Think of the User

    I don't know if you're trying to balk at user-supplied information, or not, but just in case, I thought I'd rant for a minute.

    I am thoroughly annoyed at any web form which gives me an error like the following:

    Credit Card Numbers cannot contain spaces or dashes. Please enter your card number again in the proper format.

    It makes me want to vote the rascals out, whenever an online tax form quibbles about dashes in government identifiers.

    Your Social Security Number has three groups of digits, separated by dashes. To ensure your entry is correct, please re-enter your number in the format, ###-##-####.

    I want to hang up on telephone systems when they whisper this nonsense in my ear:

    You must first dial a One before your number. Please hang up, and dial your number again, beginning with a One, then your party's area code, then your party's number.
    Or conversely,
    You do not need to dial a One before your number. Please hang up, and dial your number again, beginning with your party's area code, then your party's number.
    The security camera is the only thing stopping me from punching an ATM which requires me to enter the extra fraction zeros, when the machine is only equipped to dispense $20 bills.
    Please enter the amout you wish to withdraw: $___.__ Your amount must be a multiple of $20.00.

    If your system can accept the input as it is, then ACCEPT THE INPUT.

    If your system can weed out the obviously extraneous information, then weed it out, and ACCEPT THE INPUT.

    If your system can weed out the extraneous information, but it might not be what the user intended, then weed it out, and ACCEPT THE INPUT. In critical cases, confirm the input with an alternative display, and allow them to re-enter the information if it was interpreted incorrectly.

    If your system can detect the difference between extraneous and possibly mistaken input, then show the user how your code interpreted their input, and ask for confirmation.

    Days of the week? m,t, w-f You have entered: Monday Tuesday Wednesday Friday You may have entered something I don't recognize. Did I understand your schedule correctly? [ Yes ] No

    Lastly, I have seen people try to associate 'H' for Thursday but never 'A' for Saturday. I don't think people will remember either one reliably, especially if they're not native English writers. Think of what your users want to see and specify, not what would fit in a single letter uniquely. These clever one-letter abbreviations should just be an implementation detail, not a user interaction plan.

    Writing user interaction code is more labor-intensive on your part, and more demanding in terms of design consideration. However, you're going to foist any of these frustrations on each one of your users, every time they use it. In aggregate, that's a lot of frustration to dispense for just a few minutes of design time you save.

    Be forgiving in what you accept, and strict in what you produce.

    --
    [ e d @ h a l l e y . c c ]

      Hi Ed! I agree with your comments on webusers. Rant On!

      However, I don't see a webuser entering in config information in a script to modify say a squid config file or a squidGuard config file which is unforgiving to say the least. I would love to be able to have the time to write a script to do just that. If I want to allow modifications to a system configuration file then I have to abide by the configuration rules of the application/service I am changing.

      For me to see a "smtwhfa" in a posting I immediately think about a script to control a configuration file. I could never see asking "any" user to come up with those abreviations for the weekdays. Having been around systems for a while the above mentioned letters are very familiar. I am sure the OP has considered allowing a user to enter things such as Monday, Tuesday, Wednesday etc. then converting it to m,t,w, or even mtw... But I am not going to assume that his request is for a web application.

      If I am going to update a config file that has the ability to accept "smtwhfa" (or whatever the last letter may be) I have to write the program to be able to accept that and maybe s-f, or Mon-Fri, or (Monday, Wednesday-Saturday), etc, etc, etc. This list of and combination of could go on forever.

      I am not too sure that this posting is for a web application for any "user" to be disturbed by the output as much as it is for an incredible regex problem that a lot of people (including myself) have encountered. As for rules to make sure that a user enters in information that you can easily process I don't see that as all bad. I use rules more for myself than I do the user. Many times I have been shocked as an end user entered something into an online form that without some sort of rules resulted in catastrophe. Rules can be a good thing as it allows us not to get caught with our pants down. You can't under estimate the user..

      Thinking of the user is what prompted this post. OK, it was really about my lack of experience when it comes to lookahead assertions. However, I was thinking that the user may not be as familiar with the config files that will be managed by this application. Instead of forcing the user to know all the syntax rules for the config files, I check to make sure the syntax is correct. If it is not correct, I don't think I would want to take a guess at what the user meant. I'm not a mind reader. And if I simply accept the user input without being sure that the syntax and parameters are correct, there is a high probability that the program will fail due to the incorrect config information.

      Let's say you dial a phone number and enter a 1 then area code then the number. You then get the recorded response informing you that you don't have to dial a 1 first. My guess is that you are thinking the phone company should understand that you don't know how to use a telephone and are not willing to learn. Therefore they should just connect you to the number you dialed. OK. Perhaps you misdialed the area code and you really did have to dial a one. Well now the phone company has connected you but not to the number you dialed and not to the number you intended to dial. Now two people are frustrated (you and the person you just disturbed for no reason).

      I think it is very important that we understand most people will say the same thing in different ways. At the same time, we must understand that many software applications only understand the information if it is presented in a specific manner. I'm, sorry if this is offensive to you but I will never trust the user input implicitly and will always strive to be sure the data conforms to its pre-defined parameters.

        If the phone system knows I made an error, and knows the nature of the error, it should recover for my error. If it didn't know these things, then why did it explain my error so accurately?

        The structure of telephone systems is to understand how many more digits to expect from the user at any time. Area codes don't start with 0 or 1. This sets up the system to understand the next three digits. Some locales can't distinguish a new-style area code from a historically significant local prefix. The error messages are completely disregarding the user's goals, and instead forcing the user to learn the artificial local dialing hardware implementation. These vary from region to region, so a traveler currently has to cope with varying machine requirements, which have nothing to do with the goal of making a connection.

        I didn't say you HAD to recover from the user's errors by fixing it. Unlike the examples, it's not always possible. But you can give a more intelligent answer than "start over from scratch, complying with this rule you didn't know, you stupid user."

        Since you're writing the code that helps users accomplish a correct configuration, you know both what the machine expects, and what the user wants to express. You're the ONLY person in this chain who could provide a more intelligent interaction. You can make the computer provide tools to accomplish goals for a wider range of users.

        Don't expect the user to learn everything you did. If they mistype something, then help them accomplish their goals, rather than helping them learn the machine details you had to learn.

        --
        [ e d @ h a l l e y . c c ]

Re: Regex (lookahead) Confusion
by valentin (Abbot) on Feb 05, 2004 at 21:04 UTC
    does this do what you want to?
    my $string = "smasm"; if ( $string =~ /(.).*\1$/ ) { print "$1" ; #match a double char }
Re: Regex (lookahead) Confusion
by MCS (Monk) on Feb 06, 2004 at 02:58 UTC

    It doesn't pertain to your question but the code:

    $string =~ /^[smtwhfa]*$/

    The * is zero or more, not one or more. For one or more use the + sign. (So it would be /^[smtwhfa]+$/) Unless instead of saying

    "A user supplied string may contain 1 or more of the characters smtwhfa in any order."
    you really meant zero or more. Just a frequent mistake some people make that I thought I would point out.

      Like Zaxo said above, you are right about the difference between the * and the +. However, I did mean the string must contain 1 or more characters of the allowed class. I just got my quantifier wrong. Thanks to both you and Zaxo for pointing out the mistake.