
Regex help needed

by ghosh123 (Monk)
on Apr 23, 2012 at 10:06 UTC ( #966564=perlquestion )
ghosh123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have a string which contains data about a person, delimited by '%'. The third field in the string is the 'ID' tag. I want to extract all the names that have an ID tag starting with 'A'. For example, the $data variable may contain:

$data = "Johnson%Andrew%AX321%Engineer";
$data = "Smith%John%BC142%Alberta";

I am using :

print "$2 $1\n" if m/^(.*?)%(.*?)%A/;

But this is giving me

Andrew Johnson ## output Ok
John%BC142 Smith ## Wrong

But I don't want this. The second half of the regex does not say "match up to the next % and then match 'A'". Instead it says "match up to the next % that is followed by 'A'". Hence for the second $data it goes past the second % and captures John%BC142 as the first name. Please help. Thanks.

Replies are listed 'Best First'.
Re: Regex help needed
by davido (Archbishop) on Apr 23, 2012 at 10:28 UTC

    print "$+{first} $+{last}\n" if m/^ (?<last> [^%]+ ) % (?<first> [^%]+ ) % (?<id> A [^%]+ ) /x;

    Really, the critical component is to use a negated character class to match anything that is not the delimiter, rather than relying on non-greedy quantifiers.

    If your Perl is too old to have named captures (they arrived in Perl 5.10), this is equivalent code that is more universally compatible:

    print "$2 $1\n" if m/^([^%]+)%([^%]+)%(A[^%]+)/;

    Update: Added /x and nicer formatting.
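
To see the difference between the two approaches side by side, here is a minimal sketch that runs the original non-greedy pattern and the negated-character-class pattern against both sample records from the question. The helper name extract_name is hypothetical and only used for illustration; it requires Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Sample records from the question: last%first%ID%other
my @records = (
    'Johnson%Andrew%AX321%Engineer',
    'Smith%John%BC142%Alberta',
);

# Hypothetical helper: returns "First Last" when the third field
# starts with 'A', and undef (in scalar context) otherwise.
sub extract_name {
    my ($record) = @_;
    return "$2 $1" if $record =~ /^([^%]+)%([^%]+)%A/;
    return;
}

for my $record (@records) {
    # The non-greedy pattern from the question, for comparison.
    my $lazy    = $record =~ /^(.*?)%(.*?)%A/ ? "$2 $1" : '(no match)';
    my $negated = extract_name($record) // '(no match)';
    print "$record\n  non-greedy: $lazy\n  negated:    $negated\n";
}
```

With [^%]+ a capture can never cross a delimiter, so the 'A' can only be tested at the start of the third field.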


Re: Regex help needed
by JavaFan (Canon) on Apr 23, 2012 at 11:10 UTC
    I wouldn't bother with a regexp. I'd use split (technically, that uses a regexp as well):
    my ($last, $first, $id) = split '%';
    print "$first $last\n" if substr($id, 0, 1) eq 'A';
      "Technically" split requires a regular expression as its first argument. String patterns appear to be an undocumented extension.
        Huh, what are you talking about? Using strings as patterns is fine. Remember, this is Perl. If Perl expects a pattern somewhere, whatever you put there is a pattern. What you call an "undocumented" extension is nothing different from:
        $foo = "3"; $bar = 4 + $foo;
        or even:
        my $pattern = "foo|bar"; say "Match" if $str =~ $pattern;
        In my snippet, '%' is a pattern by virtue of it being the first argument of split, not because of some "undocumented extension".

        Note also this snippet from the split documentation:

        As a special case, specifying a PATTERN of space (' ') will
        split on white space just as "split" with no arguments does.
        Note how the documentation talks about a pattern, while using quotes to delimit said pattern.
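
A small sketch of what "a string is a pattern" means in practice for split. The sample strings other than the record from the question are made up for illustration.

```perl
use strict;
use warnings;

my $record = 'Johnson%Andrew%AX321%Engineer';

# A string first argument is used as a pattern, just like /%/.
my @by_string = split '%', $record;
my @by_regex  = split /%/, $record;
print "same fields\n" if "@by_string" eq "@by_regex";

# Because the string is a pattern, metacharacters keep their meaning.
# '.' matches any character, so every character is a separator, every
# field is empty, and the trailing empty fields are all dropped.
my @dot_string  = split '.',   'a.b.c';
my @dot_escaped = split /\./,  'a.b.c';
print scalar(@dot_string), " vs ", scalar(@dot_escaped), " fields\n";

# The documented special case: a single-space string splits on runs of
# whitespace and skips leading whitespace.
my @words = split ' ', '  Andrew   Johnson ';
print join('|', @words), "\n";
```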
Re: Regex help needed
by Anonymous Monk on Apr 23, 2012 at 10:16 UTC
    use 5.010;
    for my $line (qw( Johnson%Andrew%AX321%Engineer Smith%John%BC142%Alberta )) {
        my @fields = split /%/, $line;
        say "$fields[1] $fields[0]" if $fields[2] =~ /^A/;
    }
Re: Regex help needed
by ansh batra (Friar) on Apr 23, 2012 at 11:09 UTC
    $data = "Johnson%Andrew%AX321%Engineer";
    if ($data =~ /.*%.*%A/) {
        @arr = split('%', $data);
        print "$arr[0] $arr[1]";
    }
    Assumption: both first name and last name will always be present and separated by '%'.

      anish_batra: Here is why your solution is broken in almost the same way as the original poster's code. This will be a slight oversimplification.

      The regexp engine loves to find matches. It's its duty to find them. You are giving it all the tools it needs to match the following string:

      $data = "Johnson%Andrew%BX321%Accountant";

      Here's why:

      • .* will greedily match as big of a string as possible (or nothing at all), so long as a '%' character comes next. In this case, on first pass, it will match "Johnson%Andrew%BX321", stopping just before the "%Accountant" portion of the string...
      • Next, the RE engine moves on to the second .*% term. Oh oh.... for this to match, it needs to backtrack to the first subexpression again.
      • Back to the first sub-expression... The original .*% has been told it was too greedy. Now it tries again and this time matches, "Johnson%Andrew%".
      • Now the second subexpression is allowed to match "BX321%"
      • Finally, 'A' is matched from "Accountant".
      • The regexp engine has done its job: It found a way to make "Johnson%Andrew%BX321%Accountant" match.

      But that's not what the OP actually wanted to have happen. He wanted strings like "Johnson%Andrew%AX321%Accountant" to pass, and "Johnson%Andrew%BX321%Accountant" to fail. You simply showed him another way to get the wrong result again. And, in fact, your solution results in some backtracking within the RE engine, so not only does it provide false positives, it does so inefficiently.
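
The false positive described above is easy to reproduce. This sketch checks the unanchored /.*%.*%A/ pattern against an anchored pattern built from negated character classes (the form suggested elsewhere in this thread); the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Each entry: [ record, 1 if it should be accepted, 0 if not ]
my @tests = (
    [ 'Johnson%Andrew%AX321%Accountant', 1 ],
    [ 'Johnson%Andrew%BX321%Accountant', 0 ],
);

my $loose    = qr/.*%.*%A/;           # the pattern being criticized
my $anchored = qr/^[^%]+%[^%]+%A/;    # fields cannot contain '%'

for my $test (@tests) {
    my ($record, $want) = @$test;
    printf "%-33s want=%d loose=%d anchored=%d\n",
        $record, $want,
        ($record =~ $loose    ? 1 : 0),
        ($record =~ $anchored ? 1 : 0);
}
```

The loose pattern accepts both records because the engine is free to line up the final %A with "%Accountant"; the anchored pattern can only test the character right after the second delimiter.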

      Either you didn't understand the question, or you did understand it, but didn't test your code. There's no shame in considering a solution that doesn't work. The problem is when it gets posted. This is the third or fourth answer in a row that you've provided which fails to meet the OP's simple requirements. My suggestion: always test your code with a variety of possibly valid data-sets before posting answers... at least until accurate responses become second nature. To be honest, I'm still hesitant to post regexp responses until after I've tested them -- they're so easy to get wrong. But the lesson should be: test your solutions before posting.

      The Monastery welcomes learners. That's one of the biggest reasons we're here. We all started somewhere. And answering questions is a great way to consider new problems and to learn from them. I'm not suggesting that you refrain from answering. I'm suggesting (and as a fellow PerlMonk asking) that you test your code before posting it. One doesn't learn much from posting broken solutions. One learns by studying how to create a valid solution.

      Furthermore, it does your fellow PerlMonks a disservice posting broken code. Sure, there's more than one way to do it. But another newcomer may not immediately recognize that your solutions have bugs, may use them, and may find out the hard way. That's not good for the user, for Perl, or for the Perl community.

      One suggestion I have... if you're unsure about a solution, you might even consider chatting about it in the CB before posting it. Put it in your scratchpad and say, "Is [pad://anish_batra] a valid solution to [id://123456]?" If it's a good idea, post it. If it's wrong, the folks in the chatterbox will probably gladly explain why.
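
One way to act on the "test before posting" advice is a tiny table-driven check with the core Test::More module. This is only a sketch: the candidate pattern is the negated-character-class form from this thread, and the test records beyond those quoted in the thread are made up for illustration.

```perl
use strict;
use warnings;
use Test::More;

# Candidate pattern under test.
my $candidate = qr/^([^%]+)%([^%]+)%A/;

# Each case: [ input record, expected "First Last", or undef for no match ]
my @cases = (
    [ 'Johnson%Andrew%AX321%Engineer',   'Andrew Johnson' ],
    [ 'Smith%John%BC142%Alberta',        undef ],
    [ 'Johnson%Andrew%BX321%Accountant', undef ],
    [ 'Johnson%Andrew',                  undef ],   # too few fields
);

for my $case (@cases) {
    my ($record, $expected) = @$case;
    my $got = $record =~ $candidate ? "$2 $1" : undef;
    is($got, $expected, "record: $record");
}

done_testing();
```

Swapping a different pattern into $candidate and rerunning shows at once whether it handles the tricky records, such as an 'A' that appears only in the fourth field.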


        ... I'm still hesitant to post regexp responses until after I've tested them ...

        I'd go farther than that. My experience is that whenever I do not test a regex, no matter how simple it may seem, it's guaranteed to be wrong! It's like some kind of corollary of Murphy's Law. For me, regular expressions are the most counter-intuitive concept in CS (or in C-what-we-laughingly-refer-to-as-S). I think I have a fairly good understanding of regexes and a fair body of experience with them, but I would never trust my untested opinion.

        Caveat Programmor

      Unfortunately, that solution has a problem similar to that in the OPed code:

      >perl -wMstrict -le
       "for my $data (qw( Smith%John%BC142%Alberta Johnson%Andrew%AX321%Engineer )) {
            if($data =~ /.*%.*%A/) {
                my @arr = split('%', $data);
                print qq{$arr[0] $arr[1]};
            }
        }
       "
      Smith John
      Johnson Andrew
Re: Regex help needed
by Kenosis (Priest) on Apr 23, 2012 at 15:08 UTC

    This slightly modifies what you already had:

    print "$2 $1\n" if /^([^%]+)%([^%]+)%A/;

    Hope this helps!

    Update: Just now noticed that this is similar to a solution posted by davido. Will remember to have coffee *first*, read all, and then post...

Node Type: perlquestion [id://966564]
Front-paged by Arunbear