Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

regex negative lookahead behaviour

by shemp (Deacon)
on Jul 18, 2003 at 17:02 UTC ( [id://275679]=perlquestion: print w/replies, xml ) Need Help??

shemp has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use a negative lookahead assertion in a regex, and i don't understand what's going on with 1 form of the regex. I'm parsing snail mail addresses, particularly for horrible beasts like "456 4 1/2 MILE RD". For this address, the street number is 456, and the street name is "4 1/2 MILE RD" (four and one-half mile road). If the word "MILE" is not there, then the 1/2 is generally treated as a unit number. (im not going to go into the complete details) Suffice to say that i'm looking to detect a 1/2, followed by something other that MILE. So, a first attempt at a regex is like this:
if ($address =~ /(1\/2)\s*(?!MILE)/i) { ...
If $address = "4 1/2 MILE RD" this evaluates to true, but i dont see why. There is a "1/2", followed by whitespace. Then there is "MILE", which evaluates to false because of the negative lookahead.

If i change the regex so the whitespace is "\s+", then the match evaluates to false, as i expected. But both * and + are greedy, so all whitespace should be sucked up either way, but with *, it should be optional. And i will have cases where there isnt any whitespace.

I used the following, which works like i want it to:
if ($address =~ /(1\/2)(?!\s*MILE)/i) { ...
Myself and another programmer have absolutely no idea why the first regex doesnt work as planned, so any insight would be greatly appreciated.

BTW: this is a reduced case, yes we have other boundary conditions, i'm just perplexed about the regex behaviour

Replies are listed 'Best First'.
Re: regex negative lookahead behaviour
by broquaint (Abbot) on Jul 18, 2003 at 17:14 UTC
    This is due to the nature of * and while it is a greedy quantifier, it is also quite a lazy quantifier. In this case \s* doesn't match any space, since it doesn't have to, and subsequently " MILE" is matched which doesn't match "MILE" so it's a successful match. Hopefully this quick oneliner and it's output should make that explanation clearer
    shell> perl -Mre=debug -e '$_="1/2 MILE"; m{ 1/2 \s* (?!MILE) }x' Freeing REx: `,' Compiling REx ` 1/2 \s* (?!MILE) ' size 11 first at 1 1: EXACT <1/2>(3) 3: STAR(5) 4: SPACE(0) 5: UNLESSM[-0](11) 7: EXACT <MILE>(9) 9: SUCCEED(0) 10: TAIL(11) 11: END(0) anchored `1/2' at 0 (checking anchored) minlen 3 Guessing start of match, REx ` 1/2 \s* (?!MILE) ' against `1/2 MILE'.. +. Found anchored substr `1/2' at offset 0... Guessed: match at offset 0 Matching REx ` 1/2 \s* (?!MILE) ' against `1/2 MILE' Setting an EVAL scope, savestack=3 0 <> <1/2 MILE> | 1: EXACT <1/2> 3 <1/2> < MILE> | 3: STAR SPACE can match 1 times out of 32767... Setting an EVAL scope, savestack=3 4 <1/2 > <MILE> | 5: UNLESSM[-0] 4 <1/2 > <MILE> | 7: EXACT <MILE> 8 <1/2 MILE> <> | 9: SUCCEED could match... failed... 3 <1/2> < MILE> | 5: UNLESSM[-0] 3 <1/2> < MILE> | 7: EXACT <MILE> failed... 3 <1/2> < MILE> | 11: END Match successful! Freeing REx: ` 1/2 \s* (?!MILE) '

    HTH

    _________
    broquaint

Re: regex negative lookahead behaviour
by BrowserUk (Patriarch) on Jul 18, 2003 at 17:08 UTC

    The problem is that whilst \s* is greedy, it will also match zero spaces, which means that '1/2' followed by zero spaces is not followed by 'MILE', because it is followed by ' MILE' which satisfies the condition, so the expression is true.

    The only thing that overrides a regexes natural greediness is it's desire to achieve a match. If it can, it will, and it can, so it does:)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

      The only thing that overrides a regexes natural greediness is it's desire to achieve a match.
      Normal regexes are greedy, but not selfish; they are willing to give characters back if it's for the greater good.

      But you can use the (?>pattern) construct to create "selfish" regexes: they only try to match once at every position. If the pattern inside is greedy, it will never give up chars to make the entire regex match; if the pattern is lazy, it will never match more than the absolute minimum of chars at that position.

      So "1/2 MILE" =~ m[1/2(?>\s*)(?!MILE)] should work.

Re: regex negative lookahead behaviour
by sauoq (Abbot) on Jul 18, 2003 at 17:15 UTC

    You have to remember that the regex engine tries as hard as it can to make the match work.

    "4 1/2 MILE RD" =~ m!(1/2)\s*(?!MILE)!;

    First, the engine slurps up all the space after the "1/2" with the \s* and finds that it doesn't match because of your assertion. Next, it backtracks by slurping all but one space with the \s*. At that point, your assertion "MILE" doesn't match " MILE" (with a leading space) so it is successful.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: regex negative lookahead behaviour
by flounder99 (Friar) on Jul 18, 2003 at 18:51 UTC
    I think your regex question was answered quite well but I wanted to remind you of some other hairy boundary conditions:
    1. A friend of mine lived in a duplex when he was in college and his address was "36 1/2 Garfield St." So if he lived on "4 1/2 MILE RD" his address would have been "36 1/2 4 1/2 MILE RD" -- YIPES!!!
    2. Is there a "1/2 MILE RD"? A duplex there would have addresses like "36 1/2 1/2 MILE RD" -- Double YIPES!!!
    3. What if someone builds a duplex on "MILES RD" What would your regex do with "36 1/2 MILES RD"? -- did i mention YIPES!!!
    You probably only have a fixed number of records to process and my suggestions are irrelevant but trying to process addresses in general is very hard.

    --

    flounder

      Yes, we have actually seen some pretty ugly stuff like you mention. Equally bad is "1/2 street" or even "1/8 street". I work for a private company that consolidates public record data for about 1/3 of wisconsin (and growing). Unfortunately, addresses are quite ugly, and we need to build special cases for junk like this thread is about.

      We currently have a few thousand lines of perl to parse addresses, and it's integrated with a CASS certified (U.S. postal certification) address correction software, but it's still a big pain.

      The worst part about the whole deal is that the government raw data is sometimes truly ambiguous, such as "814 E main st". There's generally no good way to tell whether the "E" should be a unit or a street directional. So then we try both cases against the address correction software, and hopefully only 1 comes back as a match.

      Not nearly as bad as a few years ago, when i was maintaining legacy C++ code to try to do this!

      Yes we do only have a fixed number of records, currently about 3 million.
Re: regex negative lookahead behaviour
by dws (Chancellor) on Jul 18, 2003 at 19:34 UTC
    I'm parsing snail mail addresses, particularly for horrible beasts like "456 4 1/2 MILE RD".

    Meta advice:

    I've been involved with two projects over the years that have had to parse names and addresses. In both cases, what we ended up with was a system that got ~98% right automatically, then kicked out the remaining 2% for vetting by a human.

    It's a diminishing returns problem: If you model the economics, at some point you have to stop pouring your effort into matching the remaining pool of difficult addresses, and let a human being do it.

      Unfortunately, dws's approach isn't always possible.

      The system I work on handles significant volumes of addresses, using dedicated (commercial) software to handle the identification and validation of that information.
      There is inhouse processing to help smooth things out, but not every problem can be catered for.

      The volumes involved prevent it being practial for humans to process the problem records (and in fact some of those problem records are a direct result of human input).

      Even if the volumes were low enough for humans to be able to process exceptions, humans can't get it right all of the time.
      This might be because of a lack of information, poorly laid of information or just human error.

      As shemp describes, even reference data used in such validation systems isn't perfect, and this sets the upper limit to what you can reasonably expect to acheieve.

      I'd say that you can never expect to get things 100% right, and it might end up being cheaper and/or easier to accept a certain error rate.
      Of course, your client may not accept this, but that's a whole other problem :-)

      Cheers.

      BazB


      If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
      That way everyone learns.

Re: regex negative lookahead behaviour
by jonadab (Parson) on Jul 19, 2003 at 13:46 UTC

    This specific case has been answered for you quite adequately, but I just wanted to point out that you could have figured this out by looking at exactly WHAT the regex was matching. If you'd written a test script to quote and output what was matched, you'd have seen that the space wasn't matched, which might have led you to the answer faster than waiting for the monks to reply.


    Quidquid latine dictum sit altum viditur.
Re: regex negative lookahead behaviour
by princepawn (Parson) on Jul 18, 2003 at 18:10 UTC
    The only thing that overrides a regexes natural greediness is it's desire to achieve a match. If it can, it will, and it can, so it does:)
    How about putting a +after the quantifier?

    Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

      How about putting a +after the quantifier?

      Then it breaks as soon as someone typos an extra space into the address. So, you can only do that if you can assume spaces have been normalized and there will be only one... But, if you can make that assumption, then there'd be no reason for a quantifier at all. :-)

      -sauoq
      "My two cents aren't worth a dime.";
      

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://275679]
Approved by broquaint
Front-paged by diotalevi
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (5)
As of 2024-04-19 13:25 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found