PerlMonks  

regexp: Mind-boggling negative assertions...

by Robidu (Acolyte)
on Aug 16, 2015 at 23:04 UTC

Robidu has asked for the wisdom of the Perl Monks concerning the following question:

Greetings!

I'm attempting to check a URL with a regexp to see whether or not it matches certain criteria (that is, an extremely strict check that is supposed to let only particular URLs pass and reject the rest). However, for some reason it doesn't work as expected.

Here's the code sample in question:

    #!/usr/bin/perl -w
    use strict;

    my $referrer = 'https://www.robidu.de/';
    if($referrer =~ /^https?:\/\/(?!www\.)robidu\.de\//)
    {
        print "Match!\n";
    }
    else
    {
        print "No match!\n";
    }

The URL as given in this sample correctly causes a nonmatch. Removing "www." from the URL in turn correctly gives a match of the regexp.

However, when the "www." in the example is replaced by, for example, "forum." and the check run again, it indicates a nonmatch (and subsequently lets the URL pass) instead of matching and thereby rejecting it.

So what could possibly be going wrong here? Any help would be greatly appreciated.

Replies are listed 'Best First'.
Re: regexp: Mind-boggling negative assertions...
by stevieb (Canon) on Aug 17, 2015 at 00:02 UTC

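    This is because a negative lookahead ((?!...)) is a zero-width assertion: it doesn't capture anything, nor does it consume any characters. It only checks that whatever follows "://" isn't 'www.', and the regex then still requires 'robidu.de/' to start at that very same position. With 'forum.robidu.de' the lookahead succeeds, but 'forum.' is never consumed, so the literal 'robidu' is compared against 'forum.' and the whole regex fails.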
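To illustrate the zero-width behaviour described above, here is a minimal sketch. The helper name end_of_match is made up for this example; it reports the offset at which a successful match ends (via the built-in @+ array), which shows that the lookahead itself advances nothing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the offset where the match ended,
# or undef if the regex did not match at all.
sub end_of_match {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $+[0] : undef;
}

my $url = 'https://forum.robidu.de/';

# Lookahead only: it succeeds right after "://", and the match ends at
# offset 8 (the length of "https://"). "forum." was inspected, not consumed.
my $after_lookahead = end_of_match($url, qr/^https?:\/\/(?!www\.)/);
print "lookahead only: match ends at offset $after_lookahead\n";

# The original regex: after the lookahead, "robidu" must start at offset 8,
# but the string has "forum." there, so the match fails.
my $full = end_of_match($url, qr/^https?:\/\/(?!www\.)robidu\.de\//);
print "original regex: ", (defined $full ? "match" : "no match"), "\n";
```

Run as is, this prints an end offset of 8 for the lookahead-only pattern and "no match" for the original regex.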

    Try this, and go from there

    /^https?:\/\/.*?\.?(?<!www\.)robidu\.de\//;

    It matches anything (non-greedily) up to an optional dot, but if what immediately precedes the domain name is "www.", there's no match. This uses a negative lookbehind, which is also a zero-width assertion (same as a negative lookahead); the difference is that the preceding .*? has already consumed whatever comes before the domain name, so the lookbehind only has to inspect the four characters right before it.
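A quick way to see what the lookbehind version accepts is to run it over a handful of URLs. The sketch below uses the regex from this post unchanged; the list of URLs, including the foreign host example.com, is an assumption chosen for illustration and not a set of cases supplied by the original poster.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The lookbehind regex from the post above, unchanged.
my $re = qr/^https?:\/\/.*?\.?(?<!www\.)robidu\.de\//;

# Returns 1 if the URL matches the regex, 0 otherwise.
sub matches {
    my ($url) = @_;
    return $url =~ $re ? 1 : 0;
}

for my $url (
    'https://www.robidu.de/',
    'https://robidu.de/',
    'https://forum.robidu.de/',
    'https://www.forum.robidu.de/',
    'https://example.com/robidu.de/',    # hypothetical foreign host
) {
    printf "%-32s %s\n", $url, matches($url) ? 'match' : 'no match';
}
```

Only the first URL fails to match. Note that the last one matches as well, because .*? may run past the host name into the path; if that matters, [^\/]*? in place of .*? would keep the pattern inside the host part.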

    With that said, if you're wanting to allow only certain names, you might consider something like this, instead of looking negatively:

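        my $r = 'https://forum.robidu.de/';
        my @allowed = qw(ww2 forum);
        my $re = join('|', @allowed);
        # Group the alternation so that ^https?:// and robidu\.de/ apply to
        # every alternative, and require the dot after the subdomain.
        print $r =~ /^https?:\/\/(?:$re)\.robidu\.de\//;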

    -stevieb

Re: regexp: Mind-boggling negative assertions...
by kevbot (Vicar) on Aug 17, 2015 at 00:45 UTC

    ++stevieb for helpful suggestions. Also, Anonymous Monk makes a good point that a list of test cases would be very helpful. That is, the code that stevieb posted works for the cases that you have mentioned in your post and replies...but it may fail on other cases that you have failed to mention here.

    This is a case where the YAPE::Regex::Explain module can be helpful (see item 9 in Basic debugging checklist).

    Here is the YAPE::Regex::Explain output for your original regex:

    Here is the YAPE::Regex::Explain output for stevieb's regex:

      Another really nice feature is use re 'debug';:

          #!/usr/bin/perl
          use warnings;
          use strict;
          use re 'debug';

          "this" =~ /(?<!x)i?/;

      Output:

          Compiling REx "(?<!x)i?"
          Final program:
             1: UNLESSM[-1] (7)
             3:   EXACT <x> (5)
             5:   SUCCEED (0)
             6: TAIL (7)
             7: CURLY {0,1} (11)
             9:   EXACT <i> (0)
            11: END (0)
          minlen 0
          Matching REx "(?<!x)i?" against "this"
             0 <> <this>   |  1:UNLESSM[-1](7)
             0 <> <this>   |  7:CURLY {0,1}(11)
                                EXACT <i> can match 0 times out of 1...
             0 <> <this>   | 11: END(0)
          Match successful!
          Freeing REx: "(?<!x)i?"

Re: regexp: Mind-boggling negative assertions...
by GrandFather (Saint) on Aug 17, 2015 at 06:40 UTC

    I don't see a need for a look ahead. Try:

        #!/usr/bin/perl
        use warnings;
        use strict;

        for my $test (
            'https://www.robidu.de/',
            'http://www.robidu.de/',
            'https://robidu.de/',
            'https://forum.robidu.de/'
        ) {
            if ($test =~ m~^https?://(?:www\.)?robidu\.de/~) {
                print "'$test' matched\n";
            } else {
                print "'$test' didn't match\n";
            }
        }

    Prints:

        'https://www.robidu.de/' matched
        'http://www.robidu.de/' matched
        'https://robidu.de/' matched
        'https://forum.robidu.de/' didn't match

    Premature optimization is the root of all job security
Re: regexp: Mind-boggling negative assertions...
by soonix (Canon) on Aug 17, 2015 at 08:20 UTC
    an extremely strict check that is supposed to only let particular URLs pass and reject the rest
    I think you could make your life much easier if you'd reverse the logic of your script, similar to:

        if ($referrer =~ m!^https?://(forum\.)?robidu\.de/!) {
            print "pass\n";
        } else {
            print "reject\n";
        }

    That way you can concentrate on what you want instead of what to avoid.
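Taking the "say what you want" idea one step further, a minimal sketch could pull the host out of the URL and look it up in an explicit allow-list, so the regex only has to find the host and no longer encodes the policy. The host names in %allowed_host and the sub name referrer_ok are assumptions for illustration, not something specified in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical allow-list; these host names are assumptions.
my %allowed_host = map { $_ => 1 } qw(robidu.de forum.robidu.de);

# Returns 1 if the referrer's host is on the allow-list, 0 otherwise.
sub referrer_ok {
    my ($referrer) = @_;

    # Capture everything between "://" and the first "/".
    my ($host) = $referrer =~ m{^https?://([^/]+)/}
        or return 0;

    # Host names are case-insensitive, so compare in lower case.
    return $allowed_host{ lc $host } ? 1 : 0;
}

for my $url (
    'https://forum.robidu.de/',
    'https://robidu.de/',
    'https://www.forum.robidu.de/',
    'ftp://forum.robidu.de/',
) {
    print "$url: ", (referrer_ok($url) ? "pass" : "reject"), "\n";
}
```

With this list the first two URLs pass and the last two are rejected. Changing the policy then means editing the list rather than reworking the pattern.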
Re: regexp: Mind-boggling negative assertions...
by Anonymous Monk on Aug 16, 2015 at 23:17 UTC
    if($referrer =~ /^https?:\/\/(?!www\.).*robidu\.de\//)

      If you don't like this solution, please give a list of URLs as test cases (maybe 10 or more) and clearly mark which should match (the complete regex) and which should fail.

      That's only a partial solution of the problem. "www.forum.robidu.de" still causes a nonmatch, although that is an address that is supposed to be rejected as well.

      It's supposed to be an extremely strict rule that lets pass only one particular address (for both http and https).

        A non-match *is* a rejection.

        Your English is too confusing. Please give an extensive list of URLs that cover all cases, and indicate whether they pass or fail.

        Of course "www.forum.robidu.de" is rejected/nonmatch. It is missing "http://".
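The kind of test list requested in this subthread might look like the sketch below. It rests on one loudly stated assumption about the intent, pieced together from the original post and the follow-up: only www.robidu.de (http or https) may pass, and a regex match means the URL is rejected. Under that assumption the lookahead has to cover the whole host name rather than just 'www.'. The names $reject_re and is_rejected are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMPTION: only www.robidu.de may pass; every other robidu.de host,
# including the bare domain, must match and is therefore rejected.
my $reject_re = qr{^https?://(?!www\.robidu\.de/)(?:[^/]*\.)?robidu\.de/};

# Returns 1 if the URL matches (is rejected), 0 if it passes.
sub is_rejected {
    my ($url) = @_;
    return $url =~ $reject_re ? 1 : 0;
}

# URL => expected result (1 = regex matches, i.e. rejected).
my @cases = (
    [ 'https://www.robidu.de/'       => 0 ],
    [ 'http://www.robidu.de/'        => 0 ],
    [ 'https://robidu.de/'           => 1 ],
    [ 'https://forum.robidu.de/'     => 1 ],
    [ 'https://www.forum.robidu.de/' => 1 ],
);

for my $case (@cases) {
    my ($url, $expected) = @$case;
    my $got = is_rejected($url);
    printf "%-6s %-30s %s\n",
        ($got == $expected ? 'ok' : 'NOT ok'),
        $url,
        ($got ? 'rejected' : 'passed');
}
```

URLs on entirely different hosts do not match this pattern at all, so they would need a separate check if they are supposed to be rejected too.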
