regexp: Mind-boggling negative assertions...

Robidu has asked for the wisdom of the Perl Monks concerning the following question:

Greetings!

I'm attempting to check a URL against a regexp to see whether it meets certain criteria (that is, an extremely strict check that is supposed to let only particular URLs pass and reject the rest). However, for some reason it doesn't work as expected.

Here's the code sample in question:

#!/usr/bin/perl -w

use strict;

my $referrer = 'https://www.robidu.de/';

if($referrer =~ /^https?:\/\/(?!www\.)robidu\.de\//)
  {
  print "Match!\n";
  }
else
  { 
  print "No match!\n";
  }

The URL as given in this sample correctly causes a nonmatch. Removing "www." from the URL in turn correctly gives a match of the regexp.

However, when the "www." in the example is replaced by, for example, "forum." and the check run again, it indicates a nonmatch (and subsequently lets the URL pass) instead of matching and thereby rejecting it.

So what could possibly be going wrong here? Any help would be greatly appreciated.


Re: regexp: Mind-boggling negative assertions... by stevieb (Canon) on Aug 17, 2015 at 00:02 UTC
This is because a negative lookahead (`(?!...)`) is a zero-width assertion: it doesn't capture anything, and it doesn't consume any characters. Once `//` has matched, the lookahead only checks that the next characters aren't `www.`; the pattern then still requires `robidu\.de\/` to start at that very same position, so any other prefix such as `forum.` makes the match fail.

Try this, and go from there:

/^https?:\/\/.*?\.?(?<!www\.)robidu\.de\//;

It matches anything, non-greedily, up to an optional dot, but if what immediately precedes the domain name is "www.", it's a no match. This uses a negative lookbehind, which is also a zero-width assertion (same as a negative lookahead); the difference is that it inspects the text before the current position, after `.*?` has consumed whatever prefix is there.

With that said, if you're wanting to allow only certain names, you might consider something like this instead of looking negatively:

my $r = 'https://forum.robidu.de/';
my @allowed = qw(ww2 forum);
my $re = join('|', @allowed);
# Group the alternation so that "|" cannot split the whole pattern in two
print $r =~ /^https?:\/\/(?:$re)\.robidu\.de\//;

-stevieb
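To see the zero-width point concretely, here is a small sketch that runs a few URLs through the original pattern, the same pattern with the lookahead simply deleted, and the lookbehind variant. The labels in `%patterns` and the `matches` helper are made up for this illustration, and the `www.forum` URL is borrowed from later in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Patterns from the thread; the hash labels are illustrative only.
my %patterns = (
    lookahead    => qr{^https?://(?!www\.)robidu\.de/},
    no_assertion => qr{^https?://robidu\.de/},
    lookbehind   => qr{^https?://.*?\.?(?<!www\.)robidu\.de/},
);

my @urls = (
    'https://robidu.de/',
    'https://www.robidu.de/',
    'https://forum.robidu.de/',
    'https://www.forum.robidu.de/',
);

# Returns 1 if the URL matches the pattern stored under $label, else 0.
sub matches {
    my ($label, $url) = @_;
    return $url =~ $patterns{$label} ? 1 : 0;
}

# Print one row per URL with the result of every pattern.
for my $url (@urls) {
    my @cells = map { "$_=" . matches($_, $url) } sort keys %patterns;
    printf "%-30s %s\n", $url, join(' ', @cells);
}
```

The `lookahead` and `no_assertion` columns agree on every URL: right after `//` the next characters must be `robidu.de/`, so the `(?!www\.)` check can never be the deciding factor. The `lookbehind` variant accepts any prefix that does not end in `www.` directly before the domain, which includes `www.forum.`.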
Re: regexp: Mind-boggling negative assertions... by kevbot (Vicar) on Aug 17, 2015 at 00:45 UTC
++stevieb for helpful suggestions. Also, Anonymous Monk makes a good point that a list of test cases would be very helpful. That is, the code that stevieb posted works for the cases that you have mentioned in your post and replies, but it may fail on other cases that you have failed to mention here.

This is a case where the YAPE::Regex::Explain module can be helpful (see item 9 in Basic debugging checklist).

Here is the YAPE::Regex::Explain output for your original regex: Read more... (2 kB)

Here is the YAPE::Regex::Explain output for stevieb's regex: Read more... (3 kB)
Re^2: regexp: Mind-boggling negative assertions... by stevieb (Canon) on Aug 17, 2015 at 00:59 UTC
Another really nice feature is `use re 'debug';`:

#!/usr/bin/perl
use warnings;
use strict;

use re 'debug';

"this" =~ /(?<!x)i?/;

Output:

Compiling REx "(?<!x)i?"
Final program:
   1: UNLESSM[-1] (7)
   3:   EXACT <x> (5)
   5:   SUCCEED (0)
   6: TAIL (7)
   7: CURLY {0,1} (11)
   9:   EXACT <i> (0)
  11: END (0)
minlen 0
Matching REx "(?<!x)i?" against "this"
   0 <> <this>  |  1:UNLESSM[-1](7)
   0 <> <this>  |  7:CURLY {0,1}(11)
                   EXACT <i> can match 0 times out of 1...
   0 <> <this>  | 11: END(0)
Match successful!
Freeing REx: "(?<!x)i?"
Re: regexp: Mind-boggling negative assertions... by GrandFather (Saint) on Aug 17, 2015 at 06:40 UTC
I don't see a need for a look ahead. Try:

#!/usr/bin/perl
use warnings;
use strict;

for my $test (
    'https://www.robidu.de/',
    'http://www.robidu.de/',
    'https://robidu.de/',
    'https://forum.robidu.de/'
) {
    if ($test =~ m~^https?://(?:www\.)?robidu\.de/~) {
        print "'$test' matched\n";
    } else {
        print "'$test' didn't match\n";
    }
}

Prints:

'https://www.robidu.de/' matched
'http://www.robidu.de/' matched
'https://robidu.de/' matched
'https://forum.robidu.de/' didn't match

Premature optimization is the root of all job security
Re: regexp: Mind-boggling negative assertions... by soonix (Canon) on Aug 17, 2015 at 08:20 UTC
"an extremely strict check that is supposed to only let particular URLs pass and reject the rest"

I think you could make your life much easier if you'd reverse the logic of your script, similar to:

if ($referrer =~ m!^https?://(forum\.)?robidu\.de/!) {
    print "pass\n";
} else {
    print "reject\n";
}

That way you can concentrate on what you want instead of what to avoid.
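One way to push the "say what you want" idea a step further is to pull the host out of the referrer and compare it against an explicit list, so that no sub-pattern can accidentally admit an extra label. This is only a sketch: the contents of `%allowed_host` and the name `referrer_ok` are assumptions, since the thread never states exactly which hosts should pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical allow-list; substitute the hosts that should really pass.
my %allowed_host = map { $_ => 1 } qw(robidu.de forum.robidu.de);

# Returns 1 if the referrer is http(s) and its host is exactly on the list.
sub referrer_ok {
    my ($referrer) = @_;

    # Capture everything between "//" and the first "/" as the host.
    my ($host) = $referrer =~ m{^https?://([^/]+)/}
        or return 0;

    # Host names are case-insensitive, so compare in lower case.
    return $allowed_host{ lc $host } ? 1 : 0;
}

for my $url (
    'https://forum.robidu.de/',
    'http://robidu.de/index.html',
    'https://www.forum.robidu.de/',
    'https://robidu.de.evil.example/',
) {
    my $verdict = referrer_ok($url) ? 'pass' : 'reject';
    printf "%-6s %s\n", $verdict, $url;
}
```

Because the comparison is an exact hash lookup, anything not spelled out in the list is rejected, including hosts that carry a port or user information. Whether that is strict enough, or too strict, depends on requirements the thread does not give.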
Re: regexp: Mind-boggling negative assertions... by Anonymous Monk on Aug 16, 2015 at 23:17 UTC
if($referrer =~ /^https?:\/\/(?!www\.).*robidu\.de\//)
Re^2: regexp: Mind-boggling negative assertions... by Anonymous Monk on Aug 16, 2015 at 23:29 UTC
If you don't like this solution, please give a list of URLs as test cases (maybe 10 or more) and clearly mark which should match (the complete regex) and which should fail.
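A table of cases like the one requested here can double as a tiny regression check. The sketch below uses only the outcomes the original post itself reports for the original pattern; the `@cases` layout and the `check_cases` helper are illustrative names, and the column that really matters (which URLs are supposed to pass) would have to come from the poster.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the original post.
my $re = qr{^https?://(?!www\.)robidu\.de/};

# Each entry: [ URL, 1 if the original post reports a match, 0 otherwise ]
my @cases = (
    [ 'https://www.robidu.de/',   0 ],
    [ 'https://robidu.de/',       1 ],
    [ 'https://forum.robidu.de/', 0 ],
);

# Returns the URLs whose actual result differs from the table.
sub check_cases {
    my ($pattern, @table) = @_;
    my @mismatches;
    for my $case (@table) {
        my ($url, $expected) = @$case;
        my $got = $url =~ $pattern ? 1 : 0;
        push @mismatches, $url if $got != $expected;
    }
    return @mismatches;
}

my @bad = check_cases($re, @cases);
my $report = @bad ? "Mismatch: @bad\n" : "All cases behave as described\n";
print $report;
```

Swapping in another candidate pattern and a second column of desired results turns the same loop into a quick way to compare the suggestions in this thread.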
Re^2: regexp: Mind-boggling negative assertions... by Robidu (Acolyte) on Aug 16, 2015 at 23:29 UTC
That's only a partial solution to the problem. "www.forum.robidu.de" still causes a nonmatch, although that is an address that is supposed to be rejected as well. It's supposed to be an extremely strict rule that lets only one particular address pass (for both http and https).
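The behaviour described here follows from where the lookahead is evaluated: only once, at the position right after `//`. It therefore vetoes any host that begins with `www.` and ignores everything further to the right. A small sketch of the suggested pattern (with the dot in `robidu.de` escaped); the `verdict` helper and the `forum.www` URL are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The suggested pattern, with the dot in "robidu.de" escaped.
my $re = qr{^https?://(?!www\.).*robidu\.de/};

# Returns 'match' or 'no match' for one URL.
sub verdict {
    my ($url) = @_;
    return $url =~ $re ? 'match' : 'no match';
}

for my $url (
    'https://forum.robidu.de/',      # host does not start with "www."
    'https://www.robidu.de/',        # lookahead sees "www." and fails
    'https://www.forum.robidu.de/',  # also starts with "www.", so it fails too
    'https://forum.www.robidu.de/',  # "www." is not at the start of the host
) {
    printf "%-30s %s\n", $url, verdict($url);
}
```

The first and last URLs match; the two in the middle do not, with or without the scheme in front.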
Re^3: regexp: Mind-boggling negative assertions... by Anonymous Monk on Aug 16, 2015 at 23:33 UTC
A non-match is a rejection. Your English is too confusing. Please give an extensive list of URLs that cover all cases, and indicate whether they pass or fail.
Re^4: regexp: Mind-boggling negative assertions... by Robidu (Acolyte) on Aug 17, 2015 at 00:06 UTC
Re^5: regexp: Mind-boggling negative assertions... by Anonymous Monk on Aug 17, 2015 at 00:09 UTC
Re^3: regexp: Mind-boggling negative assertions... by Anonymous Monk on Aug 16, 2015 at 23:54 UTC
Of course "www.forum.robidu.de" is rejected/nonmatch. It is missing "http://".


	PerlMonks