about word boundary in RE

anaconda_wly has asked for the wisdom of the Perl Monks concerning the following question:

Re: about word boundary in RE by hdb (Monsignor) on Apr 02, 2013 at 07:24 UTC
The whitespace between the words prevents the match. Account for it with `\s`: `my $test = "asd asd"; if ($test =~ /(.+\b)\s\1/) { print "Found $1 repeated\n"; }`
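To make the difference concrete, here is a minimal sketch that runs the same string against the pattern with and without `\s`. It assumes the pattern without `\s` is close to the one in the original question, which is not quoted in this thread; the helper name `repeated_word` is made up for illustration.

```perl
use strict;
use warnings;

# Assumption: /(.+\b)\1/ stands in for the pattern from the original
# question, which is not quoted in this thread.
my $test = "asd asd";

# Hypothetical helper: returns the captured text if the pattern
# matches, otherwise undef.
sub repeated_word {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $1 : undef;
}

my $without_space = repeated_word($test, qr/(.+\b)\1/);
my $with_space    = repeated_word($test, qr/(.+\b)\s\1/);

print "without \\s: ",
    (defined $without_space ? "matched '$without_space'" : "no match"), "\n";
print "with \\s:    ",
    (defined $with_space ? "matched '$with_space'" : "no match"), "\n";
```

Without `\s`, the backreference has to start exactly where the capture ended, and at that point the next character is the space, so the match fails.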
Re^2: about word boundary in RE by anaconda_wly (Scribe) on Apr 02, 2013 at 07:52 UTC
Isn't there a word boundary before the whitespace? If `(.+\b)` already matches, why do I need the `\s`? I thought the output would be the first "asd", but it is not.
Re: about word boundary in RE (use re 'debug') by Anonymous Monk on Apr 02, 2013 at 07:36 UTC
Add `use re 'debug';` to that short file and watch the regex engine do its thing in your console.
Re^2: about word boundary in RE (use re 'debug') by anaconda_wly (Scribe) on Apr 02, 2013 at 07:53 UTC
Good, but it does not seem easily readable to me. If `(.+\b)` already matches, why do I need the `\s`?
Re^3: about word boundary in RE (use re 'debug') by hdb (Monsignor) on Apr 02, 2013 at 08:00 UTC
`\b` is a zero-width match. It needs the space in order to recognize a word boundary, but it does not consume it. Therefore the pattern still has to match that space itself, for example with `\s`.
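A small sketch of what "does not consume" means in practice. The pattern `/(\w+)\b/` is an illustrative assumption, not one taken from the thread; the script records where the match ends and which character comes next.

```perl
use strict;
use warnings;

my $test = "asd asd";

# Assumption: /(\w+)\b/ is an illustrative pattern, not one from the thread.
my ($word, $end_offset, $next_char);
if ($test =~ /(\w+)\b/) {
    $word       = $1;
    $end_offset = $+[0];                         # offset where the whole match ended
    $next_char  = substr($test, $end_offset, 1); # first character left unconsumed
}

print "captured:  '$word'\n";
print "match ends at offset $end_offset\n";
print "next char: '$next_char'\n";
```

The match ends right after the first word, and the space is still waiting to be matched, which is why something like `\s` has to follow in the pattern before the backreference.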
Re^4: about word boundary in RE (use re 'debug') by anaconda_wly (Scribe) on Apr 02, 2013 at 08:49 UTC
Re^3: about word boundary in RE (use re 'debug') by Anonymous Monk on Apr 02, 2013 at 08:21 UTC
"Good, but it does not seem easily readable to me."

In that case, use a shorter string and associate the numbers from "Final program" with those on the right side, like `1: OPEN1 (3)`:

    $ perl -Mre=debug -le " q/a a/ =~ /(.\b)\1/ "
    Compiling REx "(.\b)\1"
    Final program:
       1: OPEN1 (3)
       3:   REG_ANY (4)
       4:   BOUND (5)
       5: CLOSE1 (7)
       7: REF1 (9)
       9: END (0)
    minlen 1
    Matching REx "(.\b)\1" against "a a"
       0 <> <a a>   | 1:OPEN1(3)
       0 <> <a a>   | 3:REG_ANY(4)
       1 <a> < a>   | 4:BOUND(5)
       1 <a> < a>   | 5:CLOSE1(7)
       1 <a> < a>   | 7:REF1(9)
                      failed...
       1 <a> < a>   | 1:OPEN1(3)
       1 <a> < a>   | 3:REG_ANY(4)
       2 <a > <a>   | 4:BOUND(5)
       2 <a > <a>   | 5:CLOSE1(7)
       2 <a > <a>   | 7:REF1(9)
                      failed...
       2 <a > <a>   | 1:OPEN1(3)
       2 <a > <a>   | 3:REG_ANY(4)
       3 <a a> <>   | 4:BOUND(5)
       3 <a a> <>   | 5:CLOSE1(7)
       3 <a a> <>   | 7:REF1(9)
                      failed...
       3 <a a> <>   | 1:OPEN1(3)
       3 <a a> <>   | 3:REG_ANY(4)
                      failed...
    Match failed
    Freeing REx: "(.\b)\1"

Compare against a simpler pattern like

    $ perl -Mre=debug -le " q/aa/ =~ /a\b/ "
    Compiling REx "a\b"
    Final program:
       1: EXACT <a> (3)
       3: BOUND (4)
       4: END (0)
    anchored "a" at 0 (checking anchored) minlen 1
    Guessing start of match in sv for REx "a\b" against "aa"
    Found anchored substr "a" at offset 0...
    Guessed: match at offset 0
    Matching REx "a\b" against "aa"
       0 <> <aa>    | 1:EXACT <a>(3)
       1 <a> <a>    | 3:BOUND(4)
                      failed...
       1 <a> <a>    | 1:EXACT <a>(3)
       2 <aa> <>    | 3:BOUND(4)
       2 <aa> <>    | 4:END(0)
    Match successful!
    Freeing REx: "a\b"

Then check the definition of `\b` in perlre#Assertions and perlrequick:

"Perl defines the following zero-width assertions: The word anchor `\b` matches a boundary between a word character and a non-word character `\w\W` or `\W\w`"

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

Basically your pattern can never match a repeated word, just like this one: `perl -Mre=debug -le " q/aa/ =~ /a\ba/ "`. By definition, there can never be a word boundary within a word.
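As a rough way to check the definition above without reading the debug trace, the following sketch lists every offset at which `\b` holds in a string. The helper name `boundary_offsets` is made up for illustration; it relies on Perl advancing past a zero-length match when looping with `/g`.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the offsets in $string at which the
# zero-width assertion \b holds.
sub boundary_offsets {
    my ($string) = @_;
    my @offsets;
    # Each successful zero-length match leaves pos() at the boundary;
    # the next iteration resumes the search from there.
    while ($string =~ /\b/g) {
        push @offsets, pos($string);
    }
    return @offsets;
}

for my $string ("aa", "a a") {
    my @offsets = boundary_offsets($string);
    print "'$string': \\b holds at offsets @offsets\n";
}
```

For `aa` the only boundaries are at the two ends of the string, so nothing placed between the two letters can satisfy `\b`; for `a a` there is a boundary on each side of the space.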
Re^4: about word boundary in RE (use re 'debug') by anaconda_wly (Scribe) on Apr 02, 2013 at 08:53 UTC

