Re: A regex that only matches at offset that are multiples of a given N?
by tobyink (Canon) on Feb 13, 2013 at 07:52 UTC
|
use v5.12;
my @should_match = (
q[foo],
q[WXYZfoo],
q[WXYZWXYZfoo],
q[WXYZ WXYZfoo],
);
my @should_not_match = (
q[ foo],
q[ABCfoo],
q[VWXYZfoo],
q[WXYZWXYZWXYZAfoo],
);
my $regexp = qr{^(?:.{4})*foo};
say /$regexp/ ? "ok" : "not ok" for @should_match;
say /$regexp/ ? "not ok" : "ok" for @should_not_match;
package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
$a = join'',map{ ('a'..'z')[rand 26] } 1 .. 1000;;
print "$-[0]: $1" while $a =~ m[(?:.{4})*(?=(aa..))]g;;
0: aawx
404: aawx
405: aadz
481: aadz
print "$-[0]: $1" while $a =~ m[(?:.{4})*(?=(gg..))]g;;
0: gghn
208: gghn
211: ggyj
955: ggyj
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] [d/l] |
|
use v5.12;
my $a = join '', map{ ('a'..'j')[rand 10] } 1 .. 1000;
while ($a =~ m[(aa..)]g)
{
next if pos($a) % 4;
say "Match at ", pos($a), ": ", $1;
}
It's possible that (?(cond)yes-expr|no-expr) might be able to do what you want, but I've not had much luck with that.
Update: with the hints about pos % 4 below, I've managed to get (?(cond)yes-expr|no-expr) to work. Not sure how well it goes performance-wise in practice, but:
use v5.12;
my $a = join '', map{ ('a'..'j')[rand 10] } 1 .. 1000;;
while ($a =~ m[(?(?{ pos() % 4 })(*F)|(aa..))]g)
{
say "Match at ", pos($a), ": ", $1;
}
package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
|
|
Why have you removed the ^ anchor?
| [reply] [Watch: Dir/Any] [d/l] |
|
|
You're just missing a leading \G
| [reply] [Watch: Dir/Any] [d/l] |
Re: A regex that only matches at offset that are multiples of a given N?
by johngg (Canon) on Feb 13, 2013 at 10:56 UTC
|
Similar to tobyink's solution but doing a global match in case there are multiple "fred"s on your intervals. It seems to work except that it matches twice at each point, probably for similar reasons as explored here.
]$ perl -Mstrict -Mwarnings -E '
$_ = q{abcdefredfghfredijklmnopfredqrs};
for my $n ( 4, 5 )
{
say qq{\$n = $n};
say qq{Matched $1 at position @{ [ pos( $_ ) ] }} while
m{\G(?:.{$n})*?(?=(fred.*))}g;
}'
$n = 4
Matched fredijklmnopfredqrs at position 12
Matched fredijklmnopfredqrs at position 12
Matched fredqrs at position 24
Matched fredqrs at position 24
$n = 5
Matched fredfghfredijklmnopfredqrs at position 5
Matched fredfghfredijklmnopfredqrs at position 5
$
I hope this is useful.
| [reply] [Watch: Dir/Any] [d/l] |
|
print "$-[0]: $1" while $a =~ m[\G(?:.{4})*?(?=(aa..))]g;;
0: aawx
404: aawx
print "$-[0]: $1" while $a =~ m[\G(?:.{4})*?(?=(gg..))]g;;
0: gghn
208: gghn
(*)I wasn't seeing the double matching; but now I am. Then I thought moving the \G would fix it, but it doesn't :( )
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] [d/l] |
|
Maybe you could avoid the double matching, by manually incrementing pos() after each successful match - either inside the body of the while loop, or at the end of the regex using (?{pos() += 4}) ?
(Just an idea, haven't tested it.)
| [reply] [Watch: Dir/Any] [d/l] |
|
|
|
Re: A regex that only matches at offset that are multiples of a given N?
by johngg (Canon) on Feb 13, 2013 at 23:16 UTC
|
I think smls might be on the right track with moving pos. This benchmark seems to show that there is, if anything, a performance gain. I tried a modification to my first attempt my removing the duplicate matches via a hash but it only showed a marginal improvement in performance.
Then again, I am not very good at benchmarks so I could have cocked it up :-/
use strict;
use warnings;
use 5.014;
use Benchmark qw{ cmpthese };
my $n = 4;
my $str = q{x} x 50;
substr $str, $_, 4, q{fred} for 4, 9, 20, 24, 31, 40;
say qq{String: $str\n};
my $rcMovePos = sub {
my $raMatches;
while ( $str =~ m{\G(?:.{$n})*?(?=(fred.*))}g )
{
push @{ $raMatches }, [ pos( $str ), $1 ];
pos $str += $n;
}
return $raMatches;
};
my $rcNoDups = sub {
my $rhMatches;
$rhMatches->{ pos( $str ) } = $1 while
$str =~ m{\G(?:.{$n})*?(?=(fred.*))}g;
return $rhMatches;
};
my $rcWithDups = sub {
my $raMatches;
push @{ $raMatches }, [ pos( $str ), $1 ] while
$str =~ m{\G(?:.{$n})*?(?=(fred.*))}g;
return $raMatches;
};
my $raRes = $rcMovePos->();
say q{Using $rcMovePos};
say qq{ Matched $_->[ 1 ] at position $_->[ 0 ]} for @{ $raRes };
my $rhRes = $rcNoDups->();
say q{Using $rcNoDups};
say qq{ Matched $rhRes->{ $_ } at position $_} for
sort { $a <=> $b } keys %{ $rhRes };
$raRes = $rcWithDups->();
say q{Using $rcWithDups};
say qq{ Matched $_->[ 1 ] at position $_->[ 0 ]} for @{ $raRes };
srand 1234567890;
$str = q{x} x 10000;
substr $str, int rand 9997, 4, q{fred} for 1 .. 50;
say q{};
cmpthese(
-5,
{
movePos => $rcMovePos,
noDups => $rcNoDups,
withDups => $rcWithDups,
} );
String: xxxxfredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx
Using $rcMovePos
Matched fredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx at position 4
Matched fredfredxxxfredxxxxxfredxxxxxx at position 20
Matched fredxxxfredxxxxxfredxxxxxx at position 24
Matched fredxxxxxx at position 40
Using $rcNoDups
Matched fredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx at position 4
Matched fredfredxxxfredxxxxxfredxxxxxx at position 20
Matched fredxxxfredxxxxxfredxxxxxx at position 24
Matched fredxxxxxx at position 40
Using $rcWithDups
Matched fredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx at position 4
Matched fredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx at position 4
Matched fredfredxxxfredxxxxxfredxxxxxx at position 20
Matched fredfredxxxfredxxxxxfredxxxxxx at position 20
Matched fredxxxfredxxxxxfredxxxxxx at position 24
Matched fredxxxfredxxxxxfredxxxxxx at position 24
Matched fredxxxxxx at position 40
Matched fredxxxxxx at position 40
Rate withDups noDups movePos
withDups 2321/s -- -3% -33%
noDups 2394/s 3% -- -31%
movePos 3445/s 48% 44% --
I hope this is of interest.
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: A regex that only matches at offset that are multiples of a given N?
by smls (Friar) on Feb 13, 2013 at 23:35 UTC
|
Alright, here is a revised version of johngg's solution, that prevents the redundant double matching, while at the same time keeping the matching logic self-contained within the regex.
Instead of moving pos() forward manually (like I suggested in the discussion thread for johngg's solution), it lets the regex engine do this implicitly by having it gobble up $n characters (if available) after matching the zero-width look-ahead that contains the capture group:
# 0 5 10 15 20 25 30 35
# ' ' ' ' ' ' ' '
$_ = q{.....fred1..fred2...fred3....fred4..};
# ----++++----||||----||||----++++---- $n = 4
# -----||||+-----+++++||||-+++++-----+ $n = 5
my $capture = qr([0-9]\.+); # the (....) in the OP's specification
for my $n ( 4, 5 )
{
say "\$n = $n";
while ( m[\G(?:.{$n})*?(?=fred($capture)).{0,$n}]g ) {
say " matched 'fred$1' at pos @{[pos($_)-$n]} (gobbled '$&')";
}
}
Output:
$n = 4
matched 'fred2...' at pos 12 (gobbled '.....fred1..fred')
matched 'fred3....' at pos 20 (gobbled '2...fred')
$n = 5
matched 'fred1..' at pos 5 (gobbled '.....fred1')
matched 'fred3....' at pos 20 (gobbled '..fred2...fred3')
Note that if length("fred$1") > $n, it will actually start looking for the next "fred" while still whithin the part matched by $1. If this must be avoided, I guess manual pos()-incrementing is still the best bet. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
That is really very clever. Thank you. (I'll get around to trying it out in my real application later and let you know how I get on.)
I also really like your test methodology. Mixing the freds at different multiple boundaries within the same string is a very neat way of testing.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] |
|
Mixing the freds at different multiple boundaries within the same string is a very neat way of testing.
I can't take credit for that, it was copied from johngg's answer. I just made it a little more readable.
| [reply] [Watch: Dir/Any] |
Re: A regex that only matches at offset that are multiples of a given N?
by ikegami (Patriarch) on Feb 14, 2013 at 07:55 UTC
|
/\G(?:.{4})*(?=fred(....))/sg
will work if you're okay with overlapping matches. If not, you could resort to using the slower
/fred(?(?{ pos % 4 != 0 })(?!))(....)/sg
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
The first one will not work. It's identical to johngg's solution, except missing the non-greedy quantifier after the first parenthesis. Thus it will only find the last match rather than all of them (and it will return it twice).
Your second regex works, and is probably the semantically cleanest solution posted so far. It also shouldn't be that slow, especially if the ratio of "fred" occurrences to the total length of the string is low. It can be generalized to arbitrary $n like this (note that the "!= 0" is redundant):
/fred(?(?{ (pos()-4) % $n })(?!))($capture)/g
However, if this regex is reused for multiple values of $n which is declared with my in the parent scope, it seems to keep using the first one (like a closure). Declaring the $n with our seems to fix this. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
Your second regex works,
I'm not seeing that:
use strict;
use warnings;
use 5.010;
# 0 5 10 15 20 25 30 35
# ' ' ' ' ' ' ' '
$_ = q{....fred1....fred2...fred3....fred4..};
while (/fred(?(?{ pos % 4 != 0 })(?!))(....)/sg) {
}
--output:--
Use of uninitialized value in numeric ne (!=) at (re_eval 1) line 1.
Use of uninitialized value in numeric ne (!=) at (re_eval 1) line 1.
Use of uninitialized value in numeric ne (!=) at (re_eval 1) line 1.
| [reply] [Watch: Dir/Any] [d/l] |
|
|
fred
(?
(?{ pos % 4 != 0 })
(?!)
)
(....)
What does this construct do:
(?stuff)
I can't find anything about that in perlre.
Edit: Ah, the boolean test finally clued me in:
(?(condition)yes-pattern)
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
| [reply] [Watch: Dir/Any] |
Re: A regex that only matches at offset that are multiples of a given N?
by 7stud (Deacon) on Feb 16, 2013 at 05:41 UTC
|
Help!
If I use the construct (?(condition)yes-pattern) in a regex, and the condition is (?{1}), i.e. always true, the output is as expected:
my $str = 'bxAybz';
while ( $str =~ /(?(?{1})(b[xyz]))/g ) {
say 'yes';
say $1;
}
--output:--
yes
bx
yes
bz
But when I use the condition (?{pos() % 2 == 0}), I expect the same output, but I don't get it:
my $str = 'bxAybz';
while ( $str =~ /(?(?{pos() % 2 == 0})(b[xyz]))/g ) {
say 'yes';
say $1;
}
--output:--
yes
bx
yes
Use of uninitialized value $1 in say at 2.pl line 9.
yes
bz
Three matches?
Also, I notice the x modifier doesn't work with a conditional pattern:
my $str = 'bxAybz';
while ( $str =~ /
(?
(?{1})
(b[xyz])
)
/gx ) {
say 'yes';
say $1;
}
--output:--
Sequence (?
...) not recognized in regex; marked by <-- HERE in m/
(?
<-- HERE (?{1})
(b[xyz])
)
/ at 2.pl line 12.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
First, I'm not the right person to be asking this question of. Whatever regex expertise I once had is well out of date. There is a whole bunch of stuff I've never done anyting with. However, this is my interpretation of the decidedly unclear documentation:
Three matches?
The (what appears to be called) zero-length switch assertion, appears to succeed, whenever the condition part succeeds; regardless of whether the yes pattern (or no pattern, if present) succeed.
So with 6 characters in your string, and a condition that restricts matching to every second character, the overall match succeeds 3 times, even if the yes pattern only matches at 2 of those positions. Hence your output.
Also, I notice the x modifier doesn't work with a conditional pattern:
It appears that you are breaking up an indivisible token with the whitespace you've used. This works:
C:\test>perl -M5.010 -w
my $str = 'bxAybz';
while ( $str =~ /
(?(?{1})
(b[xyz])
)
/gx ) {
say 'yes';
say $1;
}
^Z
yes
bx
yes
bz
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [Watch: Dir/Any] [d/l] |
|
/(?(?{pos() % 2 == 0})(b[xyz])(*FAIL))/g
package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
| [reply] [Watch: Dir/Any] [d/l] [select] |