Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

A regex that only matches at offset that are multiples of a given N?

by BrowserUk (Pope)
on Feb 13, 2013 at 07:34 UTC ( #1018488=perlquestion: print w/ replies, xml ) Need Help??
BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Ie. If n = 4; then what to put in place of ???? in order that this m[(?:????{$n})(?:=fred(....))] will only match fred at position 0, 4, 8, 12, ...?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on A regex that only matches at offset that are multiples of a given N?
Download Code
Re: A regex that only matches at offset that are multiples of a given N?
by tobyink (Abbot) on Feb 13, 2013 at 07:52 UTC

    I'd put . in place of ???? and then * after the first parentheses...

    use v5.12; my @should_match = ( q[foo], q[WXYZfoo], q[WXYZWXYZfoo], q[WXYZ WXYZfoo], ); my @should_not_match = ( q[ foo], q[ABCfoo], q[VWXYZfoo], q[WXYZWXYZWXYZAfoo], ); my $regexp = qr{^(?:.{4})*foo}; say /$regexp/ ? "ok" : "not ok" for @should_match; say /$regexp/ ? "not ok" : "ok" for @should_not_match;
    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
      wish it were so simple:

      $a = join'',map{ ('a'..'z')[rand 26] } 1 .. 1000;; print "$-[0]: $1" while $a =~ m[(?:.{4})*(?=(aa..))]g;; 0: aawx 404: aawx 405: aadz 481: aadz print "$-[0]: $1" while $a =~ m[(?:.{4})*(?=(gg..))]g;; 0: gghn 208: gghn 211: ggyj 955: ggyj

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Why have you removed the ^ anchor?
        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Ah, you're using /g... in that case I'd just...

        use v5.12; my $a = join '', map{ ('a'..'j')[rand 10] } 1 .. 1000; while ($a =~ m[(aa..)]g) { next if pos($a) % 4; say "Match at ", pos($a), ": ", $1; }

        It's possible that (?(cond)yes-expr|no-expr) might be able to do what you want, but I've not had much luck with that.

        Update: with the hints about pos % 4 below, I've managed to get (?(cond)yes-expr|no-expr) to work. Not sure how well it goes performance-wise in practice, but:

        use v5.12; my $a = join '', map{ ('a'..'j')[rand 10] } 1 .. 1000;; while ($a =~ m[(?(?{ pos() % 4 })(*F)|(aa..))]g) { say "Match at ", pos($a), ": ", $1; }
        package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
        You're just missing a leading \G
Re: A regex that only matches at offset that are multiples of a given N?
by johngg (Abbot) on Feb 13, 2013 at 10:56 UTC

    Similar to tobyink's solution but doing a global match in case there are multiple "fred"s on your intervals. It seems to work except that it matches twice at each point, probably for similar reasons as explored here.

    ]$ perl -Mstrict -Mwarnings -E ' $_ = q{abcdefredfghfredijklmnopfredqrs}; for my $n ( 4, 5 ) { say qq{\$n = $n}; say qq{Matched $1 at position @{ [ pos( $_ ) ] }} while m{\G(?:.{$n})*?(?=(fred.*))}g; }' $n = 4 Matched fredijklmnopfredqrs at position 12 Matched fredijklmnopfredqrs at position 12 Matched fredqrs at position 24 Matched fredqrs at position 24 $n = 5 Matched fredfghfredijklmnopfredqrs at position 5 Matched fredfghfredijklmnopfredqrs at position 5 $

    I hope this is useful.

    Cheers,

    JohnGG

      Ah! Almost(*) perfect. (I never have wrapped my brain around \G :( )

      print "$-[0]: $1" while $a =~ m[\G(?:.{4})*?(?=(aa..))]g;; 0: aawx 404: aawx print "$-[0]: $1" while $a =~ m[\G(?:.{4})*?(?=(gg..))]g;; 0: gghn 208: gghn

      (*)I wasn't seeing the double matching; but now I am. Then I thought moving the \G would fix it, but it doesn't :( )


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Maybe you could avoid the double matching, by manually incrementing pos() after each successful match - either inside the body of the while loop, or at the end of the regex using (?{pos() += 4}) ?

        (Just an idea, haven't tested it.)

Re: A regex that only matches at offset that are multiples of a given N?
by johngg (Abbot) on Feb 13, 2013 at 23:16 UTC

    I think smls might be on the right track with moving pos. This benchmark seems to show that there is, if anything, a performance gain. I tried a modification to my first attempt my removing the duplicate matches via a hash but it only showed a marginal improvement in performance.

    Then again, I am not very good at benchmarks so I could have cocked it up :-/

    use strict; use warnings; use 5.014; use Benchmark qw{ cmpthese }; my $n = 4; my $str = q{x} x 50; substr $str, $_, 4, q{fred} for 4, 9, 20, 24, 31, 40; say qq{String: $str\n}; my $rcMovePos = sub { my $raMatches; while ( $str =~ m{\G(?:.{$n})*?(?=(fred.*))}g ) { push @{ $raMatches }, [ pos( $str ), $1 ]; pos $str += $n; } return $raMatches; }; my $rcNoDups = sub { my $rhMatches; $rhMatches->{ pos( $str ) } = $1 while $str =~ m{\G(?:.{$n})*?(?=(fred.*))}g; return $rhMatches; }; my $rcWithDups = sub { my $raMatches; push @{ $raMatches }, [ pos( $str ), $1 ] while $str =~ m{\G(?:.{$n})*?(?=(fred.*))}g; return $raMatches; }; my $raRes = $rcMovePos->(); say q{Using $rcMovePos}; say qq{ Matched $_->[ 1 ] at position $_->[ 0 ]} for @{ $raRes }; my $rhRes = $rcNoDups->(); say q{Using $rcNoDups}; say qq{ Matched $rhRes->{ $_ } at position $_} for sort { $a <=> $b } keys %{ $rhRes }; $raRes = $rcWithDups->(); say q{Using $rcWithDups}; say qq{ Matched $_->[ 1 ] at position $_->[ 0 ]} for @{ $raRes }; srand 1234567890; $str = q{x} x 10000; substr $str, int rand 9997, 4, q{fred} for 1 .. 50; say q{}; cmpthese( -5, { movePos => $rcMovePos, noDups => $rcNoDups, withDups => $rcWithDups, } );
    String: xxxxfredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx Using $rcMovePos Matched fredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx at position 4 Matched fredfredxxxfredxxxxxfredxxxxxx at position 20 Matched fredxxxfredxxxxxfredxxxxxx at position 24 Matched fredxxxxxx at position 40 Using $rcNoDups Matched fredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx at position 4 Matched fredfredxxxfredxxxxxfredxxxxxx at position 20 Matched fredxxxfredxxxxxfredxxxxxx at position 24 Matched fredxxxxxx at position 40 Using $rcWithDups Matched fredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx at position 4 Matched fredxfredxxxxxxxfredfredxxxfredxxxxxfredxxxxxx at position 4 Matched fredfredxxxfredxxxxxfredxxxxxx at position 20 Matched fredfredxxxfredxxxxxfredxxxxxx at position 20 Matched fredxxxfredxxxxxfredxxxxxx at position 24 Matched fredxxxfredxxxxxfredxxxxxx at position 24 Matched fredxxxxxx at position 40 Matched fredxxxxxx at position 40 Rate withDups noDups movePos withDups 2321/s -- -3% -33% noDups 2394/s 3% -- -31% movePos 3445/s 48% 44% --

    I hope this is of interest.

    Cheers,

    JohnGG

Re: A regex that only matches at offset that are multiples of a given N?
by smls (Friar) on Feb 13, 2013 at 23:35 UTC

    Alright, here is a revised version of johngg's solution, that prevents the redundant double matching, while at the same time keeping the matching logic self-contained within the regex.

    Instead of moving pos() forward manually (like I suggested in the discussion thread for johngg's solution), it lets the regex engine do this implicitly by having it gobble up $n characters (if available) after matching the zero-width look-ahead that contains the capture group:

    # 0 5 10 15 20 25 30 35 # ' ' ' ' ' ' ' ' $_ = q{.....fred1..fred2...fred3....fred4..}; # ----++++----||||----||||----++++---- $n = 4 # -----||||+-----+++++||||-+++++-----+ $n = 5 my $capture = qr([0-9]\.+); # the (....) in the OP's specification for my $n ( 4, 5 ) { say "\$n = $n"; while ( m[\G(?:.{$n})*?(?=fred($capture)).{0,$n}]g ) { say " matched 'fred$1' at pos @{[pos($_)-$n]} (gobbled '$&')"; } }

    Output:

    $n = 4 matched 'fred2...' at pos 12 (gobbled '.....fred1..fred') matched 'fred3....' at pos 20 (gobbled '2...fred') $n = 5 matched 'fred1..' at pos 5 (gobbled '.....fred1') matched 'fred3....' at pos 20 (gobbled '..fred2...fred3')

    Note that if length("fred$1") > $n, it will actually start looking for the next "fred" while still whithin the part matched by $1. If this must be avoided, I guess manual pos()-incrementing is still the best bet.

      That is really very clever. Thank you. (I'll get around to trying it out in my real application later and let you know how I get on.)

      I also really like your test methodology. Mixing the freds at different multiple boundaries within the same string is a very neat way of testing.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Mixing the freds at different multiple boundaries within the same string is a very neat way of testing.
        I can't take credit for that, it was copied from johngg's answer. I just made it a little more readable.
Re: A regex that only matches at offset that are multiples of a given N?
by ikegami (Pope) on Feb 14, 2013 at 07:55 UTC
    /\G(?:.{4})*(?=fred(....))/sg
    will work if you're okay with overlapping matches. If not, you could resort to using the slower
    /fred(?(?{ pos % 4 != 0 })(?!))(....)/sg

      The first one will not work. It's identical to johngg's solution, except missing the non-greedy quantifier after the first parenthesis. Thus it will only find the last match rather than all of them (and it will return it twice).
       

      Your second regex works, and is probably the semantically cleanest solution posted so far. It also shouldn't be that slow, especially if the ratio of "fred" occurrences to the total length of the string is low. It can be generalized to arbitrary $n like this (note that the "!= 0" is redundant):

      /fred(?(?{ (pos()-4) % $n })(?!))($capture)/g

      However, if this regex is reused for multiple values of $n which is declared with my in the parent scope, it seems to keep using the first one (like a closure). Declaring the $n with our seems to fix this.

        Your second regex works,

        I'm not seeing that:

        use strict; use warnings; use 5.010; # 0 5 10 15 20 25 30 35 # ' ' ' ' ' ' ' ' $_ = q{....fred1....fred2...fred3....fred4..}; while (/fred(?(?{ pos % 4 != 0 })(?!))(....)/sg) { } --output:-- Use of uninitialized value in numeric ne (!=) at (re_eval 1) line 1. Use of uninitialized value in numeric ne (!=) at (re_eval 1) line 1. Use of uninitialized value in numeric ne (!=) at (re_eval 1) line 1.
      In this regex:
      fred (? (?{ pos % 4 != 0 }) (?!) ) (....)

      What does this construct do:

      (?stuff)

      I can't find anything about that in perlre.

      Edit: Ah, the boolean test finally clued me in:

      (?(condition)yes-pattern)
        I can't find anything about that in perlre.

        perlre is not exactly optimized for quickly finding out the meaning of a specific regex construct.

        Use perlreref for that.

Re: A regex that only matches at offset that are multiples of a given N?
by 7stud (Deacon) on Feb 16, 2013 at 05:41 UTC

    Help!

    If I use the construct (?(condition)yes-pattern) in a regex, and the condition is (?{1}), i.e. always true, the output is as expected:

    my $str = 'bxAybz'; while ( $str =~ /(?(?{1})(b[xyz]))/g ) { say 'yes'; say $1; } --output:-- yes bx yes bz

    But when I use the condition (?{pos() % 2 == 0}), I expect the same output, but I don't get it:

    my $str = 'bxAybz'; while ( $str =~ /(?(?{pos() % 2 == 0})(b[xyz]))/g ) { say 'yes'; say $1; } --output:-- yes bx yes Use of uninitialized value $1 in say at 2.pl line 9. yes bz

    Three matches?

    Also, I notice the x modifier doesn't work with a conditional pattern:

    my $str = 'bxAybz'; while ( $str =~ / (? (?{1}) (b[xyz]) ) /gx ) { say 'yes'; say $1; } --output:-- Sequence (? ...) not recognized in regex; marked by <-- HERE in m/ (? <-- HERE (?{1}) (b[xyz]) ) / at 2.pl line 12.

      First, I'm not the right person to be asking this question of. Whatever regex expertise I once had is well out of date. There is a whole bunch of stuff I've never done anyting with. However, this is my interpretation of the decidedly unclear documentation:

      Three matches?

      The (what appears to be called) zero-length switch assertion, appears to succeed, whenever the condition part succeeds; regardless of whether the yes pattern (or no pattern, if present) succeed.

      So with 6 characters in your string, and a condition that restricts matching to every second character, the overall match succeeds 3 times, even if the yes pattern only matches at 2 of those positions. Hence your output.

      Also, I notice the x modifier doesn't work with a conditional pattern:

      It appears that you are breaking up an indivisible token with the whitespace you've used. This works:

      C:\test>perl -M5.010 -w my $str = 'bxAybz'; while ( $str =~ / (?(?{1}) (b[xyz]) ) /gx ) { say 'yes'; say $1; } ^Z yes bx yes bz

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      /(?(?{pos() % 2 == 0})(b[xyz]))/g

      I think you want to explicitly back-track in the no condition...

      /(?(?{pos() % 2 == 0})(b[xyz])(*FAIL))/g
      package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1018488]
Approved by vinoth.ree
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (9)
As of 2014-08-23 03:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (172 votes), past polls