Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Finding a repeating substring with a regex

by chargrill (Parson)
on Apr 15, 2006 at 08:01 UTC ( #543507=perlquestion: print w/ replies, xml ) Need Help??
chargrill has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

I'm trying to expand my knowledge of regexen, especially more advanced uses for them, beyond simple things like matching e-mail address, for example ;)

I'm trying to find a repeated substring in a larger string. Given the following code:

#!/usr/bin/perl use strict; use warnings; my @found; # 0 1 2 3 4 5 # 012345678901234567890123456789012345678901234567890 $_ = q'abczdefzabcghijklzabczaerabrtyuabcethdauthabkudiabc'; # abc abc abc abc abc # 0 8 18 31 48 m< ( (?{ push @found, pos() }) abc # the string I know which repeats ) \G .+? \1 >gx; print join ', ', @found, "\n";

Which prints: 0, 8, 18, 31, 48, - exactly the positions I know the substring is in relative to the larger string.

But suppose I don't know ahead of time _what_ the repeating pattern is, but would still like to detect it? Leaving the rest of the code alone, I change the regex to:

m< ( (?{ push @found, pos() }) [a-z]{3} # suppose i'm looking for a series of 3 repeating char +acters... ) \G .+? \1 >gx;

This doesn't work at all as i'd hope; it prints 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,

Can anyone help me out with at the very least a shove in the right direction? I've tried as many variations as I can think of, inserting variations of (?{ print pos(), $^N }) into various places, and just can't figure out what I need to change to achieve my desired results.



--chargrill
$,=42;for(34,0,-3,9,-11,11,-17,7,-5){$*.=pack'c'=>$,+=$_}for(reverse s +plit//=>$* ){$%++?$ %%2?push@C,$_,$":push@c,$_,$":(push@C,$_,$")&&push@c,$"}$C[$# +C]=$/;($#C >$#c)?($ c=\@C)&&($ C=\@c):($ c=\@c)&&($C=\@C);$%=$|;for(@$c){print$_^ +$$C[$%++]}

Comment on Finding a repeating substring with a regex
Select or Download Code
Re: Finding a repeating substring with a regex
by kvale (Monsignor) on Apr 15, 2006 at 08:29 UTC
    First you must find the repeating unit. Try this:
    use re 'eval'; $_ = q'abczdefzabcghijklzabczaerabrtyuabcethdauthabkudiabc'; if (/([a-z]{3}).+?\1/) { $unit = $1; m< ( (?{ push @found, pos() }) $unit ) \G .+? \1 >gx; print join ', ', @found, "\n"; }

    -Mark

Re: Finding a repeating substring with a regex
by davido (Archbishop) on Apr 15, 2006 at 08:45 UTC

    Here's a method that doesn't require that you first determine what is being repeated:

    use strict; use warnings; use diagnostics; our @found; # 0 1 2 3 4 5 # 012345678901234567890123456789012345678901234567890 my $string = q'abczdefzabcghijklzabczaerabrtyuabcethdauthabkudiabc'; # abc abc abc abc abc # 0 8 18 31 48 if( $string =~ m/ (.{3}) (?: .*? (\1) (?{ push @found, pos() - length($^N); }) )+ /x ) { print "$1: @found\n"; }

    Note the placement of the (?{...}) code after the \1 backreference condition has been met. This makes it so that the (?{...}) code is only executed if the backreference condition first passes as true. That way you don't see all the backtracked dead ends, only the branches that actually worked out to match. Because we place the 'push' after the backreference has matched, you have to subtract the length of the submatch from 'pos' to find the starting position.

    Also note that, in this case, .*? is preferable over .+?, because it is entirely possible that 'abcabc' would constitute a repeated substring, but such a case would be missed if we required there to exist a character between 'abc' and 'abc'.

    PS: For the life of me, I can never remember how to use \G without first looking it up yet again. Fortunately for me, my solution doesn't require it.


    Dave

Re: Finding a repeating substring with a regex
by japhy (Canon) on Apr 15, 2006 at 13:03 UTC
    Update: fixed @$^R to @{$^R}. You can use the return value of (?{ ... }) (which is $^R) which is subject to backtracking.
    $string =~ m{ (?{ [ pos ] }) # we start $^R as [ pos ] ([a-z]{3}) # find a string, store it in \1 (?: # now do this one or more times: .*? # match zero or more characters (non-greed +ily) (?{ [ @{$^R}, pos ] }) # append pos() to $^R's arrayref \1 # matching \1 at this location )+ (?{ @loc = @{$^R} }) # then save $^R }x

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Finding a repeating substring with a regex
by TedPride (Priest) on Apr 16, 2006 at 00:56 UTC
    The problem with doing this with a regex is you don't know whether or not there are multiple repeating substrings, and if there are, which has the most repeats. This is much better suited to a linear / hash approach, perhaps something like the following:
    use strict; use warnings; my (%count, $str, $size, $min); $str = 'abczdefzabcghijklzabczaerabrtyuabcethdauthabkudiabc'; $size = 3; # Size of substrings to look for $min = 2; # Minimum number of times substring appears $count{substr($str, $_, $size)}++ for 0..(length($str) - $size); for (sort {$count{$b} <=> $count{$a} || $a cmp $b} keys %count) { last if $count{$_} < $min; print "$_ :"; print ' '.(pos($str)-$size) while $str =~ /$_/g; print " ($count{$_} matches)\n"; }

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://543507]
Approved by Zaxo
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (12)
As of 2014-09-30 20:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (383 votes), past polls