PerlMonks  

Re: Finding repeat sequences. (only mostly regex)

by tye (Sage)
on Jun 18, 2013 at 19:40 UTC


in reply to Finding repeat sequences.

I assume that the pattern must repeat at least twice, otherwise, the full string is always the longest answer.

A simple regex can produce a good guess and tell you when that guess has failed. Each subsequent guess is more than twice as long as the previous one, so the regex doesn't have to be run very many times:

sub repeating {
    my( $string ) = @_;
    my( $pattern, $repeat, $end ) = $string =~ /^(.+?)(\1+)(.*)$/;
    while( defined $pattern ) {
        return $pattern
            if  length($end) <= length($pattern)
            &&  $end eq substr($pattern,0,length($end));
        print "($pattern) wasn't long enough.\n";
        ( $pattern, $repeat, $end )
            = $string =~ /^(\Q$pattern$repeat\E.+?)(\1+)(.*)$/
    }
    return undef;
}

my $pattern = repeating( "aabaabaabcaabaabaabca" );
printf "(%s) wins\n", $pattern
    if  $pattern;
__END__
(a) wasn't long enough.
(aab) wasn't long enough.
(aabaabaabc) wins

You likely can optimize this by copying less stuff, of course.

(Update: Well, I didn't get very rigorous in proving to myself that $pattern.$repeat is always too short. But I believe that to be the case. One should validate or refute that assumption before deciding whether to use this.)

- tye        
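The assumption flagged in the update can be checked mechanically. What follows is a minimal sketch, not part of the original post: it copies the sub above without the progress message and compares it with a plain chop-and-compare reference over every string of a's and b's up to 10 characters. The name brute_force, the two-letter alphabet, and the length limit are all assumptions made for illustration. The reference requires at least two complete repeats, matching the assumption stated at the top of the post.

```perl
use strict;
use warnings;

# The sub from the post above, minus the progress message.
sub repeating {
    my( $string ) = @_;
    my( $pattern, $repeat, $end ) = $string =~ /^(.+?)(\1+)(.*)$/;
    while( defined $pattern ) {
        return $pattern
            if  length($end) <= length($pattern)
            &&  $end eq substr( $pattern, 0, length($end) );
        ( $pattern, $repeat, $end )
            = $string =~ /^(\Q$pattern$repeat\E.+?)(\1+)(.*)$/;
    }
    return undef;
}

# Hypothetical reference: try every prefix length up to half the string
# (so there are at least two complete repeats) and return the first
# prefix whose repetition reproduces the whole string.
sub brute_force {
    my( $string ) = @_;
    my $len = length $string;
    for my $n ( 1 .. int( $len / 2 ) ) {
        return substr( $string, 0, $n )
            if  substr( $string, $n ) eq substr( $string, 0, $len - $n );
    }
    return undef;
}

# Compare the two on every string over {a,b} of length 1 to 10.
my @mismatch;
for my $len ( 1 .. 10 ) {
    for my $bits ( 0 .. 2**$len - 1 ) {
        ( my $string = sprintf "%0${len}b", $bits ) =~ tr/01/ab/;
        my $got  = repeating( $string )   // '';
        my $want = brute_force( $string ) // '';
        push @mismatch, $string
            if  $got ne $want;
    }
}
print @mismatch
    ? "disagree on: @mismatch\n"
    : "no disagreement found\n";
```

Working one case by hand suggests the assumption does not always hold. For "abaababaab" the first guess is (aba), leaving "baab" as the end, which is too long. The next guess has to start with "abaaba" and be followed by a copy of itself, which cannot fit in ten characters, so the sub returns undef even though "abaab" repeats exactly twice.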

Replies are listed 'Best First'.
Re^2: Finding repeat sequences. (only mostly regex)
by BrowserUk (Patriarch) on Jun 18, 2013 at 20:04 UTC
    I assume that the pattern must repeat at least twice, otherwise, the full string is always the longest answer.

    I wish that were the case. It mostly will be, but sometimes the string will consist of 1 complete and 1 partial rep.

    But the partial rep at the end *will* exactly match the same number of characters at the beginning of the string, so it will always be possible to determine the rep.

    But how to encode that in a regex or at least avoid a brute force chop and compare?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Note that, based on that definition, if the first and last characters are the same, then the answer is "the string minus the last character". Which leads to:

      /^(.+?).*\1$/

      Which leads to a full solution of:

      /^((.*?).*?)\2*\1$/

      which might be horribly inefficient (at least for some cases) or might not; I haven't considered it.

      - tye        

        Nice reversal of the logic, and closer:

        $s = 'aaaabaaaabaaaaabaaaab';;
        $s =~ /^((.*?).*?)\2*\1$/ and print "$2/$1";;
        a/aaaabaaaab

      Can't you find the incomplete repetition with
      /^(.*).*\1$/
      ?

      If it is complete, you get the whole one.

      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        There are no gaps between the repeats, so the uncaptured .* is not required (actually mustn't be there).

        And if the second rep is incomplete \1 will never match before $.

        I've been trying variations on

        $s = 'aaaabaaaabaaaaabaaaab';;
        $s =~ m[^(.+)\1*(.*?$)] and $1 =~ $2 and print "$1/$2";;
        aaaabaaaabaaaaabaaaab/

        The idea is that any partial rep at the end can be verified against the beginning of the full rep, but the check needs to happen inside the regex and cause backtracking.
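One way the check could be moved inside the regex, offered as a minimal sketch rather than anything posted in the thread: a conditional with a code block as its condition, falling through to (*FAIL) when the leftover tail is not a prefix of the captured unit. The name shortest_rep is hypothetical, and the sketch assumes Perl 5.10 or later for (*FAIL).

```perl
use strict;
use warnings;

# shortest_rep() is a hypothetical helper, not code from the thread.
# Returns ( unit, tail ) for the shortest unit such that the string is
# the unit, any number of further complete copies, then a prefix of it.
sub shortest_rep {
    my( $s ) = @_;
    return unless $s =~ m{
        \A (.+?)    # group 1: candidate unit, shortest first
        \1*         # any further complete copies
        (.*) \z     # group 2: the leftover tail
        # Fail, forcing backtracking, unless the tail is a prefix of the unit.
        (?(?{ index( $1, $2 ) != 0 })(*FAIL))
    }xs;
    return( $1, $2 );
}

for my $s ( 'aaaabaaaabaaaaabaaaab', 'aabaabaabcaabaabaabca' ) {
    my( $unit, $tail ) = shortest_rep( $s );
    print "$unit/$tail\n";
}
__END__
aaaabaaaaba/aaaabaaaab
aabaabaabc/a
```

Because the unit is matched lazily, the first unit to pass the check is the shortest one. A failed check backtracks into \1* and then into a longer unit, which is the "cause backtracking" part. How this performs on long strings has not been measured here.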

