Longest repeated string...

Yzzyx has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that I wrote that searches through a string of digits and prints out all the matches of substrings that are repeated.

String:

2305843009213693951

Results:

length: 1 digits: 0 quantity: 3
length: 1 digits: 1 quantity: 2
length: 1 digits: 2 quantity: 2
length: 1 digits: 3 quantity: 4
length: 1 digits: 5 quantity: 2
length: 1 digits: 9 quantity: 3
length: 2 digits: 30 quantity: 2
[download]

The program:

#!/usr/bin/perl -w
use strict;
my $line;
while ( defined ( $line = <> ) ) {
        chomp $line;
        for my $a ( 1 .. length ( $line ) ) {
                for my $b ( 1 .. ( length ( $line ) + 1 - $a ) ) {
                        my $out = substr ( $line, $a - 1, $b );
                        my $count = () = $line =~ /$out/g;
                        print "length: $b digits: $out quantity: $coun
+t\n" if $count > 1;
                }
        }
}
[download]

How I run it:

$ ./test < textfile | sort | uniq

The input string is formatted to be only the digits. I stripped all other characters except the trailing newline beforehand. I don't have any error checking for the input in place yet.

My thought process here is to generate every possible substring and then check each of these against the string and count the results. If the result is greater than 1 then I print it out. Since any result greater than 1 will have multiple answers, I sort the results and then run uniq on them. This also sorts the results so the largest ones are listed last.

Now, the program works, but as the string gets longer it gets much slower. I have gotten up to a 6,500 digit string (M25) which took about 30 minutes on a 2.8GHz P4.

My goal is to test a 9 million digit string (M43). I assume I need a more intelligent way to code this. The brute force approach isn't going to work.

Here is a web page of the strings I am working with:

http://www.mersennewiki.org/index.php/Mersenne_primes

The files I am working with are the ones in the third column.

Right now those files are split across multiple lines. I manually fix them (cat file | tr -d '\n' | sed 's/$/\n/' > a.a ; mv a.a file) but it would be nice to have this done by the program.

Warning: I know my Perl skills are weak, so please take it easy on me with regard to my coding style. I am interested in learning but I need it in baby steps. :)

Finally, the reason I started this programming exercise:

http://www.mersenneforum.org/showthread.php?t=5414

(I am "Xyzzy" on that forum.)

Again, I am very new to programming so I usually have do "reverse engineer" code snippets to a flow chart before I can really understand them.

I assume my problem here (Other than the ugly code) is the fact that the number of possible substrings is growing exponentially.

Thanks for any help!

PS - I also don't understand the () in: my $count = () = $line =~ /$out/g;

Comment on Longest repeated string... Select or Download Code

Replies are listed 'Best First'.
Re: Longest repeated string... by japhy (Canon) on Feb 03, 2006 at 16:26 UTC
In response to your postscript, a regex with the `/g` modifier in list context returns a list of the matches (or a list of the captures for each match). The `()` in `() = $line =~ /.../g` enforces list context, even though you're assigning to an empty list. The other half of the trick is that list assignment in scalar context returns the number of elements being assigned (not being assigned TO), so `my $count = () = $line =~ /.../g` stores the number of elements returned by `$line =~ /.../g`. Here's my "regex for the sake of regex" solution: `m{ (?{ [ {}, 0 ] }) ^ \d? (\d+) (?(?{ length($1) > (length($_) - $-[1])/2 or $^R->[0]{$1}++ })(?!)) (?{ [ $^R->[0], 1 ] }) (?> (?: \d? \1 (?{ [ $^R->[0], $^R->[1] + 1 ] }) )+ (?{ print report($1, $^R->[1]) }) ) (?!) }x; sub report { my ($str, $count) = @_; sprintf "length: %d digits: %s quantity: %d\n", length($str), $str, $count; }` [download] Dissection will come later. For now, just breathe it in. Read more... (3 kB) Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply] [d/l] [select]
Re: Longest repeated string... by Roy Johnson (Monsignor) on Feb 03, 2006 at 19:17 UTC
If I understand correctly, you're interested in long substrings repeated numerous times. Here's a roughed-out approach based roughly on my Re^3: Fast common substring matching, which is, in turn, based roughly on the bzip algorithm. For purposes of sanity, take a stab at some length of substring that you simply don't expect to be repeated. Maybe somewhere in the 100 to 1000 range. Make a file of all substrings of that length. (Unix) `sort` the lines of the file. Now you just need to look at each line's neighbors to see how long their common prefix is. The longest repeated substring is guaranteed to be a common prefix of two neighboring lines. To find how many times a substring is repeated, you'll have to keep checking subsequent lines until they don't have enough prefix in common to be interesting. I haven't written any code for this, and probably won't. But I hope the description is helpful. Update: I wrote some example code for finding the longest repeated substring. More: I enhanced it to generate a long string (60000 digits) and to limit the substring length, and to time itself. It runs in under 2 seconds on my PC. Read more... (899 Bytes) Caution: Contents may have been coded under pressure.	[reply] [d/l] [select]
Re: Longest repeated string... by Limbic~Region (Chancellor) on Feb 03, 2006 at 17:21 UTC
Yzzyx, First, you are going to want to google for 'longest common substring' as your brute force approach is definately not the right way to go. Second, the idiom you don't understand deserves to be in the "Perl Idioms Explained" category. In a nutshell, Perl will do what you mean (DWYM) when you give it proper context. A list in scalar context returns the number of items in the list. Third - good luck. No matter what algorithm you use (LCS) and what language you use (C/Assembler), this is not going to be a fast answer. Cheers - L~R	[reply]
Re^2: Longest repeated string... by japhy (Canon) on Feb 03, 2006 at 17:37 UTC
A list in scalar context returns the number of items in the list. Not quite. A list assignment in scalar context returns the number of items in the right-hand list. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply]
Re^3: Longest repeated string... by Limbic~Region (Chancellor) on Feb 03, 2006 at 17:42 UTC
japhy, Thanks - which is why I said this deserved an entry in "Perl Idioms Explained". I could argue that there is no assignment going on in: `while ( @array ) { ... } # or print "Array is empty\n" if ! @array;` [download] but then we could discuss the difference between lists and arrays and it wouldn't be worth it. I think I was able to get the OP to understand even if I wasn't completely accurate. I do think it would make for a good entry in "Perl Idioms Explained". Cheers - L~R	[reply] [d/l]
Re^4: Longest repeated string... by japhy (Canon) on Feb 03, 2006 at 17:45 UTC
Re^5: Longest repeated string... by Limbic~Region (Chancellor) on Feb 03, 2006 at 17:48 UTC
Re: Longest repeated string... by glasswalk3r (Friar) on Feb 03, 2006 at 16:25 UTC
I had taken a chance of reading the weblink that you wrote but after 30 seconds I gave up... so I will try to give only general advices here. First of all, don't do this: `$ ./test < textfile \| sort \| uniq` Perl has more then enough tools to do the job that is done with sort and uniq programs. System calls are expensive and sometimes the speed of those programs doesn't pay for the cost of invoking them. To read the file, use `open` function. If the file is not that big, you can read it entirely and put into memory like this: `open(IN,"<$file") or die "Cannot read $file: $!\n"; my @content = <IN>; close(IN);` [download] This will speed up things than using `while` block. Use an array to keep the results from the string: `$results[0]++ if ( $digit == 0 );` [download] You can even avoid using `if` statement to do that. For the next string to process, do a `@results = ()` to start again with zeros. Try as much as you can to avoid using next loops with `for`. Look for the Schwartzian Transform to see how to improve your code. Try using `@sequence = split( //, $sequence )` instead of a other loop. And last but not least, check the `Benchmark` module to test all those things. Alceu Rodrigues de Freitas Junior --------------------------------- "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill	[reply] [d/l] [select]
Re^2: Longest repeated string... by Limbic~Region (Chancellor) on Feb 03, 2006 at 17:33 UTC
glasswalk3r, First of all, don't do this: `$ ./test < textfile \| sort \| uniq` [download] Perl has more then enough tools to do the job that is done with sort and uniq programs. System calls are expensive and sometimes the speed of those programs doesn't pay for the cost of invoking them. I think you are making the mistake of repeating what you have heard others say without really understanding it yourself. In this particular case, sort and uniq are likely compiled C programs optimized for a single task and are far superior to Perl. While system calls can be expensive - it is just not the case here. `open(IN,"<$file") or die "Cannot read $file: $!\n"; my @content = <IN> +; close(IN); close(IN);` [download] This will speed up things than using while block. Well it may speed things up at the expense of memory. I do not know how many lines are in the file but if individual strings are 9 million characters this may definately be the wrong way to go. You still need to loop through the array so it is not going to avoid the need to loop. The speed savings come in from disk I/O. Try as much as you can to avoid using next loops with for. Look for the Schwartzian Transform to see how to improve your code. Try using @sequence = split( //, $sequence ) instead of a other loop. I am not exactly sure why you think using next inside a for loop is a bad thing. If it is possible to eliminate those loops prior to entering the loop then it is advantageous because you don't have a conditional every loop. That is seldom the case. The ST is used to speed up sorting routines when the comparison of 2 elements is expensive. This looks out of place in the context of the rest of what you said so you should probably be sure to explain why what you are saying has relavence. Finally, the real problem here is the numbers involved. Using a brute force algorithm, no matter how well it is tuned, to find the longest common substring of a 9 million digit number is going to be extremely slow. If you are interested in the math I will be happy to provide it. Cheers - L~R	[reply] [d/l] [select]
Re: Longest repeated string... by hv (Prior) on Feb 03, 2006 at 19:31 UTC
Here is a simple basic test for "is there an $n-digit repeated substring in `$string`?": `$string =~ /(.{$n}).\1/;` [download] That is: find an $n-digit substring, capture it as `$1`, then look forward in the string for another copy of `$1`. This is a reasonably efficient way to search for a repeated substring of a given length. However on a 9.2MB string that is still going to take a lot of work. Note that this won't find an overlapping repeat, such as the repeated "121" in the string "3121213"; a more complex regexp could find such repeats, but would be much slower working on large strings. One approach to finding the longest repeat would be to iterate `$n` upwards from 1: `sub longestrepeat { my($string) = @_; for (my $n = 1; 1; ++$n) { return $n - 1 unless $string =~ /(.{$n}).\1/; } }` [download] Other approaches to consider if that isn't fast enough are a) to do a binary chop (probably not useful in this case, since I think `$n` is likely to be smaller than log_2(digits)), or b) to start each subsequent search at the point the previous one succeeded (not too hard to code, but not likely to gain much). You can extend the regexp to search for 3 copies of the same substring: `$string =~ /(.{$n}).\1.\1/;` [download] .. or more, by including more copies of the `.*\1` construct. In general however I consider it extremely unlikely that you'll discover any patterns special to the digits of Mersenne primes expressed in decimal, beyond the features that any Mersenne number would have (ie the terminating digit patterns implicit in a number of the form 2ⁿ - 1). And of course doing the same thing in binary would be very boring. :) Hugo	[reply] [d/l] [select]
Re: Longest repeated string... by redEvo (Novice) on Jun 13, 2011 at 17:09 UTC
I know this thread is pretty old, but I just read it and saw a way to improve the code. One thing you could do that should speed things up is to store the results in a hash. You can then avoid using `uniq` and you will only be doing the matching operation on pieces you haven't yet seen. `#!/usr/bin/perl -w use strict; my $line; my %count; while ( defined ( $line = <> ) ) { chomp $line; for my $a ( 1 .. length ( $line ) ) { for my $b ( 1 .. ( length ( $line ) + 1 - $a ) ) { my $out = substr ( $line, $a - 1, $b ); next unless $count{ $out }; # so you don't have to use uniq next if ( $count{ $out } = () = $line =~ /$out/g ) > 1; print "length: $b digits: $out quantity: $count{ $out }"; } } }` [download] I just glanced through the other comments and saw some good points in there too, so if you're using this code make sure to take those into account as well. I didn't want to rewrite the whole thing, but just wanted to point out how the hash could be useful.	[reply] [d/l] [select]


go ahead... be a heretic
	PerlMonks