Welcome to the Monastery PerlMonks

### Analysing a (binary) string. (Solved)

by BrowserUk (Pope)
 on Jun 28, 2013 at 03:34 UTC Need Help??
BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

(Pre-note: This is neither DNA nor a compression scheme.)

I have a bunch of large (up to 20MB) binary strings (bytes), and I'm looking for repeating patterns within those strings.

I believe the strings consist of many contiguous repetitions of a substring of unknown length and I wish to find that substring.

Wrinkles:

• The string probably doesn't start or end with a complete substring.

(eg. it might be 'bcd abcd abcd ab' (without the spaces)).

• Some or all of the substring repeats may contain occasional errors.

Eg. (again spaces for clarification only) 'cdef abcdef abbdef aacdef abcdcf abcdef abd'

• To emphasis: I don't know before hand how long the substrings might be.

Could be 10s 100s, 1000s, 10000s etc.

Any thoughts on a way to tackle this?

Update: Now solved thanks to tye See Re^2: Analysing a (binary) string. for details.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: Analysing a (binary) string. (rare, distances)
by tye (Sage) on Jun 28, 2013 at 04:46 UTC

Find the least common byte value(s) and look at how far apart they are. Look for repeating patterns in those differences, thus reducing the size of the problem probably by orders of magnitude. Verify which of these repeating patterns of differences actually represent repeating patterns of bytes.

- tye

Find the least common byte value(s)

That is an utterly excellent notion. Thank you.

Doing frequency analysis on several of my strings produces stats like this:

```[
undef,
614567,
614576,
785984,
225487,
262875,
174368,
67367,
28011,
30654,
4738,
6848,
3784,
592,
435,
175,
8,
14,
2,
]

Of course, some of those low numbers could be due to the errors, but looking for common differences in the positions of the characters in the reverse of that ordering should confirm or deny my hypothesis very quickly.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Analysing a (binary) string.
by hdb (Monsignor) on Jun 28, 2013 at 07:11 UTC

• Not starting with a complete substring is impossible. Your example 'bcd abcd abcd ab' really is 'bcda bcda bcda b'. So without loss of generality you can always assume that the string starts with the pattern.
• Due to the occasional errors in the string you need to replace equality with some check tolerating a few differences, eg for some offset \$i and a given \$tolerance
```my \$n = length \$\$strref;
# shift/rotate string and compare to original
my \$diff = \$\$strref ^ substr( \$\$strref, -\$i ).substr( \$\$strref, 0, -\$i
+ );
# number of differing characters between shifted string and original
my \$ndiff = \$diff =~ tr/\0//c;
return 1 if \$ndiff/\$n < \$tolerance
• A brute force method (iterating over all offsets) has probably quadratic runtime as a function of string length, but might stop short most of the time if the repeating pattern is not too long compared to the overall string and the tolerance not set too small.
• The skip ahead method from earlier discussions (Finding repeat sequences.) is not reliable due to the errors but tye has already proposed an alternative.

Not starting with a complete substring is impossible. Your example 'bcd abcd abcd ab' really is 'bcda bcda bcda b'. So without loss of generality you can always assume that the string starts with the pattern.

That is a really astute observation, and one with consequences for my application. (Pretty obvious now you've pointed it out, but it wasn't so before. :)

These big strings are really themselves substrings within even larger (effectively infinite) string of repeats, of which I am able to grab a snapshot. I am sampling a lump of that infinite string starting and stopping at some random position, hence the "bit at the beginning and bit at the end" description.

The consequence of your observation is that while the repeat does have a definite start, I can never determines that from my snapshot. I can find the length and content of the repeat -- so long as I have sampled enough of th data -- but my version of it may be rotated from the real thing,

I don't think that matters for my purpose, but it is good to know.

The skip ahead method from earlier discussions (Finding repeat sequences.) is not reliable due to the errors but tye has already proposed an alternative.

Yes, the possibility of errors is the reason for needing a new approach.

And indeed, tye's notion has allowed me to both find the repeats in samples ranging from 11MB to 31MB very quickly; and discover that 3MB through 8MB is often not enough.

This is the code I used based on his idea:

```#! perl -slw
use strict;
use Data::Dump qw[ pp ];

open I, '<:raw', \$ARGV[0] or die \$!;
my \$s = do{ local \$/; <I> };
close I;
\$|++; print length \$s;

my @c; ++\$c[ ord \$1 ] while \$s =~ m[(.)]g;
pp \@c; scalar <STDIN>;

for( my \$i = \$#c; \$i; --\$i ) {
next unless \$c[ \$i ] > 2;
my @p; \$p[ @p ] = \$-[0] while \$s =~ m[\${ \chr( \$i )}]g;
my @spacing = map{ \$p[ \$_ + 1 ] - \$p[ \$_ ]   } 0 .. \$#p-1;
print ">>@spacing"; scalar <STDIN>;
}

Which produces (severely cut down for posting):

```5644800
[
undef,
1455300,
1455300,
1656200,
386120,
429240,
184240,
56840,
11760,
7840,
undef,
1960,
]

>>2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 ... for
+1960 values.

Use of uninitialized value within @c in numeric gt (>) at C:\
>>685 644 813 638 813 644 685 838 685 644 813 638 813 644 685 ... for
+7840 values

>>618 26 717 739 8 739 717 26 618 775 2 775 618 26 717 739 8 739 717 2
+6 618 775 2 775 ... for 11760 values.
Terminating on signal SIGINT(2)

The obvious repetition in the first set of positional differences (2134 + 3626) sums to 5760.

That allows me to see the repetition in the second set (685 + 644 + 813 + 638 + 813 + 644 + 685 + 838) = 5760;

And in the third set (618 + 26 + 717 + 739 + 8 + 739 + 717 + 26 + 618 + 775 + 2 + 775) = 5760.

And with 3 confirmations, I know the repetition size.

Conversely, on samples that aren't big enough to capture the repetition, there are no correlations. Job done. Thank you tye.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

I'm glad it turned out to be so helpful. You're welcome, BrowserUk.

- tye

Even though you declared this problem as solved, I have been thinking about the skip ahead technique. It could be used if

1. you know a minimum length of the pattern, say at least 100,
2. and you know that the number of errors is small compared to the length of the pattern, say at most 5.
You could then try finding a repeat of the first 100 chars somewhere later in the string using some tolerant matching technique, say at position 5000, and then lengthen the pattern to the first 5000 characters. This will not be as efficient as the original skip ahead but might speed up things somewhat. Not sure how it compares to the frequency analysis.

Re: Analysing a (binary) string.
by GrandFather (Sage) on Jun 28, 2013 at 06:06 UTC

Is there a constraint on the minimum interesting substring length?

Can you treat a "short" part of the large string as representative for purposes of discovering the repeating substring?

Is there a minimum or maximum amount of "padding" between repeats?

May there be non-repeating header or tail sections in the large string?

Maybe a little more of the bigger picture may help understand the constraints on the problem?

True laziness is hard work
1. Is there a constraint on the minimum interesting substring length?

If the substring was less than a few tens, any repeating pattern would be obvious visually -- and it isn't -- so say 50 or 60 as a minimum.

2. Can you treat a "short" part of the large string as representative for purposes of discovering the repeating substring?

Certainly as a first pass. Once boundaries are established it is easy to go back and check the full length.

3. Is there a minimum or maximum amount of "padding" between repeats?

No padding between repeats. (Hence "contiguous" in the OP.)

4. May there be non-repeating header or tail sections in the large string?

In the form of a partial (end) repeat at the beginning and a partial (start) repeat at the end. With both obviously shorter than the length of the repeat.

5. Maybe a little more of the bigger picture may help understand the constraints on the problem?

Not really. They really are just strings of small numbers that if there are repeats it allows me to do one thing; if not then I need to do something else.

I know there are repeats in the bigger dataset that these strings are a small subset -- I've already found some of them -- but it may be that the subsets I currently have are too short to contain the 2 or more whole repeats that is required for me to recognise them.

The really big picture would take a great deal of explaining and only lead to rambling side discussions that wouldn't help with this particular sub-problem.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Analysing a (binary) string.
by ambidexter (Initiate) on Jun 28, 2013 at 05:40 UTC
Hi, BrowserUk

If i understand it right, ! Since we dont know the length of the sub string that might repeat. It can be of length from 1-N/2. So if we start from 1 we need to parse the whole array N and for length 2 - (N-1) for length 3 - (N-2). So the time complexity of any algorithm for this problem would be "~N factorial". So if the string length is > 100 then it might take hours to get the complete solution !!

I cannot see where you get N factorial from. To me it seems what you describe is N^2. This is also what I get from the brute force method below:

```use strict;
use warnings;
use Time::HiRes qw/time/;

sub find_best_offset {
my \$strref = shift;
my \$n = length \$\$strref;
my \$mindiff = \$n;
my \$best_offset = -1;
for( my \$i=1; \$i<=\$n/2; \$i++ ) {
# shift/rotate string and compare to original
my \$diff = \$\$strref ^ substr( \$\$strref, -\$i ).substr( \$\$strref
+, 0, -\$i );
# number of differing characters between shifted string and or
+iginal
my \$ndiff = \$diff =~ tr/\0//c;
# test for and keep optimal solution
if( \$ndiff < \$mindiff ) {
return \$i unless \$ndiff; # best case!
\$mindiff = \$ndiff;
\$best_offset = \$i;
}
}
return \$best_offset;
}

for ( 9, 99, 999, 9999, 99999 ) {
my \$t = time;
my \$str1 = "01234567890123456789".("01234567890123456x89" x \$_);
print find_best_offset( \\$str1 );
print "\t\$_\t",time-\$t,"\n";
}
Re: Analysing a (binary) string.
by BerntB (Deacon) on Jun 28, 2013 at 10:01 UTC

This wasn't a bio problem, but... could someone that knows bioperl/DNA sequencing comment on the possibility of this methodology?

1. Chop up the long binary string into substrings
2. Encode the bytes as DNA base pairs
3. Use existing alignment implementations for DNA sequences

You'd probably want to do that with different size substrings, of course. (And yeah, this would be using a screwdriver as a saw, or something.)

Re: Analysing a (binary) string. (Solved)
by hexcoder (Chaplain) on Jun 28, 2013 at 13:36 UTC
Hi,

just for completeness (since its solved). A suitable data structure could be that of a suffix tree or a suffix array. It allows finding the longest repeating substring(s) in linear time.

I am using it in the detection of cut-and-paste code.

Two questions:

1. Have you ever built a suffix (or any kind of) tree in Perl with 30 million nodes?
2. What effect do you think the presence of errors in the repeats would have upon the suffix trees ability to detect them?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
1. Not yet, but a suffix array has lower space requirements than a suffix tree.

2. In Dan Gusfield's book 'Algorithms on Strings, Trees, and Sequences' several inexact matching solutions based on suffix trees are discussed (for example in chapter 12.2). If there are up to k errors without insertions and deletions, it is the k-mismatch problem (see chap. 9.4) with a solution running in O(km), where k is the number of errors and m is the length of the string. Otherwise if the errors shall include insertions/deletions it is called the k-difference problem.

Hope that helps.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://1041134]
Approved by Old_Gray_Bear
Front-paged by Old_Gray_Bear
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (5)
As of 2017-08-22 22:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
Who is your favorite scientist and why?

Results (344 votes). Check out past polls.

Notices?