PerlMonks  

Not starting with a complete substring is impossible. Your example 'bcd abcd abcd ab' really is 'bcda bcda bcda b'. So without loss of generality you can always assume that the string starts with the pattern.

That is a really astute observation, and one with consequences for my application. (Pretty obvious now you've pointed it out, but it wasn't so before. :)

These big strings are really themselves substrings within an even larger (effectively infinite) string of repeats, of which I am able to grab a snapshot. I am sampling a lump of that infinite string starting and stopping at some random position, hence the "bit at the beginning and bit at the end" description.

The consequence of your observation is that while the repeat does have a definite start, I can never determine that from my snapshot. I can find the length and content of the repeat -- so long as I have sampled enough of the data -- but my version of it may be rotated from the real thing.

I don't think that matters for my purpose, but it is good to know.
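
If it ever does matter, one way to compare two rotated copies of the same repeat is to reduce each to a canonical rotation first. What follows is only a sketch: canonical_rotation() is a name made up here, not something from the thread, and it simply picks the lexicographically smallest rotation by brute force.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lexicographically smallest rotation
# of $unit. Quadratic in the length of $unit, which should be tolerable
# for a repeat of a few thousand bytes.
sub canonical_rotation {
    my( $unit ) = @_;
    my $n       = length $unit;
    my $doubled = $unit . $unit;    # every rotation is a substring of this
    my $best    = $unit;
    for my $i ( 1 .. $n - 1 ) {
        my $rot = substr $doubled, $i, $n;
        $best = $rot if $rot lt $best;
    }
    return $best;
}

print canonical_rotation( 'bcda' ), "\n";
print canonical_rotation( 'abcd' ), "\n";
```

Both calls print the same string, so two snapshots that caught the repeat at different offsets can be compared directly once each has been reduced this way.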

The skip ahead method from earlier discussions (Finding repeat sequences.) is not reliable due to the errors but tye has already proposed an alternative.

Yes, the possibility of errors is the reason for needing a new approach.

And indeed, tye's notion has allowed me both to find the repeats in samples ranging from 11MB to 31MB very quickly, and to discover that 3MB through 8MB is often not enough.

This is the code I used based on his idea:

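#! perl -slw
use strict;
use Data::Dump qw[ pp ];

open I, '<:raw', $ARGV[0] or die $!;
my $s = do{ local $/; <I> };
close I;

$|++;
print length $s;

## Count the occurrences of each byte value.
## (Without /s, '.' does not match "\n", so byte 10 is never counted.)
my @c;
++$c[ ord $1 ] while $s =~ m[(.)]g;
pp \@c;
scalar <STDIN>;

## Working down from the highest byte value seen, print the gaps
## between successive occurrences of that value, then wait for Enter.
for( my $i = $#c; $i; --$i ) {
    next unless $c[ $i ] > 2;
    my @p;
    $p[ @p ] = $-[0] while $s =~ m[${ \chr( $i )}]g;
    my @spacing = map{ $p[ $_ + 1 ] - $p[ $_ ] } 0 .. $#p-1;
    print ">>@spacing";
    scalar <STDIN>;
}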

Which produces (severely cut down for posting):

5644800
[
  undef,
  1455300,
  1455300,
  1656200,
  386120,
  429240,
  184240,
  56840,
  11760,
  7840,
  undef,
  1960,
]
>>2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 ... for 1960 values.
Use of uninitialized value within @c in numeric gt (>) at C:\
>>685 644 813 638 813 644 685 838 685 644 813 638 813 644 685 ... for 7840 values
>>618 26 717 739 8 739 717 26 618 775 2 775 618 26 717 739 8 739 717 26 618 775 2 775 ... for 11760 values.
Terminating on signal SIGINT(2)

The first set of positional differences obviously repeats as a pair (2134 + 3626), which sums to 5760.

That allows me to see the repetition in the second set: (685 + 644 + 813 + 638 + 813 + 644 + 685 + 838) = 5760.

And in the third set: (618 + 26 + 717 + 739 + 8 + 739 + 717 + 26 + 618 + 775 + 2 + 775) = 5760.

And with 3 confirmations, I know the repetition size.
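
The eyeballing above could probably be automated. A minimal sketch, assuming the gap list for a byte value is clean (a corrupted byte would break the exact comparison, and no tolerance is attempted here) and that it covers at least two full cycles. period_from_spacing() is a made-up name, not part of the code above.

```perl
use strict;
use warnings;
use List::Util qw[ sum ];

# Hypothetical helper: find the shortest cycle in a list of gaps and
# return the sum of one cycle, i.e. the repeat length in bytes.
# Returns undef when no cycle fits at least twice into the list.
sub period_from_spacing {
    my @gap = @_;
    for my $k ( 1 .. int( @gap / 2 ) ) {
        my $ok = 1;
        for my $i ( 0 .. $#gap - $k ) {
            if( $gap[ $i ] != $gap[ $i + $k ] ) { $ok = 0; last }
        }
        return sum( @gap[ 0 .. $k - 1 ] ) if $ok;
    }
    return undef;
}

# Gap cycles taken from the output above, each repeated twice.
my @first  = ( 2134, 3626 ) x 2;
my @second = ( 685, 644, 813, 638, 813, 644, 685, 838 ) x 2;

print period_from_spacing( @first  ), "\n";
print period_from_spacing( @second ), "\n";
```

Both calls print 5760. Running it against the gap lists of several byte values and checking that the results agree would mirror the "3 confirmations" step.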

Conversely, on samples that aren't big enough to capture the repetition, there are no correlations. Job done. Thank you tye.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^2: Analysing a (binary) string. by BrowserUk
in thread Analysing a (binary) string. (Solved) by BrowserUk
