Clear questions and runnable code get the best and fastest answer |
|
PerlMonks |
comment on |
( [id://3333]=superdoc: print w/replies, xml ) | Need Help?? |
Not starting with a complete substring is impossible. Your example 'bcd abcd abcd ab' really is 'bcda bcda bcda b'. So without loss of generality you can always assume that the string starts with the pattern. That is a really astute observation, and one with consequences for my application. (Pretty obvious now you've pointed it out, but it wasn't so before. :) These big strings are really themselves substrings within even larger (effectively infinite) string of repeats, of which I am able to grab a snapshot. I am sampling a lump of that infinite string starting and stopping at some random position, hence the "bit at the beginning and bit at the end" description. The consequence of your observation is that while the repeat does have a definite start, I can never determines that from my snapshot. I can find the length and content of the repeat -- so long as I have sampled enough of th data -- but my version of it may be rotated from the real thing, I don't think that matters for my purpose, but it is good to know. The skip ahead method from earlier discussions (Finding repeat sequences.) is not reliable due to the errors but tye has already proposed an alternative. Yes, the possibility of errors is the reason for needing a new approach. And indeed, tye's notion has allowed me to both find the repeats in samples ranging from 11MB to 31MB very quickly; and discover that 3MB through 8MB is often not enough. This is the code I used based on his idea:
Which produces (severely cut down for posting):
The obvious repetition in the first set of positional differences (2134 + 3626) sums to 5760. That allows me to see the repetition in the second set (685 + 644 + 813 + 638 + 813 + 644 + 685 + 838) = 5760; And in the third set (618 + 26 + 717 + 739 + 8 + 739 + 717 + 26 + 618 + 775 + 2 + 775) = 5760. And with 3 confirmations, I know the repetition size. Conversely, on samples that aren't big enough to capture the repetition, there are no correlations. Job done. Thank you tye. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
In reply to Re^2: Analysing a (binary) string.
by BrowserUk
|
|