Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re: Analysing a (binary) string.

by hdb (Parson)
on Jun 28, 2013 at 07:11 UTC ( #1041161=note: print w/ replies, xml ) Need Help??


in reply to Analysing a (binary) string. (Solved)

Some comments:

  • Not starting with a complete substring is impossible. Your example 'bcd abcd abcd ab' really is 'bcda bcda bcda b'. So without loss of generality you can always assume that the string starts with the pattern.
  • Due to the occasional errors in the string you need to replace equality with some check tolerating a few differences, eg for some offset $i and a given $tolerance
    my $n = length $$strref; # shift/rotate string and compare to original my $diff = $$strref ^ substr( $$strref, -$i ).substr( $$strref, 0, -$i + ); # number of differing characters between shifted string and original my $ndiff = $diff =~ tr/\0//c; return 1 if $ndiff/$n < $tolerance
  • A brute force method (iterating over all offsets) has probably quadratic runtime as a function of string length, but might stop short most of the time if the repeating pattern is not too long compared to the overall string and the tolerance not set too small.
  • The skip ahead method from earlier discussions (Finding repeat sequences.) is not reliable due to the errors but tye has already proposed an alternative.


Comment on Re: Analysing a (binary) string.
Select or Download Code
Re^2: Analysing a (binary) string.
by BrowserUk (Pope) on Jun 28, 2013 at 10:50 UTC
    Not starting with a complete substring is impossible. Your example 'bcd abcd abcd ab' really is 'bcda bcda bcda b'. So without loss of generality you can always assume that the string starts with the pattern.

    That is a really astute observation, and one with consequences for my application. (Pretty obvious now you've pointed it out, but it wasn't so before. :)

    These big strings are really themselves substrings within even larger (effectively infinite) string of repeats, of which I am able to grab a snapshot. I am sampling a lump of that infinite string starting and stopping at some random position, hence the "bit at the beginning and bit at the end" description.

    The consequence of your observation is that while the repeat does have a definite start, I can never determines that from my snapshot. I can find the length and content of the repeat -- so long as I have sampled enough of th data -- but my version of it may be rotated from the real thing,

    I don't think that matters for my purpose, but it is good to know.

    The skip ahead method from earlier discussions (Finding repeat sequences.) is not reliable due to the errors but tye has already proposed an alternative.

    Yes, the possibility of errors is the reason for needing a new approach.

    And indeed, tye's notion has allowed me to both find the repeats in samples ranging from 11MB to 31MB very quickly; and discover that 3MB through 8MB is often not enough.

    This is the code I used based on his idea:

    #! perl -slw use strict; use Data::Dump qw[ pp ]; open I, '<:raw', $ARGV[0] or die $!; my $s = do{ local $/; <I> }; close I; $|++; print length $s; my @c; ++$c[ ord $1 ] while $s =~ m[(.)]g; pp \@c; scalar <STDIN>; for( my $i = $#c; $i; --$i ) { next unless $c[ $i ] > 2; my @p; $p[ @p ] = $-[0] while $s =~ m[${ \chr( $i )}]g; my @spacing = map{ $p[ $_ + 1 ] - $p[ $_ ] } 0 .. $#p-1; print ">>@spacing"; scalar <STDIN>; }

    Which produces (severely cut down for posting):

    5644800 [ undef, 1455300, 1455300, 1656200, 386120, 429240, 184240, 56840, 11760, 7840, undef, 1960, ] >>2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 2134 3626 ... for +1960 values. Use of uninitialized value within @c in numeric gt (>) at C:\ >>685 644 813 638 813 644 685 838 685 644 813 638 813 644 685 ... for +7840 values >>618 26 717 739 8 739 717 26 618 775 2 775 618 26 717 739 8 739 717 2 +6 618 775 2 775 ... for 11760 values. Terminating on signal SIGINT(2)

    The obvious repetition in the first set of positional differences (2134 + 3626) sums to 5760.

    That allows me to see the repetition in the second set (685 + 644 + 813 + 638 + 813 + 644 + 685 + 838) = 5760;

    And in the third set (618 + 26 + 717 + 739 + 8 + 739 + 717 + 26 + 618 + 775 + 2 + 775) = 5760.

    And with 3 confirmations, I know the repetition size.

    Conversely, on samples that aren't big enough to capture the repetition, there are no correlations. Job done. Thank you tye.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I'm glad it turned out to be so helpful. You're welcome, BrowserUk.

      - tye        

      Even though you declared this problem as solved, I have been thinking about the skip ahead technique. It could be used if

      1. you know a minimum length of the pattern, say at least 100,
      2. and you know that the number of errors is small compared to the length of the pattern, say at most 5.
      You could then try finding a repeat of the first 100 chars somewhere later in the string using some tolerant matching technique, say at position 5000, and then lengthen the pattern to the first 5000 characters. This will not be as efficient as the original skip ahead but might speed up things somewhat. Not sure how it compares to the frequency analysis.



        Not sure how it compares to the frequency analysis.

        In the generic, that might be worth pursuing though my (fairly extensive) experience of fuzzy matching techniques is that they are always slow.

        For the specific case of my data, the frequency analysis proved so simple and efficient that it wasn't even worth timing it. My first attempt at the code worked first time and found all the reps that were there to be found in just a few seconds. There was no reason (for me) to pursue this further.

        If my dataset had not proven to be so amenable to the frequency analysis method -- near perfect inverse log distribution -- I might still be looking for another method, but I have plenty of other nuts to crack with this particular dataset :)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1041161]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (7)
As of 2014-07-12 13:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (240 votes), past polls