Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Generic De Bruijn Sequence

by QM (Parson)
on Apr 19, 2017 at 16:38 UTC ( [id://1188292]=CUFP: print w/replies, xml ) Need Help??

I needed to write a sequence for a test, where there are N elements, and all T-tuples of elements are used, in the shortest sequence. A search turned up De Bruijn sequences.

The easy algorithm takes N elements and produces every N-tuple permutation. With a little tinkering, I have the B(N,T) version. Not fast, can't handle large sequences, and runs out of space quickly. But for small T, does what I need.

It quickly became apparent that large T values would not be useful in my test, though N could be about 50.

Here are some results:

> debruijn.pl 3 3 AAACCCBCCACBBCBACABCAABBBABAA > debruijn.pl 4 4 AAAADDDDCDDDBDDDADDCCDDCBDDCADDBCDDBBDDBADDACDDABDDAADCDCDBDCDADCCCDCC +BDCCADCBCDCBBDCBADCACDCABDCAADBDBDADBCCDBCBDBCADBBCDBBBDBBADBACDBABDB +AADADACCDACBDACADABCDABBDABADAACDAABDAAACCCCBCCCACCBBCCBACCABCCAACBCB +CACBBBCBBACBABCBAACACABBCABACAABCAAABBBBABBAABABAAA > debruijn.pl 4 2 AADDCDBDACCBCABBA > debruijn.pl 5 2 AAEEDECEBEADDCDBDACCBCABBA

Here's the code.

#!/usr/bin/env perl # # Generate a string with the longest non-repeating subsequences possib +le. # Include overlaps. # # Input N, size of the alphabet. # Input T, tuples (pairs, triples, quadruples, etc.) # # Start with all N-length permutations. # Create a graph of all pairs A(B...Y)Z, # such that every pair whose left member ends with B...Y, # the right member starts with B...Y (for some length n-1) # Find an Eulerian path through the permutations (visit every node onl +y once) # The sequence of starting node, plus each additional ending letter, # is the De Bruijn sequence for this alphabet. use strict; use warnings; my $n = (shift or 4); # N for alphabet size my $t = (shift or $n); # T for Tuples (pairs, triples, quadruples) my $n1 = $n - 1; my $t1 = $t - 1; my @alphabet = ('A'..'Z','0'..'9','a'..'z'); if (@alphabet < $n) { die "Alphabet is smaller than $n\n"; } # glob character of length 1 my $alphabet = '{' . join(',', @alphabet[0..$n1]) . '}'; # Generate all strings of length $t in the given alphabet my $glob_string = $alphabet x $t; my @nodes = glob("$glob_string"); # Generate the graph of all strings that overlap in t-1 characters. my %graph; for my $node1 (@nodes) { for my $node2 (@nodes) { next if $node1 eq $node2; # If they overlap, add node2 to the array for node1 if (substr($node1,1,$t1) eq substr($node2,0,$t1)) { push @{$graph{$node1}}, $node2; } } } # String starts with first node's full string. # Walk through the graph: # Delete the node behind # Add the last char of next node to string # Print result my $node1 = $nodes[0]; my $q = 0; # print "$q : $node1\n"; # debug my $string = $node1; while (scalar keys %graph > 1) { my $moved = 0; # "reverse" here somehow "does the right thing", and enables an # Eulerian circuit with no added logic. for my $node2 (reverse @{$graph{$node1}}) { if (exists($graph{$node2})) { $string .= substr($node2,$t1,1); # Add last char to string delete($graph{$node1}); $node1 = $node2; $moved = 1; # print ++$q, " : $node2\n"; # debug last; } } # Avoid endless loops on pathological cases unless ($moved) { warn "Didn't find next node ($node1)\n"; last; } } print "$string\n"; exit;

I'm interested in tweaks to make it faster, smaller, better, etc. Or pointers to other solutions.

-QM
--
Quantum Mechanics: The dreams stuff is made of

Replies are listed 'Best First'.
Re: Generic De Bruijn Sequence
by BrowserUk (Patriarch) on Apr 19, 2017 at 17:59 UTC

    Here's a version I wrote for (OT) A different kind of 'combinatorics'. It'll theoretically handle up to a 63 bit DeBruijn sequence, but that would require 8 Etabytes of ram, so is untested :) It has been tested on 31-bit sequence, which require 2GB of ram and completes in under 30 minutes:

    #! perl -slw use strict; use bytes; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; ### Prefer Ones: ### Write n zeros. ### Then, always write a one unless it would cause the repetition of a +n n-length string; ### In which case, write a zero. our $N //= 4; my $t1 = "b${ \(2**$N+$N-1) }"; my $seen = ''; my $mask1 = ( 1<<$N )-1; my $mask2 = ( 1<<( 2**$N+$N-1 ) )-1; my $seq = pack 'Q*', (0) x 100; my $val = 0; for( $N .. 2**$N+$N-1 ) { ## ## if N=5; 5 .. 36; if N=6, 6 .. 64+6- +1 = 69; $val = ( $val << 1 ) & $mask1; vec( $seen, $val | 1, 1 ) or do{ $val |= 1; vec( $seq, $_, 1 ) = 1 +; }; vec( $seen, $val , 1 ) = 1; } print unpack $t1, $seq; __END__

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
      I found a similar post in my search here earlier, but not that one. Probably some issue with whitespace and searching.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        I'd forgotten I'd posted that CufP. (Are you interested in the C version if I can find it?)

Re: Generic De Bruijn Sequence
by tybalt89 (Monsignor) on Apr 19, 2017 at 21:43 UTC

    Here's a solution.

    I'll leave it to you to determine if it's faster, smaller, better, etc.

    #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1188292 use strict; use warnings; my $n = shift // 3; # alphabet size my $t = shift // $n; # size of tuples my @alphabet = ('A'..'Z', '0'..'9', 'a'..'z'); my %next; @next{'', @alphabet} = @alphabet; $_ = $alphabet[0] x $t; # start of string my $over = $alphabet[$n]; # outside of alphabet my $wantedlength = $n ** $t + $t - 1; while( length $_ >= $t ) { if( /^(?=(.{$t})).+\1/ ) { s/^./$next{$&}/; } elsif( s/$over(.)/$next{$1}/ ) { } elsif( $wantedlength == length $_ ) { print; # prove it is correct my %all; $all{$1}++ while /(?=(.{$t}))/g; my $count = keys %all; print "want @{[ $n ** $t ]} unique strings, have $count"; die "duplicate string" if grep $_ > 1, values %all; exit; } else { $_ = $alphabet[0] . $_; } }

    The loop seems awkward, there ought to a cleaner way...

Re: Generic De Bruijn Sequence
by LanX (Saint) on Apr 19, 2017 at 22:49 UTC
    Did you consult wikipedia or search for other algorithms?

    Many sources recommend an algorithm from Frank Ruskey , easily found implemented in various languages, here one in Python De_Bruijn_sequence#Algorithm which is not too hard to be ported.

    Besides: The minimal length k^n is proven to be always achievable.

    Which consequently means at least your first result and hence your algorithm is wrong.

    update
    De Bruijn is supposed to be cyclic!

    AAACCCBCCACBBCBACABCAABBBABAA

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      De Bruijn is supposed to be cyclic AAACCCBCCACBBCBACABCAABBBABAA

      Yes. The last $n-1 chars overlap with the beginning. The debug prints show each permutation lined up with the output string, so you can see how the graph is traversed and the output constructed. I could have just used the last char of each permutation, and had a true De Bruijn output, but that didn't suit my needs.

      I found the Python code, but didn't like it because of the lack of explanation. To understand it, I have to trace code that does a lot of index math, instead of reading comments hinting at an algorithm that makes sense on a higher level. It's not far removed from pointer math and assembler code, so it doesn't seem very Pythonic, as they like to say.

      If nothing else, my code at least tries to say what it's doing. So the next guy can try to understand it, and not blindly accept it, nor need to walk through on paper to believe that it's correct.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        > . The last $n-1 chars overlap with the beginning
        > debruijn.pl 5 2 AAEEDECEBEADDCDBDACCBCABBA 123456789012345678901234567890

        I don't understand how $n=5 (Alphabet size according to your code) results in length 26 instead of 25.

        please explain°

        edit

        >  walk through on paper to believe that it's correct.

        Your user image shows Richard P. Feynman and you are questioning scientific reviews in mathematics? Ironic :)

        UPDATE

        ") did you mean $t-1 ?

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        > so it doesn't seem very Pythonic, as they like to say.

        seems to be a port of the ruby version which is pretty straight forward ported for a Perl hacker

        http://gist.github.com/jonelf/3423148

        update

        Basically it's a inductive solution building complex sequences from more primitive ones starting with (1,1).

        I'm sure porting to Perl, using longer var names and dumping part solutions would explain the idea.

        update

        Though finding the original article would be better.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

      De Bruijn is supposed to be cyclic! AAACCCBCCACBBCBACABCAABBBABAA

      Maybe, but practicality means it is far simpler and more efficient to repeat the end sequence than to wrap around.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
      In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
        Maybe in this case, but check the length of his results, either way they are inconsistent and hence wrong.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Generic De Bruijn Sequence
by tybalt89 (Monsignor) on Apr 19, 2017 at 23:37 UTC

    You wanted tweaks :) This is at least smaller.

    #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1188292 use strict; use warnings; my $n = shift // 3; # alphabet size my $t = shift // $n; # size of tuples my @alphabet = ('A'..'Z', '0'..'9', 'a'..'z'); my (%next, %all); @next{'', @alphabet} = @alphabet; $_ = $alphabet[0] x $t; # start of string my $over = $alphabet[$n]; # outside of alphabet my $need = $n ** $t + $t - 1; 1 while s/^(?=(.{$t})) (?=.+\1) (.) | $over(.)/$next{$+}/x # advance c +har or s/^ (?=.{$t}) (?!.{$need}) /$alphabet[0]/x; # add new char print; # then prove it is correct $all{$1}++ while /(?=(.{$t}))/g; print "want @{[ $n ** $t ]} unique tuples, have @{[ scalar keys %all ] +}";
Re: Generic De Bruijn Sequence
by tybalt89 (Monsignor) on Apr 20, 2017 at 13:43 UTC

    Other replies mentioned the "prefer ones" algorithm. If you think, instead, of it as a "prefer last" algorithm, it appears to work for your stated problem. Start with T of the first character. Then try to add the last character to the answer, moving from last character to second last character to third last character, etc., as long as a repetition exists.

    #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1188292 use strict; use warnings; my $n = shift // 3; # alphabet size my $t = shift // $n; # size of tuples my @alphabet = ('A'..'Z', '0'..'9', 'a'..'z'); my %previous; @previous{@alphabet[1..$#alphabet]} = @alphabet; # last to 2nd last, e +tc. my $need = $n ** $t + $t - 1; # length of the answer $_ = $alphabet[0] x $t; # start string with first chars printf "%77s\n", $_ while s/^ (?=(.{$t})) (?=.+\1) . /$previous{$&}/x # prev char if repeat or $need > length && s/^/$alphabet[$n - 1]/; # or add last char print; # the answer my $chars = join '', @alphabet[0..$n-1]; # test if valid /^[$chars]+$/ or die "invalid character"; my %all; $all{$1}++ while /(?=(.{$t}))/g; print "\nwant @{[ $n ** $t ]} unique tuples, have @{[ scalar keys %all + ]}"; print "solution passes tests";

    I haven't found a case yet where backtracking is required.

      Is this a guess or do you have a formal proof?

      Took me a while to understand why prefer "ones succeeds", and AFAIK only if the alphabet is a power of two.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        It's a guess.

        Does "prefer ones" succeed regardless of the alphabet size?

        My solution is, I think, a "prefer last" solution. Testing a few basic, low valued, parameters appears to give the correct answer.

        Perhaps if someone has a good head for this, they could suggest some parameter values that break under "prefer last" (or indeed, any other method). I don't have the time to tinker, and besides, I'm not that good at poking around the edges.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://1188292]
Approved by herveus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (5)
As of 2024-04-23 15:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found