Generic De Bruijn Sequence

I needed to write a sequence for a test, where there are N elements, and all T-tuples of elements are used, in the shortest sequence. A search turned up De Bruijn sequences.

The easy algorithm takes N elements and produces every N-tuple permutation. With a little tinkering, I have the B(N,T) version. Not fast, can't handle large sequences, and runs out of space quickly. But for small T, does what I need.

It quickly became apparent that large T values would not be useful in my test, though N could be about 50.

Here are some results:

> debruijn.pl 3 3
AAACCCBCCACBBCBACABCAABBBABAA

> debruijn.pl 4 4
AAAADDDDCDDDBDDDADDCCDDCBDDCADDBCDDBBDDBADDACDDABDDAADCDCDBDCDADCCCDCC
+BDCCADCBCDCBBDCBADCACDCABDCAADBDBDADBCCDBCBDBCADBBCDBBBDBBADBACDBABDB
+AADADACCDACBDACADABCDABBDABADAACDAABDAAACCCCBCCCACCBBCCBACCABCCAACBCB
+CACBBBCBBACBABCBAACACABBCABACAABCAAABBBBABBAABABAAA

> debruijn.pl 4 2
AADDCDBDACCBCABBA

> debruijn.pl 5 2
AAEEDECEBEADDCDBDACCBCABBA
[download]

Here's the code.

#!/usr/bin/env perl
#
# Generate a string with the longest non-repeating subsequences possib
+le.
# Include overlaps.
#
# Input N, size of the alphabet.
# Input T, tuples (pairs, triples, quadruples, etc.)
#
# Start with all N-length permutations.
# Create a graph of all pairs A(B...Y)Z,
#   such that every pair whose left member ends with B...Y,
#   the right member starts with B...Y (for some length n-1)
# Find an Eulerian path through the permutations (visit every node onl
+y once)
# The sequence of starting node, plus each additional ending letter,
#   is the De Bruijn sequence for this alphabet.

use strict;
use warnings;

my $n = (shift or 4);  # N for alphabet size
my $t = (shift or $n); # T for Tuples (pairs, triples, quadruples)
my $n1 = $n - 1;
my $t1 = $t - 1;

my @alphabet = ('A'..'Z','0'..'9','a'..'z');

if (@alphabet < $n) {
    die "Alphabet is smaller than $n\n";
}

# glob character of length 1
my $alphabet = '{' . join(',', @alphabet[0..$n1]) . '}';

# Generate all strings of length $t in the given alphabet
my $glob_string = $alphabet x $t;
my @nodes = glob("$glob_string");

# Generate the graph of all strings that overlap in t-1 characters.
my %graph;
for my $node1 (@nodes) {
    for my $node2 (@nodes) {
        next if $node1 eq $node2;
        # If they overlap, add node2 to the array for node1
        if (substr($node1,1,$t1) eq substr($node2,0,$t1)) {
            push @{$graph{$node1}}, $node2;
        }
    }
}

# String starts with first node's full string.
# Walk through the graph:
#   Delete the node behind
#   Add the last char of next node to string
# Print result

my $node1 = $nodes[0];
my $q = 0;
# print "$q : $node1\n"; # debug
my $string = $node1;
while (scalar keys %graph > 1) {
    my $moved = 0;
    # "reverse" here somehow "does the right thing", and enables an 
    # Eulerian circuit with no added logic.
    for my $node2 (reverse @{$graph{$node1}}) {
        if (exists($graph{$node2})) {
            $string .= substr($node2,$t1,1); # Add last char to string
            delete($graph{$node1});
            $node1 = $node2;
            $moved = 1;
#           print ++$q, " : $node2\n"; # debug
            last;
        }
    }
    # Avoid endless loops on pathological cases
    unless ($moved) {
        warn "Didn't find next node ($node1)\n";
        last;
    }
}

print "$string\n";

exit;
[download]

I'm interested in tweaks to make it faster, smaller, better, etc. Or pointers to other solutions.

-QM
--
Quantum Mechanics: The dreams stuff is made of

Comment on Generic De Bruijn Sequence Select or Download Code

Replies are listed 'Best First'.
Re: Generic De Bruijn Sequence by BrowserUk (Patriarch) on Apr 19, 2017 at 17:59 UTC
Here's a version I wrote for (OT) A different kind of 'combinatorics'. It'll theoretically handle up to a 63 bit DeBruijn sequence, but that would require 8 Etabytes of ram, so is untested :) It has been tested on 31-bit sequence, which require 2GB of ram and completes in under 30 minutes: #! perl -slw use strict; use bytes; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; ### Prefer Ones: ### Write n zeros. ### Then, always write a one unless it would cause the repetition of a +n n-length string; ### In which case, write a zero. our $N //= 4; my $t1 = "b${ \(2$N+$N-1) }"; my $seen = ''; my $mask1 = ( 1<<$N )-1; my $mask2 = ( 1<<( 2$N+$N-1 ) )-1; my $seq = pack 'Q', (0) x 100; my $val = 0; for( $N .. 2*$N+$N-1 ) { ## ## if N=5; 5 .. 36; if N=6, 6 .. 64+6- +1 = 69; $val = ( $val << 1 ) & $mask1; vec( $seen, $val \| 1, 1 ) or do{ $val \|= 1; vec( $seq, $_, 1 ) = 1 +; }; vec( $seen, $val , 1 ) = 1; } print unpack $t1, $seq; __END__ [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity. In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit	[reply] [d/l]
Re^2: Generic De Bruijn Sequence by QM (Parson) on Apr 19, 2017 at 18:09 UTC
I found a similar post in my search here earlier, but not that one. Probably some issue with whitespace and searching. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]
Re^3: Generic De Bruijn Sequence by BrowserUk (Patriarch) on Apr 19, 2017 at 20:00 UTC
I'd forgotten I'd posted that CufP. (Are you interested in the C version if I can find it?)	[reply]
Re^4: Generic De Bruijn Sequence by QM (Parson) on Apr 19, 2017 at 22:52 UTC
Re: Generic De Bruijn Sequence by tybalt89 (Monsignor) on Apr 19, 2017 at 21:43 UTC
Here's a solution. I'll leave it to you to determine if it's faster, smaller, better, etc. #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1188292 use strict; use warnings; my $n = shift // 3; # alphabet size my $t = shift // $n; # size of tuples my @alphabet = ('A'..'Z', '0'..'9', 'a'..'z'); my %next; @next{'', @alphabet} = @alphabet; $_ = $alphabet[0] x $t; # start of string my $over = $alphabet[$n]; # outside of alphabet my $wantedlength = $n $t + $t - 1; while( length $_ >= $t ) { if( /^(?=(.{$t})).+\1/ ) { s/^./$next{$&}/; } elsif( s/$over(.)/$next{$1}/ ) { } elsif( $wantedlength == length $_ ) { print; # prove it is correct my %all; $all{$1}++ while /(?=(.{$t}))/g; my $count = keys %all; print "want @{[ $n $t ]} unique strings, have $count"; die "duplicate string" if grep $_ > 1, values %all; exit; } else { $_ = $alphabet[0] . $_; } } [download] The loop seems awkward, there ought to a cleaner way...	[reply] [d/l]
Re: Generic De Bruijn Sequence by LanX (Saint) on Apr 19, 2017 at 22:49 UTC
Did you consult wikipedia or search for other algorithms? Many sources recommend an algorithm from�Frank Ruskey , easily found implemented in various languages, here one in Python De_Bruijn_sequence#Algorithm which is not too hard to be ported. Besides: The minimal length k^n is proven to be always achievable. Which consequently means at least your first result and hence your algorithm is wrong. update De Bruijn is supposed to be cyclic! `AAACCCBCCACBBCBACABCAABBBABAA` Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^2: Generic De Bruijn Sequence by QM (Parson) on Apr 19, 2017 at 23:07 UTC
De Bruijn is supposed to be cyclic AAACCCBCCACBBCBACABCAABBBABAA Yes. The last $n-1 chars overlap with the beginning. The debug prints show each permutation lined up with the output string, so you can see how the graph is traversed and the output constructed. I could have just used the last char of each permutation, and had a true De Bruijn output, but that didn't suit my needs. I found the Python code, but didn't like it because of the lack of explanation. To understand it, I have to trace code that does a lot of index math, instead of reading comments hinting at an algorithm that makes sense on a higher level. It's not far removed from pointer math and assembler code, so it doesn't seem very Pythonic, as they like to say. If nothing else, my code at least tries to say what it's doing. So the next guy can try to understand it, and not blindly accept it, nor need to walk through on paper to believe that it's correct. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]
Re^3: Generic De Bruijn Sequence (updated) by LanX (Saint) on Apr 19, 2017 at 23:16 UTC
> . The last $n-1 chars overlap with the beginning `> debruijn.pl 5 2 AAEEDECEBEADDCDBDACCBCABBA 123456789012345678901234567890` [download] I don't understand how $n=5 (Alphabet size according to your code) results in length 26 instead of 25. please explain� edit > �walk through on paper to believe that it's correct. Your user image shows Richard P. Feynman and you are questioning scientific reviews in mathematics? Ironic :) UPDATE ") did you mean $t-1 ? Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply] [d/l]
Re^4: Generic De Bruijn Sequence (updated) by QM (Parson) on Apr 26, 2017 at 09:57 UTC
Re^3: Generic De Bruijn Sequence by LanX (Saint) on Apr 19, 2017 at 23:47 UTC
> so it doesn't seem very�Pythonic, as they like to say. seems to be a port of the ruby version which is pretty straight forward ported for a Perl hacker http://gist.github.com/jonelf/3423148 update Basically it's a inductive solution building complex sequences from more primitive ones starting with (1,1). I'm sure porting to Perl, using longer var names and dumping part solutions would explain the idea. update Though finding the original article would be better. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^2: Generic De Bruijn Sequence by BrowserUk (Patriarch) on Apr 19, 2017 at 23:03 UTC
De Bruijn is supposed to be cyclic! AAACCCBCCACBBCBACABCAABBBABAA Maybe, but practicality means it is far simpler and more efficient to repeat the end sequence than to wrap around. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity. In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit	[reply]
Re^3: Generic De Bruijn Sequence by LanX (Saint) on Apr 19, 2017 at 23:08 UTC
Maybe in this case, but check the length of his results, either way they are inconsistent and hence wrong. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^4: Generic De Bruijn Sequence by BrowserUk (Patriarch) on Apr 19, 2017 at 23:48 UTC
Re^5: Generic De Bruijn Sequence by LanX (Saint) on Apr 20, 2017 at 00:46 UTC
Some notes below your chosen depth have not been shown here
Re: Generic De Bruijn Sequence by tybalt89 (Monsignor) on Apr 19, 2017 at 23:37 UTC
You wanted tweaks :) This is at least smaller. #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1188292 use strict; use warnings; my $n = shift // 3; # alphabet size my $t = shift // $n; # size of tuples my @alphabet = ('A'..'Z', '0'..'9', 'a'..'z'); my (%next, %all); @next{'', @alphabet} = @alphabet; $_ = $alphabet[0] x $t; # start of string my $over = $alphabet[$n]; # outside of alphabet my $need = $n $t + $t - 1; 1 while s/^(?=(.{$t})) (?=.+\1) (.) \| $over(.)/$next{$+}/x # advance c +har or s/^ (?=.{$t}) (?!.{$need}) /$alphabet[0]/x; # add new char print; # then prove it is correct $all{$1}++ while /(?=(.{$t}))/g; print "want @{[ $n $t ]} unique tuples, have @{[ scalar keys %all ] +}"; [download]	[reply] [d/l]
Re: Generic De Bruijn Sequence by tybalt89 (Monsignor) on Apr 20, 2017 at 13:43 UTC
Other replies mentioned the "prefer ones" algorithm. If you think, instead, of it as a "prefer last" algorithm, it appears to work for your stated problem. Start with T of the first character. Then try to add the last character to the answer, moving from last character to second last character to third last character, etc., as long as a repetition exists. #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1188292 use strict; use warnings; my $n = shift // 3; # alphabet size my $t = shift // $n; # size of tuples my @alphabet = ('A'..'Z', '0'..'9', 'a'..'z'); my %previous; @previous{@alphabet[1..$#alphabet]} = @alphabet; # last to 2nd last, e +tc. my $need = $n $t + $t - 1; # length of the answer $_ = $alphabet[0] x $t; # start string with first chars printf "%77s\n", $_ while s/^ (?=(.{$t})) (?=.+\1) . /$previous{$&}/x # prev char if repeat or $need > length && s/^/$alphabet[$n - 1]/; # or add last char print; # the answer my $chars = join '', @alphabet[0..$n-1]; # test if valid /^[$chars]+$/ or die "invalid character"; my %all; $all{$1}++ while /(?=(.{$t}))/g; print "\nwant @{[ $n $t ]} unique tuples, have @{[ scalar keys %all + ]}"; print "solution passes tests"; [download] I haven't found a case yet where backtracking is required.	[reply] [d/l]
Re^2: Generic De Bruijn Sequence by LanX (Saint) on Apr 20, 2017 at 16:44 UTC
Is this a guess or do you have a formal proof? Took me a while to understand why prefer "ones succeeds", and AFAIK only if the alphabet is a power of two. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!}	[reply]
Re^3: Generic De Bruijn Sequence by tybalt89 (Monsignor) on Apr 20, 2017 at 17:12 UTC
It's a guess.	[reply]
Re^3: Generic De Bruijn Sequence by QM (Parson) on Apr 26, 2017 at 10:05 UTC
Does "prefer ones" succeed regardless of the alphabet size? My solution is, I think, a "prefer last" solution. Testing a few basic, low valued, parameters appears to give the correct answer. Perhaps if someone has a good head for this, they could suggest some parameter values that break under "prefer last" (or indeed, any other method). I don't have the time to tinker, and besides, I'm not that good at poking around the edges. -QM -- Quantum Mechanics: The dreams stuff is made of	[reply]
Re^4: Generic De Bruijn Sequence by Anonymous Monk on Apr 26, 2017 at 15:26 UTC


Problems? Is your data what you think it is?
	PerlMonks

Generic De Bruijn Sequence

update

edit

UPDATE

update

update