I needed to write a sequence for a test, where there are N elements, and all T-tuples of elements are used, in the shortest sequence. A search turned up De Bruijn sequences.
The easy algorithm takes N elements and produces every N-tuple permutation. With a little tinkering, I have the B(N,T) version. Not fast, can't handle large sequences, and runs out of space quickly. But for small T, does what I need.
It quickly became apparent that large T values would not be useful in my test, though N could be about 50.
Here are some results:
> debruijn.pl 3 3
AAACCCBCCACBBCBACABCAABBBABAA
> debruijn.pl 4 4
AAAADDDDCDDDBDDDADDCCDDCBDDCADDBCDDBBDDBADDACDDABDDAADCDCDBDCDADCCCDCC
+BDCCADCBCDCBBDCBADCACDCABDCAADBDBDADBCCDBCBDBCADBBCDBBBDBBADBACDBABDB
+AADADACCDACBDACADABCDABBDABADAACDAABDAAACCCCBCCCACCBBCCBACCABCCAACBCB
+CACBBBCBBACBABCBAACACABBCABACAABCAAABBBBABBAABABAAA
> debruijn.pl 4 2
AADDCDBDACCBCABBA
> debruijn.pl 5 2
AAEEDECEBEADDCDBDACCBCABBA
Here's the code.
#!/usr/bin/env perl
#
# Generate a string with the longest non-repeating subsequences possib
+le.
# Include overlaps.
#
# Input N, size of the alphabet.
# Input T, tuples (pairs, triples, quadruples, etc.)
#
# Start with all N-length permutations.
# Create a graph of all pairs A(B...Y)Z,
# such that every pair whose left member ends with B...Y,
# the right member starts with B...Y (for some length n-1)
# Find an Eulerian path through the permutations (visit every node onl
+y once)
# The sequence of starting node, plus each additional ending letter,
# is the De Bruijn sequence for this alphabet.
use strict;
use warnings;
my $n = (shift or 4); # N for alphabet size
my $t = (shift or $n); # T for Tuples (pairs, triples, quadruples)
my $n1 = $n - 1;
my $t1 = $t - 1;
my @alphabet = ('A'..'Z','0'..'9','a'..'z');
if (@alphabet < $n) {
die "Alphabet is smaller than $n\n";
}
# glob character of length 1
my $alphabet = '{' . join(',', @alphabet[0..$n1]) . '}';
# Generate all strings of length $t in the given alphabet
my $glob_string = $alphabet x $t;
my @nodes = glob("$glob_string");
# Generate the graph of all strings that overlap in t-1 characters.
my %graph;
for my $node1 (@nodes) {
for my $node2 (@nodes) {
next if $node1 eq $node2;
# If they overlap, add node2 to the array for node1
if (substr($node1,1,$t1) eq substr($node2,0,$t1)) {
push @{$graph{$node1}}, $node2;
}
}
}
# String starts with first node's full string.
# Walk through the graph:
# Delete the node behind
# Add the last char of next node to string
# Print result
my $node1 = $nodes[0];
my $q = 0;
# print "$q : $node1\n"; # debug
my $string = $node1;
while (scalar keys %graph > 1) {
my $moved = 0;
# "reverse" here somehow "does the right thing", and enables an
# Eulerian circuit with no added logic.
for my $node2 (reverse @{$graph{$node1}}) {
if (exists($graph{$node2})) {
$string .= substr($node2,$t1,1); # Add last char to string
delete($graph{$node1});
$node1 = $node2;
$moved = 1;
# print ++$q, " : $node2\n"; # debug
last;
}
}
# Avoid endless loops on pathological cases
unless ($moved) {
warn "Didn't find next node ($node1)\n";
last;
}
}
print "$string\n";
exit;
I'm interested in tweaks to make it faster, smaller, better, etc. Or pointers to other solutions.
-QM
--
Quantum Mechanics: The dreams stuff is made of
Re: Generic De Bruijn Sequence
by BrowserUk (Patriarch) on Apr 19, 2017 at 17:59 UTC
|
Here's a version I wrote for (OT) A different kind of 'combinatorics'. It'll theoretically handle up to a 63 bit DeBruijn sequence, but that would require 8 Etabytes of ram, so is untested :) It has been tested on 31-bit sequence, which require 2GB of ram and completes in under 30 minutes:
#! perl -slw
use strict;
use bytes;
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000;
### Prefer Ones:
### Write n zeros.
### Then, always write a one unless it would cause the repetition of a
+n n-length string;
### In which case, write a zero.
our $N //= 4;
my $t1 = "b${ \(2**$N+$N-1) }";
my $seen = '';
my $mask1 = ( 1<<$N )-1;
my $mask2 = ( 1<<( 2**$N+$N-1 ) )-1;
my $seq = pack 'Q*', (0) x 100;
my $val = 0;
for( $N .. 2**$N+$N-1 ) { ## ## if N=5; 5 .. 36; if N=6, 6 .. 64+6-
+1 = 69;
$val = ( $val << 1 ) & $mask1;
vec( $seen, $val | 1, 1 ) or do{ $val |= 1; vec( $seq, $_, 1 ) = 1
+; };
vec( $seen, $val , 1 ) = 1;
}
print unpack $t1, $seq;
__END__
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
Suck that fhit
| [reply] [d/l] |
|
I found a similar post in my search here earlier, but not that one. Probably some issue with whitespace and searching.
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] |
|
| [reply] |
|
Re: Generic De Bruijn Sequence
by tybalt89 (Monsignor) on Apr 19, 2017 at 21:43 UTC
|
Here's a solution.
I'll leave it to you to determine if it's faster, smaller, better, etc.
#!/usr/bin/perl -l
# http://perlmonks.org/?node_id=1188292
use strict;
use warnings;
my $n = shift // 3; # alphabet size
my $t = shift // $n; # size of tuples
my @alphabet = ('A'..'Z', '0'..'9', 'a'..'z');
my %next;
@next{'', @alphabet} = @alphabet;
$_ = $alphabet[0] x $t; # start of string
my $over = $alphabet[$n]; # outside of alphabet
my $wantedlength = $n ** $t + $t - 1;
while( length $_ >= $t )
{
if( /^(?=(.{$t})).+\1/ ) { s/^./$next{$&}/; }
elsif( s/$over(.)/$next{$1}/ ) { }
elsif( $wantedlength == length $_ )
{
print;
# prove it is correct
my %all;
$all{$1}++ while /(?=(.{$t}))/g;
my $count = keys %all;
print "want @{[ $n ** $t ]} unique strings, have $count";
die "duplicate string" if grep $_ > 1, values %all;
exit;
}
else { $_ = $alphabet[0] . $_; }
}
The loop seems awkward, there ought to a cleaner way...
| [reply] [d/l] |
Re: Generic De Bruijn Sequence
by LanX (Saint) on Apr 19, 2017 at 22:49 UTC
|
Did you consult wikipedia or search for other algorithms?
Many sources recommend an algorithm from Frank Ruskey , easily found implemented in various languages, here one in Python De_Bruijn_sequence#Algorithm which is not too hard to be ported.
Besides: The minimal length k^n is proven to be always achievable.
Which consequently means at least your first result and hence your algorithm is wrong.
update
De Bruijn is supposed to be cyclic!
AAACCCBCCACBBCBACABCAABBBABAA
| [reply] |
|
De Bruijn is supposed to be cyclic AAACCCBCCACBBCBACABCAABBBABAA
Yes. The last $n-1 chars overlap with the beginning. The debug prints show each permutation lined up with the output string, so you can see how the graph is traversed and the output constructed. I could have just used the last char of each permutation, and had a true De Bruijn output, but that didn't suit my needs.
I found the Python code, but didn't like it because of the lack of explanation. To understand it, I have to trace code that does a lot of index math, instead of reading comments hinting at an algorithm that makes sense on a higher level. It's not far removed from pointer math and assembler code, so it doesn't seem very Pythonic, as they like to say.
If nothing else, my code at least tries to say what it's doing. So the next guy can try to understand it, and not blindly accept it, nor need to walk through on paper to believe that it's correct.
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] |
|
> . The last $n-1 chars overlap with the beginning
> debruijn.pl 5 2
AAEEDECEBEADDCDBDACCBCABBA
123456789012345678901234567890
I don't understand how $n=5 (Alphabet size according to your code) results in length 26 instead of 25.
please explain°
edit
> walk through on paper to believe that it's correct.
Your user image shows Richard P. Feynman and you are questioning scientific reviews in mathematics? Ironic :)
UPDATE
") did you mean $t-1 ?
| [reply] [d/l] |
|
|
> so it doesn't seem very Pythonic, as they like to say.
seems to be a port of the ruby version which is pretty straight forward ported for a Perl hacker
http://gist.github.com/jonelf/3423148
update
Basically it's a inductive solution building complex sequences from more primitive ones starting with (1,1).
I'm sure porting to Perl, using longer var names and dumping part solutions would explain the idea.
update
Though finding the original article would be better.
| [reply] |
|
| [reply] |
|
Maybe in this case, but check the length of his results, either way they are inconsistent and hence wrong.
| [reply] |
|
|
|
Re: Generic De Bruijn Sequence
by tybalt89 (Monsignor) on Apr 19, 2017 at 23:37 UTC
|
#!/usr/bin/perl -l
# http://perlmonks.org/?node_id=1188292
use strict;
use warnings;
my $n = shift // 3; # alphabet size
my $t = shift // $n; # size of tuples
my @alphabet = ('A'..'Z', '0'..'9', 'a'..'z');
my (%next, %all);
@next{'', @alphabet} = @alphabet;
$_ = $alphabet[0] x $t; # start of string
my $over = $alphabet[$n]; # outside of alphabet
my $need = $n ** $t + $t - 1;
1 while s/^(?=(.{$t})) (?=.+\1) (.) | $over(.)/$next{$+}/x # advance c
+har
or s/^ (?=.{$t}) (?!.{$need}) /$alphabet[0]/x; # add new char
print; # then prove it is correct
$all{$1}++ while /(?=(.{$t}))/g;
print "want @{[ $n ** $t ]} unique tuples, have @{[ scalar keys %all ]
+}";
| [reply] [d/l] |
Re: Generic De Bruijn Sequence
by tybalt89 (Monsignor) on Apr 20, 2017 at 13:43 UTC
|
Other replies mentioned the "prefer ones" algorithm. If you think, instead, of
it as a "prefer last" algorithm, it appears to work for your stated problem. Start with
T of the first character. Then try to add the last character to the answer,
moving from last character to second last character to third last character, etc., as long as a repetition exists.
#!/usr/bin/perl -l
# http://perlmonks.org/?node_id=1188292
use strict;
use warnings;
my $n = shift // 3; # alphabet size
my $t = shift // $n; # size of tuples
my @alphabet = ('A'..'Z', '0'..'9', 'a'..'z');
my %previous;
@previous{@alphabet[1..$#alphabet]} = @alphabet; # last to 2nd last, e
+tc.
my $need = $n ** $t + $t - 1; # length of the answer
$_ = $alphabet[0] x $t; # start string with first chars
printf "%77s\n", $_ while
s/^ (?=(.{$t})) (?=.+\1) . /$previous{$&}/x # prev char if repeat
or $need > length && s/^/$alphabet[$n - 1]/; # or add last char
print; # the answer
my $chars = join '', @alphabet[0..$n-1]; # test if valid
/^[$chars]+$/ or die "invalid character";
my %all;
$all{$1}++ while /(?=(.{$t}))/g;
print "\nwant @{[ $n ** $t ]} unique tuples, have @{[ scalar keys %all
+ ]}";
print "solution passes tests";
I haven't found a case yet where backtracking is required.
| [reply] [d/l] |
|
| [reply] |
|
| [reply] |
|
Does "prefer ones" succeed regardless of the alphabet size?
My solution is, I think, a "prefer last" solution. Testing a few basic, low valued, parameters appears to give the correct answer.
Perhaps if someone has a good head for this, they could suggest some parameter values that break under "prefer last" (or indeed, any other method). I don't have the time to tinker, and besides, I'm not that good at poking around the edges.
-QM
--
Quantum Mechanics: The dreams stuff is made of
| [reply] |
|
|
|