Unless I misunderstood the question, this may be NP complete, but it doesn't take long. Leastwise not if all your numbers are small integers, < 255.
What I did was to encode the arrays of integers into strings, and then build an index to all the N-char length substrings in the 1000 x 1000-char strings. This latter process only takes a few seconds.
Less time infact than generating and encoding the test data. Reading it from a file and encoding it directly to strings would make it quicker still, by saving on the ram for all the arrays.
It takes around 27 seconds (total: Generation, encoding, indexing, counting and printing to nul!) to process 1000 x 1000-number sets on my mid-range machine.
#! perl -slw
use strict;
use Data::Dumper;
$| = 1;
$" = ', ';
our $LEN ||= 1000;
our $COUNT ||= 1000;
our $M ||= 2;
our $N ||= 2;
my @data = map {
[ map{ int rand 256 } 1 .. $LEN ]
} 1 .. $COUNT;
warn "Gen'd data";
my @encoded = map {
pack 'C*', @$_;
} @data;
warn 'Encoded data';
my %seq;
for my $i ( 0 .. $#encoded ) {
for my $o ( 0 .. $LEN - $N ) {
push @{ $seq{ substr $encoded[ $i ], $o, $N } }, $i;
}
}
my $count = 0;
for my $run ( keys %seq ) {
next unless @{ $seq{ $run } } > $M;
$count++;
printf "Run [%*s] appears in lines: %s\n",
$N * 4,
join(',', map{ sprintf '%03d', $_ } unpack 'C*', $run ),
"[@{ $seq{ $run } }]";
}
warn "Found $count $N-number runs appearing in at least $M sets";
__END__
[11:54:20.15] P:\test>412061 -LEN=1000 -COUNT=1000 -N=2 -M=2 >nul
Gen'd data at P:\test\412061.pl line 16.
Encoded data at P:\test\412061.pl line 21.
Found 65533 2-number runs appearing in at least 2 sets at P:\test\4120
+61.pl line 41.
[11:54:47.06] P:\test>412061 -LEN=1000 -COUNT=1000 -N=3 -M=2 >nul
Gen'd data at P:\test\412061.pl line 16.
Encoded data at P:\test\412061.pl line 21.
Found 553 3-number runs appearing in at least 2 sets at P:\test\412061
+.pl line 41.
[11:55:16.46] P:\test>
[11:55:23.76] P:\test>412061 -LEN=1000 -COUNT=1000 -N=3 -M=3 >nul
Gen'd data at P:\test\412061.pl line 16.
Encoded data at P:\test\412061.pl line 21.
Found 9 3-number runs appearing in at least 3 sets at P:\test\412061.p
+l line 41.
[11:55:55.84] P:\test>
[11:55:55.84] P:\test>412061 -LEN=1000 -COUNT=1000 -N=3 -M=3
Gen'd data at P:\test\412061.pl line 16.
Encoded data at P:\test\412061.pl line 21.
Run [ 215,017,145] appears in lines: [546, 657, 671, 986]
Run [ 205,013,147] appears in lines: [326, 360, 385, 975]
Run [ 134,174,216] appears in lines: [335, 512, 539, 592]
Run [ 114,015,201] appears in lines: [470, 544, 557, 989]
Run [ 065,233,176] appears in lines: [533, 656, 789, 930]
Run [ 003,013,171] appears in lines: [17, 169, 460, 979]
Run [ 191,241,173] appears in lines: [25, 88, 442, 782]
Run [ 245,165,229] appears in lines: [373, 405, 599, 697]
Run [ 072,156,003] appears in lines: [86, 153, 464, 544]
Run [ 055,130,150] appears in lines: [73, 127, 157, 384, 868]
Run [ 044,035,182] appears in lines: [3, 402, 734, 764]
Run [ 187,058,006] appears in lines: [171, 487, 924, 944]
Run [ 060,153,213] appears in lines: [72, 404, 708, 976]
Run [ 057,200,093] appears in lines: [221, 246, 514, 908]
Found 14 3-number runs appearing in at least 3 sets at P:\test\412061.
+pl line 41.
[12:04:02.68] P:\test>
Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.