Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
jimt,
I too couldn't leave good enough alone. After realizing an embarrasing oversight (pointed out by ambrus), I decided to try my hand at a perl + Inline::C version. It finishes the 3_477 records in 1 second flat with constant memory consumption (less than 65MB). Furthermore, here is how it stacks up against the 67_108_863 lines of output from 9_448_847 lines of input (see How many words does it take? for more details):
  • Pure perl ver 1 = 343 min
  • Pure perl ver 2 = 288 min
  • Pure perl ver 3 = 251 min
  • Java = 66 min
  • Perl + Inline::C = 3 min
#!/usr/bin/perl use strict; use warnings; use Inline C =>; for my $file (@ARGV) { open(my $fh, '<', $file) or die "Unable to open '$file' for readin +g: $!"; while (<$fh>) { my ($set) = $_ =~ /^(\w+)/; powerset($set, length($set)); } } __END__ __C__ #include <stdio.h> #include <stdlib.h> void powerset (char *set, int len) { static char seen[67108864]; static int init_flag = 0; static long val[128]; int i, j, k, new_len; long bit = 0; char *oldset, *newset; if (! init_flag) { memset(seen, '0', sizeof(seen)); val['a'] = 1; val['b'] = 2; val['c'] = 4; v +al['d'] = 8; val['e'] = 16; val['f'] = 32; val['g'] = 64; v +al['h'] = 128; val['i'] = 256; val['j'] = 512; val['k'] = 1024; v +al['l'] = 2048; val['m'] = 4096; val['n'] = 8192; val['o'] = 16384; v +al['p'] = 32768; val['q'] = 65536; val['r'] = 131072; val['s'] = 262144; v +al['t'] = 524288; val['u'] = 1048576; val['v'] = 2097152; val['w'] = 4194304; v +al['x'] = 8388608; val['y'] = 16777216; val['z'] = 33554432; init_flag = 1; oldset = malloc((len + 1) * sizeof(char)); for (i = 0; i < len; ++i) { oldset[i] = set[i]; } oldset[len] = '\0'; } else { oldset = set; } for (i = 0; i < len; ++i) { bit += val[ oldset[i] ]; } if (seen[bit] == '1') { return; } seen[bit] = '1'; printf("%s\n", oldset); if (len == 1) { return; } new_len = len - 1; newset = malloc((len + 1) * sizeof(char)); for (i = 0; i < len; ++i) { k = 0; for (j = 0; j < i; ++j) { newset[k++] = oldset[j]; } for (j = i + 1; j < len; ++j) { newset[k++] = oldset[j]; } newset[k] = '\0'; powerset(newset, new_len); } free(newset); }

Update: I had solicited feedback on my poor C in the CB and was told that, despite it being a translation of the Perl above, that documentation would go along way in getting the desired comments.

For each $set in a file, the code generates the powerset (minus the empty set). It is designed to skip any subset that has previously been produced. The code to generate the powerset is intentionally inefficient from the perspective of a single $set, but more than makes up for it when the "skip previously seen subset" is applied across all sets in the file.

The algorithm to generate the powerset (ABCD) is as follows:

  • Generate all subsets of size N-1 (ABC, ABC, ACD, BCD)
  • For each new subset, repeat process until size is 1
  • Stop recursing as soon as subset has previously been seen

Because over 67 million (2 ** 26) sets will be generated, keeping track of what has previously been seen can take up a lot of space. There is a pretty straight forward way to translate a set into an integer that fits into a contigous range of 1 .. 2 ** 26. For those interested, the algorithm is the sum of 1 << ord(x) - ord('a') where x is all lowercase chars in the subset. I have created a static lookup table val[] rather than generate the values each time. In perl, I use a bitstring with vec but in C have chosen to use a char array seen[] which takes up (2 ** 26) / 1024 * 1024 or 64MB.

Since some of these arrays need to retain their values between invocations they have been made static. It is basically then just a matter of determining if the passed in string has been seen and if not, allocate a new string of size N - 1. Generate each string of size N - 1 and call self recursively.

Cheers - L~R


In reply to Re^4: Powerset short-circuit optimization by Limbic~Region
in thread Powerset short-circuit optimization by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2024-03-19 02:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found