Beefy Boxes and Bandwidth Generously Provided by pair Networks Cowboy Neal with Hat
No such thing as a small change
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
jimt,
I too couldn't leave good enough alone. After realizing an embarrasing oversight (pointed out by ambrus), I decided to try my hand at a perl + Inline::C version. It finishes the 3_477 records in 1 second flat with constant memory consumption (less than 65MB). Furthermore, here is how it stacks up against the 67_108_863 lines of output from 9_448_847 lines of input (see How many words does it take? for more details):
  • Pure perl ver 1 = 343 min
  • Pure perl ver 2 = 288 min
  • Pure perl ver 3 = 251 min
  • Java = 66 min
  • Perl + Inline::C = 3 min
#!/usr/bin/perl use strict; use warnings; use Inline C =>; for my $file (@ARGV) { open(my $fh, '<', $file) or die "Unable to open '$file' for readin +g: $!"; while (<$fh>) { my ($set) = $_ =~ /^(\w+)/; powerset($set, length($set)); } } __END__ __C__ #include <stdio.h> #include <stdlib.h> void powerset (char *set, int len) { static char seen[67108864]; static int init_flag = 0; static long val[128]; int i, j, k, new_len; long bit = 0; char *oldset, *newset; if (! init_flag) { memset(seen, '0', sizeof(seen)); val['a'] = 1; val['b'] = 2; val['c'] = 4; v +al['d'] = 8; val['e'] = 16; val['f'] = 32; val['g'] = 64; v +al['h'] = 128; val['i'] = 256; val['j'] = 512; val['k'] = 1024; v +al['l'] = 2048; val['m'] = 4096; val['n'] = 8192; val['o'] = 16384; v +al['p'] = 32768; val['q'] = 65536; val['r'] = 131072; val['s'] = 262144; v +al['t'] = 524288; val['u'] = 1048576; val['v'] = 2097152; val['w'] = 4194304; v +al['x'] = 8388608; val['y'] = 16777216; val['z'] = 33554432; init_flag = 1; oldset = malloc((len + 1) * sizeof(char)); for (i = 0; i < len; ++i) { oldset[i] = set[i]; } oldset[len] = '\0'; } else { oldset = set; } for (i = 0; i < len; ++i) { bit += val[ oldset[i] ]; } if (seen[bit] == '1') { return; } seen[bit] = '1'; printf("%s\n", oldset); if (len == 1) { return; } new_len = len - 1; newset = malloc((len + 1) * sizeof(char)); for (i = 0; i < len; ++i) { k = 0; for (j = 0; j < i; ++j) { newset[k++] = oldset[j]; } for (j = i + 1; j < len; ++j) { newset[k++] = oldset[j]; } newset[k] = '\0'; powerset(newset, new_len); } free(newset); }

Update: I had solicited feedback on my poor C in the CB and was told that, despite it being a translation of the Perl above, that documentation would go along way in getting the desired comments.

For each $set in a file, the code generates the powerset (minus the empty set). It is designed to skip any subset that has previously been produced. The code to generate the powerset is intentionally inefficient from the perspective of a single $set, but more than makes up for it when the "skip previously seen subset" is applied across all sets in the file.

The algorithm to generate the powerset (ABCD) is as follows:

  • Generate all subsets of size N-1 (ABC, ABC, ACD, BCD)
  • For each new subset, repeat process until size is 1
  • Stop recursing as soon as subset has previously been seen

Because over 67 million (2 ** 26) sets will be generated, keeping track of what has previously been seen can take up a lot of space. There is a pretty straight forward way to translate a set into an integer that fits into a contigous range of 1 .. 2 ** 26. For those interested, the algorithm is the sum of 1 << ord(x) - ord('a') where x is all lowercase chars in the subset. I have created a static lookup table val[] rather than generate the values each time. In perl, I use a bitstring with vec but in C have chosen to use a char array seen[] which takes up (2 ** 26) / 1024 * 1024 or 64MB.

Since some of these arrays need to retain their values between invocations they have been made static. It is basically then just a matter of determining if the passed in string has been seen and if not, allocate a new string of size N - 1. Generate each string of size N - 1 and call self recursively.

Cheers - L~R


In reply to Re^4: Powerset short-circuit optimization by Limbic~Region
in thread Powerset short-circuit optimization by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others perusing the Monastery: (16)
    As of 2014-04-23 11:09 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      April first is:







      Results (541 votes), past polls