http://www.perlmonks.org?node_id=580758


in reply to Re^3: Powerset short-circuit optimization
in thread Powerset short-circuit optimization

jimt,
I am sad to tell you that, despite being an iterative version that need not be called more than necessary, your code is terribly slow. I timed (not Benchmarked) 4 versions and unfortunately yours wasn't even a contender:

The code to generate the 3_477-line data file and the recursive Java version can be found at How many words does it take?. The two recursive Perl versions are below:

They both share the following code:
#!/usr/bin/perl
use strict;
use warnings;

my %seen;    # sets that have already been printed

for my $file (@ARGV) {
    open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";
    while (<$fh>) {
        my ($set) = $_ =~ /^(\w+)/;
        powerset($set);
    }
}

sub powerset {
    my $set = shift @_;
    return if $seen{$set}++;    # short-circuit: this set was already expanded
    print "$set\n";
    powerset($_) for subsets($set);
}
The 13 second version has subsets() as
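sub subsets {
    my $set = shift @_;
    return if length($set) == 1;
    my ($head, $char, $tail) = ($set, '', '');
    my @ret;
    # Move one character at a time from the end of $head to the front of
    # $tail; $head . $tail is then the set with that character removed.
    while ($head) {
        $char = chop $head;
        push @ret, $head . $tail;
        $tail = $char . $tail;
    }
    return @ret;
}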
The 28 second version has subsets() as
sub subsets {
    my $set = shift @_;
    return if length($set) == 1;
    my @list = split //, $set;
    my $pos = @list;
    my @ret;
    # For each position, from last to first, join every character except
    # the one at that position.
    while ($pos--) {
        push @ret, join '', @list[grep $_ != $pos, 0 .. $#list];
    }
    return @ret;
}
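For anyone who would rather Benchmark than time by hand, something along these lines might do. It is only a sketch: the names subsets_chop and subsets_slice are made up here (both are simply subsets() above), the 8-letter test string is arbitrary, and it measures subsets() in isolation rather than the whole run over the data file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The chop-based subsets() from the 13 second version, renamed for comparison.
sub subsets_chop {
    my $set = shift @_;
    return if length($set) == 1;
    my ($head, $char, $tail) = ($set, '', '');
    my @ret;
    while ($head) {
        $char = chop $head;
        push @ret, $head . $tail;
        $tail = $char . $tail;
    }
    return @ret;
}

# The split/slice-based subsets() from the 28 second version, renamed.
sub subsets_slice {
    my $set = shift @_;
    return if length($set) == 1;
    my @list = split //, $set;
    my $pos = @list;
    my @ret;
    while ($pos--) {
        push @ret, join '', @list[grep $_ != $pos, 0 .. $#list];
    }
    return @ret;
}

# Sanity check first: both variants should return the same subsets.
my $word = 'abcdefgh';    # arbitrary sample set
die "variants disagree\n"
    unless join(',', subsets_chop($word)) eq join(',', subsets_slice($word));

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    chop  => sub { my @r = subsets_chop($word) },
    slice => sub { my @r = subsets_slice($word) },
});
```

The absolute rates will vary by machine, so only the relative ordering in the table is of interest.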

I made minor modifications to your code to handle my dataset as well as produce comparable output:

# All references to $calls removed

# $limbic_sets = [ ... ]
# foreach my $limbic_set (@$limbic_sets) { ... }
# The above two lines became
open(my $fh, '<', 'phase1.data') or die $!;
while ( <$fh> ) {
    my ($limbic_set) = $_ =~ /^(\w+)/;
    $limbic_set = [ split //, $limbic_set ];
    # ...
}

# removed print "checks set @$limbic_set\n";

# my $format = "%2s" x scalar(@$padded_limbic_set) . " (%d)\n";
# printf($format, (map {defined $_ && $display->{$_} ? $_ : ' '} @$padded_limbic_set), $idx);
# The above 2 lines became
print join '', map {defined $_ && $display->{$_} ? $_ : ''} @$padded_limbic_set;
print "\n";

Update 1: After your 3rd update, your code finished in a respectable 78 seconds. Unfortunately, it is still producing about 40% duplicates. Additionally, it doesn't produce the correct output (it is missing 92_835 strings out of 508_062). For instance, 'cdglnst' does not appear at all in your output.
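Duplicate and missing counts like these can be obtained by comparing one program's output against a reference list. The following is a minimal sketch of such a check, not the script actually used for the numbers above: compare_output is a hypothetical helper, and the two small sample lists are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given a reference list and a list under test,
# return the number of duplicate lines in the list under test and an
# array ref of reference strings that never appear in it.
sub compare_output {
    my ($expected, $got) = @_;
    my %count;
    $count{$_}++ for @$got;

    # Every occurrence beyond the first counts as a duplicate.
    my $dups = 0;
    $dups += $_ - 1 for values %count;

    my @missing = grep { !$count{$_} } @$expected;
    return ($dups, \@missing);
}

# Invented sample data: the full powerset of 'abc' versus a faulty run.
my @expected = qw(abc ab a b ac c bc);
my @got      = qw(abc ab a ab b a);

my ($dups, $missing) = compare_output(\@expected, \@got);
printf "%d duplicate lines, %d missing: %s\n",
    $dups, scalar @$missing, join ' ', sort @$missing;
```

With real data, the two lists would be read from the output files of the programs being compared.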

Update 2: After your 4th update, your code narrowly makes 3rd place with 26 seconds and correct output! I have included the entire Perl script I am using above to ensure we are comparing apples to apples. Admittedly, yours does scale much better in both speed and memory. Unfortunately, it still isn't quite up to the task I needed. I will have to put this in my back pocket for later though.

Cheers - L~R