Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

Re: nested combinations: algorithm advice?

by Jasper (Chaplain)
on Sep 22, 2004 at 09:31 UTC ( #392876=note: print w/replies, xml ) Need Help??

in reply to nested combinations: algorithm advice?

Like Ted said, it's good to tell us what the code is supposed to do, instead of making us guess. Having said that, I did make a guess - the code seems to me to pick unique combinations of words where there are more than one matches?
This code produces the same results, anyway, and seems to be quicker. Less looping. I don't know what would happen if you had one_one_one and one_one_one, though.
chomp(my @lines = <DATA>); my %seen; for my $line1 (@lines) { for my $line2 (grep $_ ne $line1, @lines) { if (1 < matches($line1, $line2)) { $seen{ join ' and ', sort $line1, $line2 } = 1; } } } print "$_\n" for keys %seen; sub matches { my ($line1, $line2) = @_; my %words1 = map {$_ => 1} split /_/, $line1; my $matches = 0; $matches += ($words1{$_}||0) for split /_/, $line2; return $matches; }

I'm probably missing something obvious..
edit: s/much quicker/quicker/ :) I made a mistake in my benchmarking.

Replies are listed 'Best First'.
Re^2: nested combinations: algorithm advice?
by revdiablo (Prior) on Sep 22, 2004 at 16:44 UTC

    Thanks for the reply, and good guess! I've updated my original post to explain a bit more about what I'm trying to do.

    I like your idea, but note one thing about your solution -- it compares lines to eachother twice. When it's on line 1, it compares to line 2, and when it's on line 2, it compares to line 1 (even though it already has). That's why tye's combinations sub is really nice, it cleanly eliminates that problem. I'm not sure if using combinations is faster or slower (intuition tells me it would be faster, since it's comparing less, but perhaps there's some overhead getting in the way), but the duplication was driving me crazy.

      Yes, I tried doing the seen sort join thing before the matches thereby avoiding the 2 - 1, 1 - 2 overhead, but with the data given, there was no benefit (in fact it was slower).

      I did benchmark your code against mine, and on the DATA, mine was about 40% faster. Possibly on a much more extensive set of data, the difference wouldn't be so great.
        with the data given, there was no benefit

        The given data was only a sample. [Yes, I probably should have said this. In retrospect, there are quite a few things I could have done better with my original question.] The actual data has several thousand lines. Yours may still be faster, but then there would be the additional step of eliminating the duplicate combinations.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://392876]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (9)
As of 2019-12-13 15:35 GMT
Find Nodes?
    Voting Booth?

    No recent polls found