Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses

Re: nested combinations: algorithm advice?

by bmann (Priest)
on Sep 22, 2004 at 15:58 UTC ( #392943=note: print w/replies, xml ) Need Help??

in reply to nested combinations: algorithm advice?

You want to compare each line to every other line and print only those with more than one term in common, right?

Assuming that's what you're looking for, you don't need to worry about combinations. Compare the first element with every other element. Then compare the second with every other element except the first. Repeat until complete.

Here's a possibly cleaner way to the same result. If your data is more complex (it probably is ;) you may need to tweak the regex.

#!/usr/bin/perl use strict; use warnings; chomp (my @lines = <DATA>); for my $i (0 .. $#lines) { (my $re = $lines[$i]) =~ s/_/|/g; # create $re outside the inner l +oop for my $j ($i + 1 .. $#lines) { my $count = () = $lines[$j] =~ /$re/g; if ($count >= 2) { print "$lines[$i] and $lines[$j]\n"; } } } __DATA__ one_two one_three_two three_one one_four four_three_one

Replies are listed 'Best First'.
Re^2: nested combinations: algorithm advice?
by Jasper (Chaplain) on Sep 22, 2004 at 17:11 UTC
    Erg, teach me to start optimising in the most stupdi places. I have modded my code to loop like yours, and moved my hash creation outside the inner loop (what a dummy!) and ditched the sub (which was mostly there for clarity, anyway). Then we get:
    my $out; for my $i (0 .. $#lines) { my %words1 = map { $_ => 1 } split /_/, $lines[$i]; for my $j($i + 1 .. $#lines) { if (1 < grep $words1{$_}, split /_/, $lines[$j]) { $out .= "$lines[$i] and $lines[$j]\n"; } } } $out;

    Faster! Benchmark:
    Benchmark: running bmann, jasper, rev for at least 3 CPU seconds... bmann: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 88 +28.71/s (n=27987) jasper: 3 wallclock secs ( 3.15 usr + 0.00 sys = 3.15 CPU) @ 10 +860.32/s (n=34210) rev: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 22 +06.94/s (n=6996)
Re^2: nested combinations: algorithm advice?
by revdiablo (Prior) on Sep 22, 2004 at 16:50 UTC

    Excellent! I thought there might be a nicer way than the brute-force combinations. Also, using regexes instead of looping on the word lists is nice -- let the regex engine do the dirty work. Thanks for the reply, I will probably steal your ideas. :)

      Glad you liked it. One caveat though - craft your regex with care if you use this (or use a hash like jasper's version below, which doesn't suffer from this problem). The regex I used was naive in that it doesn't detect the boundary between the words. Add "twop_threep" to DATA, you'll see a false positive.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://392943]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2020-09-19 15:57 GMT
Find Nodes?
    Voting Booth?
    If at first I donít succeed, I Ö

    Results (114 votes). Check out past polls.