Re: compare a list against multiple lists

Is this sort of what you had in mind?

The Code

#!/usr/bin/env perl
 
use 5.014;
use warnings;
 
my %kw;   # $kw{kw}->{name} = Number of times kw appears in name.txt
my %name; # Inverse of %kw; $name{name}->{keyword}
 
read_words() for <*.txt>;
 
# Now we can do all sorts of useful things with the two hashes: 
say "$_ has " . (keys %{ $name{$_} }) . " unique words" for sort keys %name;
say '';
 
# Keywords ordered by occurrence count
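# (explicit %{...} derefs: "keys $hashref" was experimental and removed in Perl 5.24)
for my $kw (sort { scalar(keys %{$kw{$b}}) <=> scalar(keys %{$kw{$a}}) } keys %kw) {
    my $count = keys %{$kw{$kw}};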
    printf "%10s appears in %2d file%s: %s\n",
        $kw, $count, $count > 1 ? 's' : ' ',
        join(', ', sort keys %{$kw{$kw}});
}

# Pull in the word lists.
sub read_words { 
    open my $fh, '<', $_ or die "Can't open $_: $!";
    my $name = s/\.txt$//r;
    
    while (<$fh>) {
        chomp;
        $kw{$_}->{$name}++;
        $name{$name}->{$_}++;
    }
    close $fh;
}

Input

Reads all *.txt files in the current directory. Each text file is expected to contain exactly one keyword per line. For example:

al.txt:
abel
abel
baker
camera
delta
edward
fargo
golfer
jerky
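The script trusts that "one keyword per line" holds exactly. If your lists might contain blank lines, stray spaces, or Windows line endings, something like the following could replace the plain chomp. This is only a sketch: read_keywords is a hypothetical helper, not part of the script above, and the sample text is made up to exercise the edge cases.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): read one keyword per
# line from an open filehandle, trimming surrounding whitespace (including
# a trailing \r) and skipping blank lines. Returns a hashref of counts.
sub read_keywords {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;   # trim both ends
        next if $line eq '';       # ignore empty lines
        $count{$line}++;
    }
    return \%count;
}

# Made-up sample: a duplicate, a blank line, padding, and a CRLF ending.
my $sample = "abel\nabel\n\n  baker \r\ncamera\n";
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my $count = read_keywords($fh);
close $fh;

print "$_: $count->{$_}\n" for sort keys %$count;
```

The in-memory filehandle is only there so the sketch runs without any files on disk; in the real script you would pass the handle opened in read_words().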

Output

al has 8 unique words
bob has 7 unique words
carmen has 6 unique words
don has 3 unique words
ed has 3 unique words

     fargo appears in  5 files: al, bob, carmen, don, ed
     jerky appears in  4 files: al, carmen, don, ed
      icon appears in  3 files: carmen, don, ed
    golfer appears in  3 files: al, bob, carmen
    edward appears in  2 files: al, bob
    camera appears in  2 files: al, bob
     delta appears in  2 files: al, bob
    hilton appears in  2 files: bob, carmen
     baker appears in  2 files: al, bob
     kappa appears in  1 file : carmen
      abel appears in  1 file : al

Efficiency

Memory: O(2nc), where n is the number of distinct (keyword, file) pairs and c is the average keyword length; each pair is stored once in %kw and once in %name.

Execution: Most operations become near-O(1) (constant time), including counting number of keywords. Obviously looping to display each keyword as I have done will incur n total lookups; this is the best possible order.
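To make the constant-time claim a bit more concrete, here is a rough sketch of the kind of lookups the two hashes allow. The data is hand-built for illustration (abel matches al.txt above; the other entries are a trimmed-down subset), and file_count and occurrences are hypothetical helper names, not something the script defines.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hand-built stand-in for %kw (illustrative data, not read from files).
my %kw = (
    abel  => { al => 2 },
    baker => { al => 1, bob => 1 },
    fargo => { al => 1, bob => 1, carmen => 1 },
);

# Build the inverse index %name from %kw, mirroring what read_words() stores.
my %name;
for my $word (keys %kw) {
    $name{$_}{$word} = $kw{$word}{$_} for keys %{ $kw{$word} };
}

# Hypothetical helper: number of files containing $word.
# One hash lookup plus a key count, no scan over the word lists.
sub file_count {
    my ($word) = @_;
    return exists $kw{$word} ? scalar(keys %{ $kw{$word} }) : 0;
}

# Hypothetical helper: how often $word occurs in $file.
# The first exists() check avoids autovivifying $name{$file}.
sub occurrences {
    my ($file, $word) = @_;
    return 0 unless exists $name{$file} && exists $name{$file}{$word};
    return $name{$file}{$word};
}

printf "fargo is in %d files\n",          file_count('fargo');
printf "abel occurs %d times in al.txt\n",  occurrences('al',  'abel');
printf "abel occurs %d times in bob.txt\n", occurrences('bob', 'abel');
```

Neither helper loops over the word lists themselves, which is the point of keeping both %kw and its inverse around.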


Re^2: compare a list against multiple lists by raiten (Acolyte) on Mar 20, 2013 at 15:40 UTC
That's an impressive piece of code, and so small. Thanks a lot. It's almost perfect for what I'm looking for, and I get the output. Now I need to restrict it to one reference file which I compare with the others, here al.txt for example. After a bit of customization:

## only display for referenced file
my %counting_list;
for my $key (keys %{$name{$ref}}) {
    my $count = $name{$ref}{$key}-1; ## -1 to remove reference file
    #print "n: $key, c $count\n";
    #print Dumper(\$name{$ref});
    if ($verbose == 1) {
        my $count = keys %{$kw{$key}};
        printf "=> '%s' appears in %2d file%s: '%s'\n",
            $key, $count, $count > 1 ? 's' : ' ',
            join(', ', sort keys %{$kw{$key}});
        if ($count == 2) {
            my $mylist;
            foreach my $k (keys %{$kw{$key}}) {
                if (!($k eq $ref)) { $mylist = $k; }
            }
            $counting_list{$mylist}++;
        } elsif ($count == 3) {
            $counting_list{'2lists'}++;
        } elsif ($count == 4) {
            $counting_list{'3lists'}++;
        } elsif ($count > 4) {
            $counting_list{'4more'}++;
        }
    }
}
}

## summary output
my $max = keys (%filelist);
$filelist{ $max } = '2lists';
$filelist{ $max+1 } = '3lists';
$filelist{ $max+2 } = '4more';
foreach my $list (keys %filelist) {
    #print "nolist:X list1a:X list1b:X list2+:X list3+:X\n";
    if ($counting_list{ $filelist{$list} }) {
        print "$filelist{$list}:$counting_list{ $filelist{$list} } ";
    } else {
        print "$filelist{$list}:0 ";
    }
}
print "\n";
Re^3: compare a list against multiple lists by rjt (Curate) on Mar 21, 2013 at 23:28 UTC
Glad to help. As for your next question, is this what you're after?

      abel appears  0 times in  0 files (not counting al.txt)
     baker appears  1 times in  1 files (not counting al.txt)
    camera appears  1 times in  1 files (not counting al.txt)
     delta appears  2 times in  1 files (not counting al.txt)
    edward appears  1 times in  1 files (not counting al.txt)
     fargo appears  4 times in  4 files (not counting al.txt)
    golfer appears  2 times in  2 files (not counting al.txt)
     jerky appears  3 times in  3 files (not counting al.txt)

If so, try the following code somewhere below the call to `read_words()`:

my $name = 'al';
for my $kw (sort keys %{ $name{$name} }) {
    my $files = -1 + keys %{ $kw{$kw} }; # Do not include original file
    my $count = -$kw{$kw}->{$name};      # Start by cancelling out the original file's own count
    $count += $kw{$kw}->{$_} for keys %{ $kw{$kw} };
    printf "%10s appears %2d times in %2d files (not counting %s.txt)\n",
        $kw, $count, $files, $name;
}

I confess I really don't see what you're trying to do with the `$filelist{$max+1}` stuff. If I'm off the mark above, might I suggest you post some specific sample output you'd like to see, given the original inputs? Pseudocode is fine, but not if we don't know what it's supposed to look like. :-)

