
Re: compare a list against multiple lists

by rjt (Deacon)
on Mar 19, 2013 at 23:41 UTC

in reply to compare a list against multiple lists

Is this sort of what you had in mind?

The Code

#!/usr/bin/env perl
use 5.014;
use warnings;

my %kw;   # $kw{kw}->{name} = Number of times kw appears in name.txt
my %name; # Inverse of %kw; $name{name}->{keyword}

read_words() for <*.txt>;

# Now we can do all sorts of useful things with the two hashes:

say "$_ has " . (keys %{ $name{$_} }) . " unique words" for sort keys %name;
say '';

# Keywords ordered by occurrence count
for my $kw (sort { keys %{ $kw{$b} } <=> keys %{ $kw{$a} } } keys %kw) {
    my $count = keys %{ $kw{$kw} };
    printf "%10s appears in %2d file%s: %s\n",
        $kw, $count, $count > 1 ? 's' : ' ',
        join(', ', sort keys %{ $kw{$kw} });
}

# Pull in the word lists.
sub read_words {
    open my $fh, '<', $_ or die "Can't open $_: $!";
    my $name = s/\.txt$//r;
    while (<$fh>) {
        chomp;
        $kw{$_}->{$name}++;
        $name{$name}->{$_}++;
    }
    close $fh;
}


Reads all *.txt files in the current directory. Each text file is expected to contain exactly one keyword per line. For example:

al.txt:

abel
abel
baker
camera
delta
edward
fargo
golfer
jerky


Output:

al has 8 unique words
bob has 7 unique words
carmen has 6 unique words
don has 3 unique words
ed has 3 unique words

     fargo appears in  5 files: al, bob, carmen, don, ed
     jerky appears in  4 files: al, carmen, don, ed
      icon appears in  3 files: carmen, don, ed
    golfer appears in  3 files: al, bob, carmen
    edward appears in  2 files: al, bob
    camera appears in  2 files: al, bob
     delta appears in  2 files: al, bob
    hilton appears in  2 files: bob, carmen
     baker appears in  2 files: al, bob
     kappa appears in  1 file : carmen
      abel appears in  1 file : al


Memory: roughly O(pc), where p is the number of distinct (keyword, file) pairs and c is the average keyword length. Each pair is stored twice, once in %kw and once in %name.

Execution: Most operations become near-O(1) (constant time), including counting the number of keywords. Obviously, looping to display each keyword as I have done incurs n total lookups, which is the best possible order for the display itself; sorting the keywords by count first adds O(n log n).
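
To make the constant-time claim concrete, here is a minimal sketch of the kind of lookups the two hashes allow. It builds %kw and %name from in-memory lists rather than from *.txt files; the %lists hash and the 'zed' list are hypothetical and exist only for illustration, while 'al' matches the al.txt example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical in-memory word lists standing in for the *.txt files.
# 'al' matches the al.txt example; 'zed' is made up for illustration.
my %lists = (
    al  => [qw(abel abel baker camera delta edward fargo golfer jerky)],
    zed => [qw(baker fargo fargo zulu)],
);

# Same two-hash layout as in the main script.
my (%kw, %name);
for my $file (keys %lists) {
    for my $word (@{ $lists{$file} }) {
        $kw{$word}{$file}++;      # keyword -> file -> occurrences
        $name{$file}{$word}++;    # file -> keyword -> occurrences
    }
}

# Each of these is a plain hash lookup; none rescans the word lists.
my $in_zed      = exists $kw{baker}{zed} ? 1 : 0;   # is 'baker' in zed?
my $abel_in_al  = $name{al}{abel};                  # occurrences of 'abel' in al
my $al_unique   = keys %{ $name{al} };              # number of unique words in al
my $fargo_files = keys %{ $kw{fargo} };             # number of files containing 'fargo'

print "baker in zed:       $in_zed\n";        # 1
print "abel count in al:   $abel_in_al\n";    # 2
print "unique words in al: $al_unique\n";     # 8
print "files with fargo:   $fargo_files\n";   # 2
```

Keeping both directions costs the extra memory noted above, but it means neither "which files contain this keyword?" nor "which keywords does this file contain?" needs a scan.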

Re^2: compare a list against multiple lists
by raiten (Acolyte) on Mar 20, 2013 at 15:40 UTC

    That's an impressive piece of code and so small. Thanks a lot.

    It's almost perfect for what I'm looking for, and I get the output. I now need to restrict it to one reference file that I compare against the others, al.txt for example.

    After a bit of customization:

    ## only display for referenced file
    my %counting_list;
    for my $key (keys %{ $name{$ref} }) {
        my $count = $name{$ref}{$key} - 1;    ## -1 to remove reference file
        #print "n: $key, c $count\n";
        #print Dumper(\$name{$ref});
        if ($verbose == 1) {
            my $count = keys %{ $kw{$key} };
            printf "=> '%s' appears in %2d file%s: '%s'\n",
                $key, $count, $count > 1 ? 's' : ' ',
                join(', ', sort keys %{ $kw{$key} });
            if ($count == 2) {
                my $mylist;
                foreach my $k (keys %{ $kw{$key} }) {
                    if ($k ne $ref) {
                        $mylist = $k;
                    }
                }
                $counting_list{$mylist}++;
            }
            elsif ($count == 3) { $counting_list{'2lists'}++; }
            elsif ($count == 4) { $counting_list{'3lists'}++; }
            elsif ($count > 4)  { $counting_list{'4more'}++; }
        }
    }

    ## summary output
    my $max = keys(%filelist);
    $filelist{ $max }     = '2lists';
    $filelist{ $max + 1 } = '3lists';
    $filelist{ $max + 2 } = '4more';
    foreach my $list (keys %filelist) {
        #print "nolist:X list1a:X list1b:X list2+:X list3+:X\n";
        if ($counting_list{ $filelist{$list} }) {
            print "$filelist{$list}:$counting_list{ $filelist{$list} } ";
        }
        else {
            print "$filelist{$list}:0 ";
        }
    }
    print "\n";

      Glad to help. As for your next question, is this what you're after?

            abel appears  0 times in  0 files (not counting al.txt)
           baker appears  1 times in  1 files (not counting al.txt)
          camera appears  1 times in  1 files (not counting al.txt)
           delta appears  2 times in  1 files (not counting al.txt)
          edward appears  1 times in  1 files (not counting al.txt)
           fargo appears  4 times in  4 files (not counting al.txt)
          golfer appears  2 times in  2 files (not counting al.txt)
           jerky appears  3 times in  3 files (not counting al.txt)

      If so, try the following code somewhere below the call to read_words():

      my $name = 'al';
      for my $kw (sort keys %{ $name{$name} }) {
          my $files = -1 + keys %{ $kw{$kw} };   # Do not include original file
          my $count = -$kw{$kw}->{$name};        # Cancel out the original file's own count
          $count += $kw{$kw}->{$_} for keys %{ $kw{$kw} };
          printf "%10s appears %2d times in %2d files (not counting %s.txt)\n",
              $kw, $count, $files, $name;
      }

      I confess I really don't see what you're trying to do with the $filelist{$max+1} stuff. If I'm off the mark, above, might I suggest you post some specific sample output you'd like to see, given the original inputs. Pseudo code is fine, but not if we don't know what it's supposed to look like. :-)
