PerlMonks  

compare a list against multiple lists

by raiten (Acolyte)
on Mar 19, 2013 at 21:09 UTC (#1024389=perlquestion)
raiten has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I want to compare a list of keywords against multiple other lists and report how many keywords match each individual list, how many appear in 2 lists, in 3 lists, and so on. The keywords come from text files (separated by \n, \t, comma or semicolon), but could also come from a DBI database in the future. It must scale, as each list can hold from a thousand to a hundred thousand keywords.

Reading from a text file seems easy, but I'm not sure about performance:
http://www.perlmonks.org/?node_id=45868
http://stackoverflow.com/questions/761392/easiest-way-to-open-a-text-file-and-read-it-into-an-array-with-perl
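
A minimal sketch of the file-reading step, assuming the separators are exactly newline, tab, comma and semicolon as described above. The helper names `parse_keywords` and `read_keywords` are hypothetical, and storing the words as hash keys (rather than in an array) is a design assumption that removes duplicates and keeps later lookups cheap:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: split text on newline, tab, comma or semicolon
# and return the unique keywords as a hash reference.
sub parse_keywords {
    my ($text) = @_;
    my %seen;
    for my $word (split /[\n\t,;]+/, $text) {
        $word =~ s/^\s+|\s+$//g;     # trim stray spaces and \r
        next unless length $word;    # skip empty fields
        $seen{$word} = 1;
    }
    return \%seen;
}

# Hypothetical helper: slurp a whole file and parse it.
sub read_keywords {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;                        # slurp mode: read the file in one go
    my $text = <$fh>;
    close $fh;
    return parse_keywords($text // '');
}

my $set = parse_keywords("abel,abel;baker\tcamera\ndelta\n");
print join(' ', sort keys %$set), "\n";    # abel baker camera delta
```

Slurping is only one option; for very large files, reading line by line and splitting each line on the same pattern should give the same result with less memory.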

While googling for list comparison, I found these two interesting solutions:
http://stackoverflow.com/questions/720482/how-can-i-verify-that-a-value-is-present-in-an-array-list-in-perl
http://search.cpan.org/~jkeenan/List-Compare-0.37/lib/List/Compare.pm#Multiple_Case:_Compare_Three_or_More_Lists

List::Compare seems the most promising; I just have to optimise the text-file-to-array part.

    use List::Compare;

    ## Al is the reference list, compared against the others
    @Al     = qw(abel abel baker camera delta edward fargo golfer jerky);
    @Bob    = qw(baker camera delta delta edward fargo golfer hilton);
    @Carmen = qw(fargo golfer hilton icon icon jerky kappa);
    @Don    = qw(fargo icon jerky);
    @Ed     = qw(fargo icon icon jerky);
    my %list = (0 => 'Al', 1 => 'Bob', 2 => 'Carmen', 3 => 'Don', 4 => 'Ed');

    $lcm = List::Compare->new(\@Al, \@Bob, \@Carmen, \@Don, \@Ed);
    if (@intersectionAll = $lcm->get_intersection) {
        $all = (@intersectionAll);
    }

    for (my $j = 1; $j < 5; ++$j) {
        $lcm0 = List::Compare->new(\@{$list{0}}, \@{$list{$j}});
        $intername = "intersection-0-$j";
        if (@{$intername} = $lcm0->get_intersection) {
            ${"count-$intername"} = (@{$intername});
        }
    }

    ## howto get keywords count which are in 2 lists, 3 lists, ... ?

    my $out = "";
    for (my $k = 1; $k < 5; ++$k) {
        $out .= "count-$list{$k}:".${"count-intersection-0-$k"}." ";
    }
    $out .= " all:$all\n";
    print $out;

But how can I get the keyword counts across multiple lists, so that the output is:

    count-Bob:6 count-Carmen:3 count-Don:2 count-Ed:2 count2+:0 count3+:2

where count3+ is the number of keywords that appear in at least 3 lists?
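
One way to get such counts without List::Compare is a hash of hashes keyed by keyword. The sketch below is only an illustration: the names `%lists`, `%in`, `%shared` and `%at_least` are hypothetical, and it assumes that "countK+" means "keywords from the reference list that also appear in K or more of the other lists", which may not be exactly the definition intended above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %lists = (
    Al     => [qw(abel abel baker camera delta edward fargo golfer jerky)],
    Bob    => [qw(baker camera delta delta edward fargo golfer hilton)],
    Carmen => [qw(fargo golfer hilton icon icon jerky kappa)],
    Don    => [qw(fargo icon jerky)],
    Ed     => [qw(fargo icon icon jerky)],
);
my $ref = 'Al';    # reference list

# $in{keyword}{list} = 1 when the keyword occurs in that list
my %in;
for my $name (keys %lists) {
    $in{$_}{$name} = 1 for @{ $lists{$name} };
}

my (%shared, %at_least);
for my $kw (keys %in) {
    next unless $in{$kw}{$ref};            # only keywords of the reference list
    my @others = grep { $_ ne $ref } keys %{ $in{$kw} };
    $shared{$_}++   for @others;           # size of intersection with each list
    $at_least{$_}++ for 1 .. @others;      # cumulative "K or more lists" buckets
}

my @names = grep { $_ ne $ref } sort keys %lists;
my $out = join ' ', map { "count-$_:" . ($shared{$_} // 0) } @names;
$out .= " count$_+:" . ($at_least{$_} // 0) for 2 .. 3;
print "$out\n";
```

With the sample lists and the assumed definition, this prints `count-Bob:6 count-Carmen:3 count-Don:2 count-Ed:2 count2+:3 count3+:2`. Every keyword is visited once per list it occurs in, so the work grows with the total number of keywords rather than with the number of list pairs.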

Thanks a lot. Cheers

Re: compare a list against multiple lists
by LanX (Canon) on Mar 19, 2013 at 23:24 UTC
Re: compare a list against multiple lists
by rjt (Deacon) on Mar 19, 2013 at 23:41 UTC

    Is this sort of what you had in mind?

    The Code

        #!/usr/bin/env perl
        use 5.014;
        use warnings;

        my %kw;   # $kw{kw}->{name} = Number of times kw appears in name.txt
        my %name; # Inverse of %kw; $name{name}->{keyword}

        read_words() for <*.txt>;

        # Now we can do all sorts of useful things with the two hashes:
        say "$_ has " . scalar(keys %{ $name{$_} }) . " unique words"
            for sort keys %name;
        say '';

        # Keywords ordered by occurrence count
        for my $kw (sort { scalar(keys %{ $kw{$b} }) <=> scalar(keys %{ $kw{$a} }) } keys %kw) {
            my $count = keys %{ $kw{$kw} };
            printf "%10s appears in %2d file%s: %s\n",
                $kw, $count, $count > 1 ? 's' : ' ',
                join(', ', sort keys %{$kw{$kw}});
        }

        # Pull in the word lists. Expects the file name in $_.
        sub read_words {
            open my $fh, '<', $_ or die "Can't open $_: $!";
            my $name = s/\.txt$//r;
            while (<$fh>) {
                chomp;
                $kw{$_}->{$name}++;
                $name{$name}->{$_}++;
            }
            close $fh;
        }

    Input

    Reads all *.txt files in the current directory. Each text file is expected to contain exactly one keyword per line. For example:

    al.txt:

        abel
        abel
        baker
        camera
        delta
        edward
        fargo
        golfer
        jerky

    Output

        al has 8 unique words
        bob has 7 unique words
        carmen has 6 unique words
        don has 3 unique words
        ed has 3 unique words

             fargo appears in  5 files: al, bob, carmen, don, ed
             jerky appears in  4 files: al, carmen, don, ed
              icon appears in  3 files: carmen, don, ed
            golfer appears in  3 files: al, bob, carmen
            edward appears in  2 files: al, bob
            camera appears in  2 files: al, bob
             delta appears in  2 files: al, bob
            hilton appears in  2 files: bob, carmen
             baker appears in  2 files: al, bob
             kappa appears in  1 file : carmen
              abel appears in  1 file : al

    Efficiency

    Memory: O(2nc) where n is the number of unique keywords, and c is the keyword length.

    Execution: Most operations become near-O(1) (constant time), including counting number of keywords. Obviously looping to display each keyword as I have done will incur n total lookups; this is the best possible order.
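
    To illustrate the kind of near-constant-time operations meant here, the following sketch uses a small hand-built hash in the same shape as %kw. The data and the variable names $in_bob, $file_count and $occurrences are made up for the example:

```perl
use strict;
use warnings;

# Toy data shaped like %kw: $kw{keyword}{file} = number of occurrences.
my %kw = (
    fargo => { al => 1, bob => 1, carmen => 1, don => 1, ed => 1 },
    delta => { al => 1, bob => 2 },
    kappa => { carmen => 1 },
);

# Each of these is a plain hash operation; none of them scans the keyword list.
my $in_bob      = exists $kw{delta}{bob} ? 1 : 0;   # is "delta" in bob.txt?
my $file_count  = keys %{ $kw{fargo} };             # number of files containing "fargo"
my $occurrences = $kw{delta}{bob};                  # times "delta" occurs in bob.txt

print "delta in bob: $in_bob, fargo files: $file_count, delta x$occurrences in bob\n";
```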

      That's an impressive piece of code and so small. Thanks a lot.

      It's almost perfect for what I'm looking for, and I get the output. I need to restrict it to one reference file that I compare against the others, al.txt for example.

      After a bit of customization:

          ## only display for referenced file
          my %counting_list;
          for my $key (keys %{ $name{$ref} }) {
              my $count = $name{$ref}{$key}-1; ## -1 to remove reference file
              #print "n: $key, c $count\n";
              #print Dumper(\$name{$ref});
              if ($verbose == 1) {
                  my $count = keys %{ $kw{$key} };
                  printf "=> '%s' appears in %2d file%s: '%s'\n",
                      $key, $count, $count > 1 ? 's' : ' ',
                      join(', ', sort keys %{$kw{$key}});
                  if ($count == 2) {
                      my $mylist;
                      foreach my $k (keys %{$kw{$key}}) {
                          if (!($k eq $ref)) {
                              $mylist = $k;
                          }
                      }
                      $counting_list{$mylist}++;
                  } elsif ($count == 3) {
                      $counting_list{'2lists'}++;
                  } elsif ($count == 4) {
                      $counting_list{'3lists'}++;
                  } elsif ($count > 4) {
                      $counting_list{'4more'}++;
                  }
              }
          }
          } ## extra closing brace as posted; presumably ends an enclosing block not shown

          ## summary output
          my $max = keys (%filelist);
          $filelist{ $max }   = '2lists';
          $filelist{ $max+1 } = '3lists';
          $filelist{ $max+2 } = '4more';
          foreach my $list (keys %filelist) {
              #print "nolist:X list1a:X list1b:X list2+:X list3+:X\n";
              if ($counting_list{ $filelist{$list} }) {
                  print "$filelist{$list}:$counting_list{ $filelist{$list} } ";
              } else {
                  print "$filelist{$list}:0 ";
              }
          }
          print "\n";

        Glad to help. As for your next question, is this what you're after?

                  abel appears  0 times in  0 files (not counting al.txt)
                 baker appears  1 times in  1 files (not counting al.txt)
                camera appears  1 times in  1 files (not counting al.txt)
                 delta appears  2 times in  1 files (not counting al.txt)
                edward appears  1 times in  1 files (not counting al.txt)
                 fargo appears  4 times in  4 files (not counting al.txt)
                golfer appears  2 times in  2 files (not counting al.txt)
                 jerky appears  3 times in  3 files (not counting al.txt)

        If so, try the following code somewhere below the call to read_words():

            my $name = 'al';
            for my $kw (sort keys %{ $name{$name} }) {
                my $files = -1 + keys %{ $kw{$kw} };   # Do not include original file
                my $count = -$kw{$kw}->{$name};        # Start by subtracting the original file's occurrences
                $count += $kw{$kw}->{$_} for keys %{ $kw{$kw} };
                printf "%10s appears %2d times in %2d files (not counting %s.txt)\n",
                    $kw, $count, $files, $name;
            }

        I confess I really don't see what you're trying to do with the $filelist{$max+1} stuff. If I'm off the mark, above, might I suggest you post some specific sample output you'd like to see, given the original inputs. Pseudo code is fine, but not if we don't know what it's supposed to look like. :-)

Node Type: perlquestion [id://1024389]
Approved by ww