Need to get the intersect of hashes

jbush82 has asked for the wisdom of the Perl Monks concerning the following question:

I've finally had some time to sit down and work on an MD5 scanner script which I posted last week (http://www.perlmonks.org/?node_id=685695), and I'm trying to use suggestions that were given to me to go through it and write it correctly.

I now have two hashes, one containing a list of known bad filenames and their associated MD5 values, and the other containing a list of all files on a given system and their paths (note that both hashes can have multiple values per key).

What I need to do now is get the intersection of the keys of the two hashes, so that I have a list of files on the system that match the known bad file names. My confusion is how to get the intersection correctly so that I can easily use the values for each key from both hashes (because I'll need the paths and the known bad MD5 values to do the actual MD5 check).

Any suggestions or direction is appreciated.

#!C:\perl\bin\perl.exe -w
use strict;

my @known_bad;    # each element is a line within the knownbad.txt file
open(FILE, "knownbad.txt") or die("Unable to open file");
    @known_bad = <FILE>;
close(FILE);

my $bad_data;
my $bad_file;
my $bad_md5;
my $bad_file_array_element;
my %bad_file_md5;

foreach $bad_data (@known_bad) {    # take data from knownbad.txt file and parse it into a hash
    chomp($bad_data);
    ($bad_file, $bad_md5) = split(/\,/, $bad_data);
    push(@{ $bad_file_md5{"$bad_file"} }, "$bad_md5");
}

my $system_file_location;
my $system_file;
my %system_file_data;

#open FILES, "psexec.exe -n 2 \\\\192.168.1.10 cmd.exe \/C dir C\:\\ \/S \/B |" or die;
open FILES, "cmd.exe \/C dir C\:\\ \/S \/B |" or die;    # take data from directory listing and parse it into a hash
    while ( <FILES> ) {
        ( $system_file_location, $system_file ) = m/(.*)[\\\/](.+)/ ? ( $1, $2 ) : ( undef, $_ );
    #    print "$system_file is in the directory $system_file_location\n";
        push(@{ $system_file_data{"$system_file"} }, "$system_file_location");
    }
close FILES;

$VAR1 = {
          'arbies.dll' => [
                            '388B8FBC36A8558587AFC90FB23A3B99'
                          ],
          'psexec.exe' => [
                            '78A2C9D79C21DDFCB7CED32F5EBEC618',
                            '388B8FBC36A8558587AFC90FB23A3B99'
                          ],
          'notepad.exe' => [
                             '388B8FBC36A8558587AFC90FB23A3B99'
                           ],
          'angelfood.txt' => [
                               '388B8FBC36A8558587AFC90FB23A3B99'
                             ]
        };


Re: Need to get the intersect of hashes by grizzley (Chaplain) on May 15, 2008 at 07:39 UTC
Isn't it enough to take every key from one hash and check whether it exists in the other hash? `@keys_existing_in_both = grep defined $hash2{$_}, keys %hash1;`
Re^2: Need to get the intersect of hashes by Anonymous Monk on May 15, 2008 at 16:48 UTC
As noted below, exists should be used here, not defined.
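The difference between the two tests shows up when a key is present but its value is undefined. A minimal sketch follows; the hash names and sample data are made up for illustration and are not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample data; 'notepad.exe' is present in %hash2 but has no value.
my %hash1 = (
    'psexec.exe'  => ['78A2C9D79C21DDFCB7CED32F5EBEC618'],
    'notepad.exe' => ['388B8FBC36A8558587AFC90FB23A3B99'],
    'calc.exe'    => ['388B8FBC36A8558587AFC90FB23A3B99'],
);
my %hash2 = (
    'psexec.exe'  => ['C:\\tools'],
    'notepad.exe' => undef,
);

# exists: true whenever the key is present, whatever its value.
my @by_exists = sort grep { exists $hash2{$_} } keys %hash1;

# defined: true only if the key is present AND its value is defined.
my @by_defined = sort grep { defined $hash2{$_} } keys %hash1;

print "exists:  @by_exists\n";
print "defined: @by_defined\n";
```

With this data the `exists` version reports `notepad.exe psexec.exe`, while the `defined` version reports only `psexec.exe`. For hashes whose values are always array references, as in the question, both give the same keys, but `exists` states the intent more directly.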
Re^2: Need to get the intersect of hashes by jbush82 (Novice) on May 15, 2008 at 07:54 UTC
Yes, that does give me the intersection of the keys. What I need to do is take each key in the intersection array (@keys_existing_in_both in your example) and act on each value associated with that key. That's where I'm confused. For example, let's say the array in your example has the element psexec.exe. What I need to do is look up psexec.exe in the second hash (the one containing the system files) and then run a system command on each value associated with the psexec.exe key. Then I take the result (the MD5 of the file) and compare it to the values under the psexec.exe key in the first hash (known bad data).
Re^3: Need to get the intersect of hashes by moritz (Cardinal) on May 15, 2008 at 08:15 UTC
    for my $k (@keys_existing_in_both) {
        my $exec = $hash2{$k};
        # do something with $k
        my $result = md5($k);
        if ($result ne $hash2{$k}){
            print "Hash sum mismatch for '$k'!\n";
        }
    }

(BTW, in the general case `exists $hash{$key}` checks whether a key exists in a hash, not `defined $hash{$key}`.)
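Since both hashes in the question hold array references, the loop over the shared keys needs an inner loop over the paths and a lookup against the list of known bad sums. A rough sketch of that, assuming the core Digest::MD5 module is used for the checksum instead of an external command; the helper names `file_md5` and `find_bad_files` are hypothetical, and the demo data is a throwaway temp file rather than a real scan:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: uppercase hex MD5 of a file, or undef if unreadable.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return uc $md5;
}

# $bad: file name => [known bad MD5s]; $sys: file name => [directories].
# Returns the full paths whose MD5 matches a known bad value for that name.
sub find_bad_files {
    my ($bad, $sys) = @_;
    my @hits;
    for my $name (grep { exists $sys->{$_} } sort keys %$bad) {
        my %is_bad;
        $is_bad{ uc $_ } = 1 for @{ $bad->{$name} };
        for my $dir (@{ $sys->{$name} }) {
            my $path = File::Spec->catfile($dir, $name);
            my $md5  = file_md5($path);
            push @hits, $path if defined $md5 && $is_bad{$md5};
        }
    }
    return @hits;
}

# Demo: write a small file so the sketch runs without scanning a real disk.
my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, 'psexec.exe');
open my $out, '>', $file or die "Cannot write $file: $!";
binmode $out;
print {$out} "hello";
close $out;

my %bad_file_md5 = (
    'psexec.exe' => [ uc md5_hex("hello"), '388B8FBC36A8558587AFC90FB23A3B99' ],
    'arbies.dll' => ['388B8FBC36A8558587AFC90FB23A3B99'],
);
my %system_file_data = ( 'psexec.exe' => [$dir] );

my @hits = find_bad_files(\%bad_file_md5, \%system_file_data);
print "Known bad: $_\n" for @hits;
```

The sums are upper-cased before comparison because `hexdigest` returns lowercase hex, while the known bad list shown in the question is uppercase.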
Re^4: Need to get the intersect of hashes by jbush82 (Novice) on May 15, 2008 at 08:30 UTC
Re^3: Need to get the intersect of hashes by grizzley (Chaplain) on May 15, 2008 at 08:12 UTC
You don't need to search the hash. If you have a key, you just retrieve the value associated with that key. Can you print both of your structures with Data::Dumper and append the output to your question?
Re^4: Need to get the intersect of hashes by jbush82 (Novice) on May 15, 2008 at 08:28 UTC
Re^5: Need to get the intersect of hashes by moritz (Cardinal) on May 15, 2008 at 08:37 UTC
Re^5: Need to get the intersect of hashes by grizzley (Chaplain) on May 15, 2008 at 08:40 UTC
Re^5: Need to get the intersect of hashes by ikegami (Patriarch) on May 15, 2008 at 15:11 UTC
Re: Need to get the intersect of hashes by pc88mxer (Vicar) on May 15, 2008 at 15:48 UTC
It seems to me that a better way to go about this would be to use the MD5 signatures as the keys of your hash instead of the file names. The issue is that if a file is bad (by which I presume you mean contains a virus), then you would want to know about it regardless of what it was named. Your search would then go like this:

    my %bad_file;
    for each bad file:
        $bad_file{ md5 of bad file } = 1;

    for each system file:
        my $md5 = md5 of system file
        if ($bad_file{$md5}) { report this system file }

There is the possibility of getting some false positives, but that's better than not reporting hits simply because the file names don't agree.
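The pseudocode above could look roughly like this in Perl. To keep the sketch runnable anywhere, file contents are held in memory and hashed with `md5_hex` from the core Digest::MD5 module; the paths and contents are invented for illustration, and a real scan would read each file from disk instead.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-ins for real files: path => content.
my @known_bad_contents = ("evil payload");
my %system_files = (
    'C:\\temp\\renamed.exe' => "evil payload",
    'C:\\temp\\notes.txt'   => "harmless text",
);

# Key the lookup on the MD5 itself, so the file name no longer matters.
my %bad_md5;
$bad_md5{ md5_hex($_) } = 1 for @known_bad_contents;

# Report every system file whose content hashes to a known bad sum.
my @reported = sort grep { $bad_md5{ md5_hex( $system_files{$_} ) } }
                    keys %system_files;

print "Known bad content: $_\n" for @reported;
```

Here the renamed file is still reported because only its content is checked. The trade-off is that every file on the system has to be hashed, not just those whose names match.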

