PerlMonks  

Comparing Two Files

by walkingthecow (Friar)
on Jun 29, 2010 at 22:55 UTC
walkingthecow has asked for the wisdom of the Perl Monks concerning the following question:

Hey monks! Again I ask for your sage wisdom on a question that has been perplexing me :)

I have two files, each of which has content like the following:
    <host id="bobssite" root-directory=".">
      <host-alias>bobssite.autossl.net</host-alias>
      <host-alias>www.bobssite.com</host-alias>
      <host-alias>bobssite.dealer.net</host-alias>
    </host>
    <host id="billssite" root-directory=".">
      <host-alias>billssite.autossl.net</host-alias>
      <host-alias>www.billssite.com</host-alias>
      <host-alias>www.billsothersite.com</host-alias>
      <host-alias>bobssite.dealer.net</host-alias>
    </host>

What I need to do is make sure that all host IDs that exist in file1 also exist in file2; AND for each host ID, all the host-aliases starting with www also match up. So, if www.site1.com and www.site2.com exist under host ID bobjones in file1, then bobjones must exist in file2 AND www.site1.com and www.site2.com must exist for the host ID bobjones.

Here is what I have so far, but then I am not sure how to compare the two. I know the code below is not much help, but I am really at a loss here.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %host_contents;
    my $host_id;

    while (<>) {
        chomp;
        if ($_ =~ /host id="(.*?)"/) {
            $host_id = $1;
        }
        if ($_ =~ m{<host-alias>(www.*?)</host-alias>}) {
            push @{ $host_contents{$host_id} }, $1;
        }
    }

Re: Comparing Two Files
by ikegami (Pope) on Jun 29, 2010 at 23:09 UTC
    Read each file into the following structure
        my %file1_hosts = (
            bobssite => {
                'bobssite.autossl.net' => 1,
                'www.bobssite.com'     => 1,
                'bobssite.dealer.net'  => 1,
            },
            billssite => {
                'billssite.autossl.net'  => 1,
                'www.billssite.com'      => 1,
                'www.billsothersite.com' => 1,
                'bobssite.dealer.net'    => 1,
            },
        );

    Then you can easily compare the two files.

        sub compare_aliases {
            my ($host, $file1_aliases, $file2_aliases) = @_;

            # Union of the aliases seen in either file for this host.
            my %aliases = map { $_ => 1 }
                keys(%$file1_aliases),
                keys(%$file2_aliases);

            for my $alias (keys(%aliases)) {
                if (!$file1_aliases->{$alias}) {
                    ...    # alias is only in file 2
                }
                elsif (!$file2_aliases->{$alias}) {
                    ...    # alias is only in file 1
                }
            }
        }

        sub compare_hosts {
            my ($file1_hosts, $file2_hosts) = @_;

            # Union of the host IDs seen in either file.
            my %hosts = map { $_ => 1 }
                keys(%$file1_hosts),
                keys(%$file2_hosts);

            for my $host (keys(%hosts)) {
                if (!$file1_hosts->{$host}) {
                    ...    # host is only in file 2
                }
                elsif (!$file2_hosts->{$host}) {
                    ...    # host is only in file 1
                }
                else {
                    compare_aliases(
                        $host,
                        $file1_hosts->{$host},
                        $file2_hosts->{$host},
                    );
                }
            }
        }

        compare_hosts(\%file1_hosts, \%file2_hosts);

    Update: Small bug fixes.
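
The reply above describes the target structure but not how to fill it. A minimal sketch of a reader follows, assuming one tag per line as in the sample data; `read_hosts` is a hypothetical name, not something from the thread, and the demo parses an in-memory string rather than a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read host/alias data from an
# open filehandle into { host_id => { alias => 1, ... }, ... }.
# Assumes one tag per line, as in the sample data.
sub read_hosts {
    my ($fh) = @_;
    my (%hosts, $host_id);
    while (my $line = <$fh>) {
        if ($line =~ /<host\s+id="([^"]*)"/) {
            $host_id = $1;
            $hosts{$host_id} ||= {};    # keep hosts that have no aliases
        }
        elsif (defined $host_id
            && $line =~ m{<host-alias>([^<]*)</host-alias>})
        {
            $hosts{$host_id}{$1} = 1;
        }
    }
    return \%hosts;
}

# Demo on an in-memory filehandle instead of a real file.
my $sample = <<'END_XML';
<host id="bobssite" root-directory=".">
  <host-alias>bobssite.autossl.net</host-alias>
  <host-alias>www.bobssite.com</host-alias>
</host>
<host id="billssite" root-directory=".">
  <host-alias>www.billssite.com</host-alias>
</host>
END_XML

open(my $fh, '<', \$sample) or die "Could not open in-memory file ($!)\n";
my $file1_hosts = read_hosts($fh);
close $fh;

# Print each host with its aliases in sorted order.
for my $host (sort keys %$file1_hosts) {
    print "$host: ", join(', ', sort keys %{ $file1_hosts->{$host} }), "\n";
}
```

Calling the same sub once per file would give the two hash references that `compare_hosts` expects. To compare only the www aliases, the alias pattern could be tightened to match names starting with `www`.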

      different approach (just idea):
      # load files into alias => id pairs:
      my $file1 = { 'bobssite.autossl.net' => 'bobssite', ... };
      my $file2 = { ... };

      # aliases present in both files but attached to different host IDs
      my @different_aliases =
          grep { exists $file1->{$_} && $file1->{$_} ne $file2->{$_} }
          keys %$file2;

      # drop every alias that file2 also has (hash slice delete)
      delete @{$file1}{ keys %$file2 };
      my @aliases_not_in_file2 = keys %$file1;

      # unique host IDs of the aliases that are left
      my @ids_not_in_file2 = do {
          my %tmp;
          @tmp{ values %$file1 } = values %$file1;
          keys %tmp;
      };
Re: Comparing Two Files
by ssandv (Hermit) on Jun 29, 2010 at 23:15 UTC

    If you need them to be the same, copy the canonical one to the non-canonical one. If you need to know how they differ, and they are guaranteed to have the same order, use diff or something similar.

    One way I might do it by hand, if I felt I had to, is to make a hash of hashes. Parse the first file into it so that the host aliases are all keys with a value of, say, -1; then read the second file and add 1 to the value of $hash{hostid}{alias}. This gives you 0 if the alias is in both files, -1 if it's only in the first file, and 1 if it's only in the second.
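
The tally described above can be sketched on in-memory data. The sample hosts and aliases below are made up for illustration, and the sketch assumes each alias appears at most once per host within a file; a duplicate would throw the counts off.

```perl
use strict;
use warnings;

# Made-up sample data: host ID => list of www aliases, one hash per file.
my %file1 = (
    bobssite  => [ 'www.bobssite.com' ],
    billssite => [ 'www.billssite.com', 'www.billsothersite.com' ],
);
my %file2 = (
    bobssite  => [ 'www.bobssite.com', 'www.bobsnewsite.com' ],
    billssite => [ 'www.billssite.com' ],
);

my %tally;

# First pass: everything seen in file 1 starts at -1.
for my $host (keys %file1) {
    $tally{$host}{$_} = -1 for @{ $file1{$host} };
}

# Second pass: add 1 for everything seen in file 2.
for my $host (keys %file2) {
    $tally{$host}{$_}++ for @{ $file2{$host} };
}

# 0 = in both files, -1 = only in file 1, 1 = only in file 2.
my %label = ( 0 => 'both', -1 => 'file1 only', 1 => 'file2 only' );
for my $host (sort keys %tally) {
    for my $alias (sort keys %{ $tally{$host} }) {
        print "$host $alias: $label{ $tally{$host}{$alias} }\n";
    }
}
```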

Re: Comparing Two Files
by davido (Archbishop) on Jun 29, 2010 at 23:18 UTC

    I won't get into using a proper parser, aside from saying that anything less than a well-tested parsing module could become difficult to maintain if the data set starts throwing formats that you're not anticipating.

    Getting to your question, the first part is basically asking whether file1's host IDs are a subset of file2's. Tackle that question first. Pull your IDs from file2 into a hash, where the ID is the hash key. The value for each hash key should be a reference to an anonymous hash containing only the aliases that start with 'www.' For example:

        $file2{billssite} = {
            'www.billsothersite.com' => '',
            'www.billssite.com'      => '',
        };

    You don't really need the second-level hash to have values; you're only interested in the keys for quick lookups. Once you've pulled file2 into a hash of this nature, the next step is to iterate through file1. When you process an ID, check whether it's in %file2. If not, fail. Next, process each host-alias that falls under the ID you're processing in file1. As you do so, keep a count so that you can be sure the quantity matches the number of keys in the second level of the HoH %file2. For each host-alias in file1, check your HoH %file2 to see if that key exists under your current ID. If every key exists, and your key count matches your file1 host-alias count for that ID, that record passes. If at any point there is a mismatch (not enough host-aliases, or a host-alias from file1 not found in file2), you can last or die out of your loop and fail without continuing to test.
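
A rough sketch of that check, assuming both files have already been parsed into hashes of hashes holding only the www aliases. `file1_matches_file2` is a hypothetical name, the sample data is made up, and the sub returns early instead of using last or die.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every host ID in %$file1 is in %$file2
# with an identical set of www aliases, otherwise 0.
# Both arguments are hashes of hashes: { host_id => { alias => '' } }.
sub file1_matches_file2 {
    my ($file1, $file2) = @_;
    for my $id (keys %$file1) {
        return 0 unless exists $file2->{$id};    # ID missing from file2

        my $count = 0;
        for my $alias (keys %{ $file1->{$id} }) {
            # alias from file1 not found under the same ID in file2
            return 0 unless exists $file2->{$id}{$alias};
            $count++;
        }

        # Equal counts rule out extra www aliases in file2.
        return 0 unless $count == keys %{ $file2->{$id} };
    }
    return 1;
}

# Made-up sample data for illustration.
my %file2 = (
    billssite => { 'www.billssite.com' => '', 'www.billsothersite.com' => '' },
    bobssite  => { 'www.bobssite.com' => '' },
);
my %good = ( bobssite  => { 'www.bobssite.com'  => '' } );
my %bad  = ( billssite => { 'www.billssite.com' => '' } );    # one alias short

print file1_matches_file2(\%good, \%file2) ? "pass\n" : "fail\n";
print file1_matches_file2(\%bad,  \%file2) ? "pass\n" : "fail\n";
```

The count comparison makes this stricter than a pure subset test: a host whose file2 entry has an extra www alias fails. Dropping that line would allow file2 to carry additional aliases.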

    Once you visualize the datastructure the rest should come easy.


    Dave

Re: Comparing Two Files
by walkingthecow (Friar) on Jun 30, 2010 at 16:18 UTC
    Just want to thank you all for your help! I got it working, and the code I used is below. Unfortunately, I don't think it is the best way to do it, but it does work.
        #!/usr/bin/perl
        use strict;
        use warnings;
        use Getopt::Long;
        use Pod::Usage;

        my %alias_hash;
        my %host_hash;
        my %host_contents;
        my %seen;
        my $host_id;
        my $file1;
        my $file2;

        GetOptions(
            'h|help' => sub {
                pod2usage( { -verbose => 1, -input => \*DATA, } );
                exit;
            },
            'm|man' => sub {
                pod2usage( { -verbose => 2, -input => \*DATA, } );
                exit;
            },
            'f1|file1=s' => \$file1,
            'f2|file2=s' => \$file2,
        );

        pod2usage( -verbose => 1 ) unless $file1 and $file2;

        # First file: mark every host ID and www alias with -1.
        open(my $file1_handle, '<', $file1) or die "Could not open $file1 ($!)\n";
        while (my $line=<$file1_handle>) {
            chomp $line;
            if ($line =~ /host id="(.*?)"/) {
                $host_id = $1;
            }
            if ($line =~ m{<host-alias>(www.*?)</host-alias>}) {
                $alias_hash{$host_id}{$1} = -1;
            }
            if (!$seen{$host_id}) {
                $host_hash{$host_id} = -1;
            }
            $seen{$host_id} = 1;
        }
        close $file1_handle;

        %seen=();

        # Second file: add 1, so 0 means "in both" and 1 means "only in file 2".
        open(my $file2_handle, '<', $file2) or die "Could not open $file2 ($!)\n";
        while (my $line=<$file2_handle>) {
            chomp $line;
            if ($line =~ /host id="(.*?)"/) {
                $host_id = $1;
            }
            if ($line =~ m{<host-alias>(www.*?)</host-alias>}) {
                $alias_hash{$host_id}{$1}++;
            }
            if (!$seen{$host_id}) {
                $host_hash{$host_id}++;
            }
            $seen{$host_id} = 1;
        }
        close $file2_handle;

        for my $k1 ( keys %alias_hash ) {
            for my $k2 ( keys %{ $alias_hash{$k1} } ) {
                print "$k2 exists in only $file1\n" if $alias_hash{$k1}{$k2} == -1;
                print "$k2 exists in only $file2\n" if $alias_hash{$k1}{$k2} == 1;
            }
        }

        while ( my ($key, $value) = each(%host_hash) ) {
            print "$key exists in only $file1\n" if $host_hash{$key} == -1;
            print "$key exists in only $file2\n" if $host_hash{$key} == 1;
        }
    P.S.: You may notice that I use Pod::Usage and have not put anything in there for DATA. I just haven't gotten around to it yet, but it doesn't stop this script from working in any way ;)
Re: Comparing Two Files
by Proclus (Beadle) on Jul 01, 2010 at 09:15 UTC
    The files in question are XML. Why not use one of the nice Perl XML modules and convert it into a data structure as ikegami suggested?
    One disadvantage of this approach would be slower speed and higher memory usage if the file is very large (say, 10MB or more?).
