http://www.perlmonks.org?node_id=148992

ybiC has asked for the wisdom of the Perl Monks concerning the following question:

The following chunk is intended to parse 2 logfiles for their diff, and write the results as a third file.   One original logfile contains combined STDOUT+STDERR and the other contains only STDERR.

But it doesn't do that at all.   8^(   Rather, the resulting (small-relative-to-either-original log) file contains an out-of-order hodgepodge from both originals.

What, pray tell, am I doing wrong here?   davorg's Array::Compare doesn't look to do what I need.   I've looked at bikeNomad/Dominus' Algorithm::Diff, but can't make heads or tails of it.

#!/usr/bin/perl -w
# rlog.pl
$|++;          # stdout hot
use strict;    # avoid d'oh! bugs
require 5;     # for following modules

my $logDir  = '/cygdrive/c/Rsync/logs';
my $allLog  = "$logDir/200203021258.all";
my $errLog  = "$logDir/200203021258.err";
my $fileLog = "$logDir/200203021258.fil";

# slurp two existing logs into arrays:
open ERRLOG, "$errLog" or die "Error opening $errLog: $!";
my @err = <ERRLOG>;
close ERRLOG;
open ALLLOG, "$allLog" or die "Error opening $allLog: $!";
my @all = <ALLLOG>;
close ALLLOG;

# create @file from diff of ERR and ALL:
my %count;
my @file;
$count{$_}++ for ( @all, @err );
for ( keys %count ) {
    push @file, $_ unless ( $count{$_} == 2 );
}

# write $fileLog from @file:
open FILELOG, "> $fileLog" or die "Error opening $fileLog: $!";
for (@file) {
    print FILELOG;
}
close FILELOG;

    cheers,
    Don
    striving toward Perl Adept
    (it's pronounced "why-bick")

Replies are listed 'Best First'.
Re: Save 2 files' diff as 3rd file
by vroom (His Eminence) on Mar 03, 2002 at 18:03 UTC
Re: Save 2 files' diff as 3rd file
by mattr (Curate) on Mar 03, 2002 at 17:47 UTC
    The file is small because of the many duplicate keys, it would seem.

    The output is out of order because a hash is not meant to preserve the order of key creation. You will need to keep a separate index of timestamps and sort by it if that is what you want. But even so, for a given log entry line with several duplicates, it will only pick up the last one. So the ideas of "all unique keys" and "chronological order" conflict. You could sort alphabetically easily enough, though.
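    The separate-index idea can be done in core Perl by recording each line's first-seen position while counting, then sorting the surviving keys by that index. A minimal sketch on made-up sample data (the `//=` defined-or assignment needs Perl 5.10+):

```perl
use strict;
use warnings;

# hypothetical stand-ins for the slurped logs
my @all = ( "file1\n", "error1\n", "file2\n" );
my @err = ( "error1\n" );

my ( %count, %first_seen );
my $i = 0;
for ( @all, @err ) {
    $count{$_}++;
    $first_seen{$_} //= $i++;    # remember where each line first appeared
}

# keep lines seen exactly once, sorted back into their original order:
my @file = sort { $first_seen{$a} <=> $first_seen{$b} }
           grep { $count{$_} == 1 } keys %count;
print @file;    # file1, file2 -- in @all's order, not hash order
```

    The hash still does the counting; only the final ordering comes from the side index.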

    Algorithm::Diff is going to do a diff, which tells you what changes are needed to turn one array into another. It is not a unique keys intersection.

    Hard to see exactly what you want, since when you mention log files I expect a timestamp on each line, which would make every line unique. So I'll suppose you have no timestamps to worry about. But you do care about chronological order in each log file, and I assume you want to subtract the elements of err from all.

    You can dump @err into a hash (call it %errhash) just to speed up lookups, but you still need to step through each array because even if it looks like an entry in %errhash matches one in @all, it might really be chronologically much later. So just use a hash for an exists test but step through the array to maintain order. It is still a difficult problem because you cannot know about which segment of err matches which part of all easily. This is the problem addressed by the LCS test in Algorithm::Diff.
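    A sketch of that shape, on made-up data: the hash is used only for lookups, while @all is walked in order. Note it can only swallow as many copies of a line as @err contains, which is exactly where the segment-matching ambiguity comes in:

```perl
use strict;
use warnings;

# hypothetical sample data standing in for the two logs
my @all = ( "a\n", "x\n", "b\n", "x\n", "c\n" );
my @err = ( "x\n" );

my %errcount;
$errcount{$_}++ for @err;    # how many copies of each line to drop

my @result;
for my $line (@all) {
    if ( $errcount{$line} ) {
        $errcount{$line}--;    # swallow one matching copy
        next;
    }
    push @result, $line;       # order of @all is preserved
}
print @result;    # a, b, x, c -- only the first "x" is removed
```

    Whether dropping the *first* matching copy is right is precisely what a real LCS-based diff decides for you.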

    Of course if you want to erase every appearance in @all of each element in @err, that is easier. One way to do it (untested) would be to

    my %allhash;
    my @result;
    $allhash{$_} = 1 foreach (@all);
    delete @allhash{@err};    # remove els for which key in err
    foreach (@all) {
        push (@result, $_) if exists $allhash{$_};
    }
    print "$_\n" foreach (sort @result);
      You can always use the Tie::IxHash module to preserve the insertion order of hash entries. It's not included in the perl distribution but you can find it on CPAN.
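      If you'd rather avoid the CPAN dependency, the core-Perl idiom Tie::IxHash wraps is simply a plain hash plus a parallel array of keys in insertion order. A minimal sketch:

```perl
use strict;
use warnings;

my %count;
my @order;    # keys in the order they were first inserted

for my $line ( "beta\n", "alpha\n", "beta\n", "gamma\n" ) {
    push @order, $line unless exists $count{$line};
    $count{$line}++;
}

# iterate in insertion order instead of the pseudorandom keys %count:
for my $line (@order) {
    print "$count{$line}x $line";
}
```

      Tie::IxHash does the same bookkeeping behind a normal hash interface, at the cost of the tie overhead.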
Re: Save 2 files' diff as 3rd file
by Zaxo (Archbishop) on Mar 03, 2002 at 18:37 UTC

    Here's a version. It's not a general diff, but relies on the error log containing everything to be rejected. It also depends on both logs having their error messages in the same order. That's reasonable, but may not be guaranteed: if more than one server is logging, there would be a problem.

    #!/usr/bin/perl -w
    # rlog.pl
    $|++;          # stdout hot
    use strict;    # avoid d'oh! bugs
    require 5;     # for following modules

    my $logDir  = '/cygdrive/c/Rsync/logs';
    my $allLog  = "$logDir/200203021258.all";
    my $errLog  = "$logDir/200203021258.err";
    my $fileLog = "$logDir/200203021258.fil";

    # open, but don't slurp
    open ALL, "< $allLog" or die $!;
    open ERR, "< $errLog" or die $!;
    open FLE, "> $fileLog" or die $!;

    while (<ERR>) {
        {
            local $/ = $_;
            my $diffs = <ALL> || "Alert: '$_' from $errLog not found\n";
            chomp $diffs;
            print FLE $diffs;
        }
    }
    print FLE while <ALL>;

    close FLE or die $!;
    close ERR or die $!;
    close ALL or die $!;
    We go through the error log one line at a time. For each line, we look forward in the all-log until we find it, by diamond op and $/ magic. We delete the matched error line with chomp and print to the new log extract file. When done with errors, we tack on the rest of the log. Untested, but it should work.
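    The $/ trick is worth seeing in isolation. A self-contained sketch on a made-up in-memory log (in-memory filehandles need Perl 5.8+): with the record separator set to the error line, the diamond operator reads everything up to and including that line, and chomp then strips exactly the separator, i.e. the error line itself.

```perl
use strict;
use warnings;

my $log = "file1\nfile2\nerror1\nfile3\n";
open my $fh, '<', \$log or die $!;    # in-memory filehandle

my ( $chunk, $rest );
{
    local $/ = "error1\n";    # record separator = the error line
    $chunk = <$fh>;           # reads "file1\nfile2\nerror1\n"
    chomp $chunk;             # chomp removes $/, i.e. the error line
}
# $/ is restored here, so reads go back to line-by-line
$rest = <$fh>;

print $chunk;    # file1, file2
print $rest;     # file3
```

    One caveat: chomp only removes $/ when the string actually ends with it, so if the error line never appears, the read returns the rest of the file (or undef at EOF, which triggers the Alert above) and nothing is stripped.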

    Update: Added a more graceful failure mode if an error line is missing.

    After Compline,
    Zaxo

Re: Save 2 files' diff as 3rd file
by shotgunefx (Parson) on Mar 03, 2002 at 23:13 UTC
    What do the log entries look like? Are they timestamped? If you could post a line or two, it would be helpful.

    -Lee

    "To be civilized is to deny one's nature."
Re: Save 2 files' diff as 3rd file
by abaxaba (Hermit) on Mar 04, 2002 at 05:14 UTC
    I like the sledgehammer method:
    #!/usr/bin/perl
    $logA    = "/path/to/logfile1";
    $logB    = "/path/to/logfile2";
    $outFile = "/path/to/resultant/File";
    open (DIFF, "diff $logA $logB |") or die "Can't run diff: $!";
    open (OUT, ">$outFile")           or die "Can't open $outFile: $!";
    select (OUT);
    while (<DIFF>) { print; }
    close (OUT);
    close (DIFF);
Re: Save 2 files' diff as 3rd file (working code)
by ybiC (Prior) on Mar 04, 2002 at 17:25 UTC
    A round of thank you's and upvotes to all the Really Fine Monks who responded to my question.   8^)

    mattr hit the nail on the head, when he said "erase every appearance in @all of each element in @err", as did Zaxo with "relies on error log to contain everything to be rejected... dependency on both logs having error messages in same order."

    Code below works fine on Cygwin and Debian with included sample data, but blows it on real rsync.pl data.   ~daaang~

        cheers,
        Don
        striving toward Perl Adept
        (it's pronounced "why-bick")
    #!/usr/bin/perl -w
    # rtest9.pl
    $|++;             # stdout hot
    use strict;       # avoid d'oh! bugs
    require 5;        # for following modules
    use Cwd 'chdir';     # move to particular directory
    use Tie::IxHash;     # insertion-order retrieval for hash

    # my $logDir = '/cygdrive/C/Rsync/logs';
    my $logDir  = '/home/joe/rtest';
    my $allLog  = "all.log";
    my $errLog  = "err.log";
    my $dirLog  = "dir.log";
    my $fileLog = "file.log";

    chdir "$logDir";
    open ALL, "< $allLog" or die $!;
    my @all = <ALL>;
    close ALL or die $!;
    open ERR, "< $errLog" or die $!;
    my @err = <ERR>;
    close ERR or die $!;

    tie my %allCount, "Tie::IxHash";
    $allCount{$_}++ for( @all, @err );

    my (@dirfile, @errchk);
    for( keys %allCount ) {
        if ( $allCount{$_} == 1 ) {
            push @dirfile, $_;
        }
        else {
            push @errchk, $_;
        }
    }

    my (@dir, @file);
    for(@dirfile){
        if ( $_ =~ /\// ) {
            push @dir, $_;
        }
        else {
            push @file, $_;
        }
    }

    open DIR, "> $dirLog" or die $!;
    print DIR "$_" for(@dir);
    close DIR or die $!;
    open FIL, "> $fileLog" or die $!;
    print FIL "$_" for(@file);
    close FIL or die $!;

    =pod

    == all.log ==
    file1
    file2
    error1
    error2
    file3
    dir1/
    dir2/

    == err.log ==
    error1
    error2

    =cut

    And here's my latest efforts.   Now using File::Rsync.   Not completely out of the woods, but making progress...

    #!/usr/bin/perl -w
    # rsf.pl
    # pod at tail
    $|++;          # stdout hot
    use strict;    # avoid d'oh! bugs
    require 5;     # for following modules
    use File::Rsync;    # wrapper for rsync directory sync tool

    my $logDir = '/home/joe/rtest';
    my $outLog = "$logDir/out.log";
    my $errLog = "$logDir/err.log";

    ## RECEIVE
    my $srchost = 'indy:';
    # my @src = qw(perls debs);
    my $src  = 'perls';
    my $dest = '/home/joe/rtest/';

    ## SEND
    # my $srchost = '';
    # my @src = qw(/home/joe/rtest /usr/local/perls);
    # my $dest = 'indy::Test';

    my $obj = File::Rsync->new({
        srchost => $srchost,
        src     => $src,
        # src     => \@src,
        dest    => $dest,
    });

    $obj->defopts({
        archive   => 1,
        verbose   => 1,
        recursive => 1,
        owner     => 1,
        perms     => 1,
        group     => 1,
        times     => 1,
        debug     => 0,
        compress  => 0,
        'dry-run' => 0,
    });

    $obj->exec or warn "Rsync notice - check logs\n";

    open OUT, "> $outLog" or die $!;
    open ERR, "> $errLog" or die $!;
    my @out = $obj->out;
    print OUT for(@out);
    my @err = $obj->err;
    my $stat  = $obj->status;
    my $rstat = $obj->realstatus;
    print ERR for(@err);
    print OUT "status = $stat\n";
    print OUT "realstatus = $rstat\n";
    close OUT or die $!;
    close ERR or die $!;

    =head1 UPDATE

    2002-03-11 16:45 CST
       Variableized object options
       Test on Cygwin
          rsh=>'/usr/bin/ssh' errs, ignored
       Debug sending to rsync server
          user@rhost::module only with rsh=>'/usr/local/bin/ssh'
       "@list=$obj->list" is only for no "dest"

    2002-03-09 22:15 CST
       Initial working code

    =head1 TODO

    Figure out File::Rsync syntax to receive from multiple rsync modules
       @ERROR: Unknown module 'mod1 mod2'
       There is also a method for passing multiple source paths to a
       remote system by passing the remote hostname to the srchost key
       and passing an array ref to the source key... single trailing
       colon on the name...
    Loopify somehow for send+receive
    Re-test on Cygwin
    Pod::Usage
    Getopt::Long ?
    Logfile::Rotate ?
    ? Parallel::ForkManager ?

    =cut