Re^5: redirect output from a command to another command

Further investigation turned up some interesting results.

First, the test conditions, which follow from what I actually need the algorithm for. I want to compare words (strings of characters separated by whitespace) in two files. Changes in whitespace do not matter to me, so every run of consecutive whitespace is collapsed to a single \n character. A newline was the necessary choice when preprocessing for diffutils diff, which compares line by line, and for consistency I kept it when using the CPAN module.
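To make the preprocessing concrete, here is a minimal sketch of the whitespace collapse on two short strings. The helper name words_per_line is made up for illustration; the substitution is the same one used in the scripts further down.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of whitespace to a single newline,
# so that every word ends up on its own line.
sub words_per_line {
    my ($text) = @_;
    $text =~ s/\s+/\n/g;
    return $text;
}

# Same words, different whitespace: identical after preprocessing.
my $left  = words_per_line("the  quick\tbrown\n\nfox");
my $right = words_per_line("the quick brown fox");
print $left eq $right ? "same words\n" : "different words\n";
```

After this step a line-oriented diff is effectively a word-oriented diff.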

For the CPAN module method, I used the example code from the CPAN Algorithm::Diff webpage to perform the actual comparison. The files were read into scalars, the substitutions were done, then the modified scalars were split into arrays at the \n's. These arrays are then passed to the example code.

For the diffutils method, the files were read into scalars, the substitutions were done, then the modified scalars were written to temp files. The names of the temp files were passed as arguments to the diff command, which was executed from the script.
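For anyone wanting to reproduce the diffutils method without hard-coded temp file names, here is a rough sketch using the core File::Temp module. This is not the code I timed (that is posted below). The helper names write_temp and diff_strings are made up for illustration, and the sketch assumes a diff command is on the PATH.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical helper: write a string to a fresh temp file, return its path.
# UNLINK => 1 removes the file when the program exits.
sub write_temp {
    my ($text) = @_;
    my ($fh, $path) = tempfile( 'diff_XXXXXX', TMPDIR => 1, UNLINK => 1 );
    print {$fh} $text;
    close($fh) or die "unable to close $path: $!";
    return $path;
}

# Hypothetical helper: run diff on two preprocessed strings, return its output.
sub diff_strings {
    my ($text_1, $text_2) = @_;
    my $path_1 = write_temp($text_1);
    my $path_2 = write_temp($text_2);
    # List-form pipe open avoids passing the paths through a shell.
    open(my $pipe, '-|', 'diff', '--suppress-common-lines', '-y', $path_1, $path_2)
        or die "unable to run diff: $!";
    local $/;                      # slurp mode, restored when the sub returns
    my $output = <$pipe>;
    close($pipe);                  # diff exits 1 when files differ, so no "or die"
    return defined $output ? $output : '';
}

print diff_strings("one\ntwo\nthree\n", "one\n2\nthree\n");
```

The per-call cost of creating the temp files and spawning diff is part of what gets measured with this approach, just as it is in my timed scripts.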

Ultimately, I want to do a recursive comparison of file hierarchies. To get cleaner data on the two algorithms, though, I first ran tests comparing the same two files many times, then compared the timings for each algorithm. This test comes closest to comparing strictly the algorithms themselves (with one possibly disputable exception, which I elaborate on below*). The diffutils method was still quite a bit faster, but not by the dramatic 45-fold margin I observed yesterday; the difference was roughly 3.4-fold.

However, I still needed to test what I will really be doing, which is a recursive comparison. These tests showed the CPAN module taking 55 times as long as diffutils. I do not understand the reason for such disproportionate results.

I have carefully laid out my methods and code below.

-----------

Tested on iMac G5 1.8 GHz PPC, 1 GB ram, OS 10.4.11.

Algorithm::Diff version 1.1902
diffutils version 2.8.1

The first test compared the algorithms alone, running the same two files 1000 times. Each algorithm was timed in two batches of 5 runs, alternating between the two algorithms. The files used were html files of approx. 28 kB each. They were not identical.

Results were:


CPAN Algorithm::Diff method:

    time to compare same two files 1000 times:

    40.28 sec
    39.68 sec
    39.97 sec
    40.17 sec
    39.70 sec
    
    39.71 sec
    39.61 sec
    39.82 sec
    39.71 sec
    39.60 sec
    
    avg. = 39.83 sec

diffutils method:

    time to compare same two files 1000 times:
    
    11.68 sec
    11.72 sec
    11.63 sec
    11.69 sec
    11.78 sec
    
    11.62 sec
    11.62 sec
    11.78 sec
    11.66 sec
    11.62 sec
    
    avg. = 11.68 sec

The second test was a recursive comparison of two directories, each containing 105 html files. About half of the files were not identical. Each tree totaled approx. 2.6 MB. The script iterates over the whole recursion 10 times. I ran this test 10 times using the diffutils method and 2 times using the Algorithm::Diff method. Not being comfortable with my cpu running at the rail for 15 minutes, I then ran the Algorithm::Diff method with a single iteration over the recursion, gave it a rest, and repeated, 8 times in all. I alternated between the two algorithms.

Results were:


CPAN Algorithm::Diff method:

    time to compare 105 file pairs, 10 times:
    
    926.7 sec
    924.2 sec
    
    time to compare 105 file pairs, 1 time:
    
    91.02 sec
    91.09 sec
    91.09 sec
    91.10 sec
    90.93 sec
    91.19 sec
    93.42 sec
    91.58 sec

    avg time for a single comparison of 105 file pairs:
    
    92.23 secs

diffutils method:

    time to compare 105 file pairs, 10 times:
    
    16.76 sec
    16.65 sec
    16.67 sec
    16.68 sec
    16.85 sec
    
    16.71 sec
    16.72 sec
    16.80 sec
    16.92 sec
    16.84 sec
    
    avg. time for single comparison of 105 file pairs:
    
    1.676 secs

Summary of tests:


repeatedly compare same two files 1000 times:

    average times:
    
    Algorithm::Diff    39.83  sec
    diffutils          11.68  sec

compare 105 different pairs of files 1 time:

    average times:
    
    Algorithm::Diff    92.23  sec
    diffutils           1.676 sec
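
For reference, the ratios and per-pair times implied by these averages can be checked with a few lines of Perl. The helper name slowdown is made up for illustration; the numbers are copied from the summary above.

```perl
use strict;
use warnings;

# Hypothetical helper: how many times slower one method is than the other.
sub slowdown {
    my ($slow_sec, $fast_sec) = @_;
    return $slow_sec / $fast_sec;
}

# Averages copied from the summary of tests.
printf "same two files, 1000 times: %.1fx\n", slowdown(39.83, 11.68);
printf "105 file pairs, 1 time:     %.1fx\n", slowdown(92.23, 1.676);

# Time per single file-pair comparison, in milliseconds.
printf "Algorithm::Diff per pair: %.1f ms vs %.1f ms\n",
    39.83 / 1000 * 1000, 92.23 / 105 * 1000;
printf "diffutils per pair:       %.1f ms vs %.1f ms\n",
    11.68 / 1000 * 1000, 1.676 / 105 * 1000;
```

Seen per pair, the diffutils time changes little between the two tests, while the Algorithm::Diff time per pair is far higher in the recursive test. That is the part I cannot explain.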

*Someone may object that in the first set of tests, the Algorithm::Diff method splits the text strings on every iteration of the timing loop, so the test does not measure the algorithm alone. While this may be true, I did it this way to get a 1 to 1 comparison in the context of what I am trying to accomplish. That is, I wanted the same framework code, with the two methods simply interchangeable.

For the sake of fairness to the algorithm, though, I moved the split out of the timing loop and ran the test 5 more times; the average time for 1000 iterations dropped to 26.92 sec. (I refrain from posting all the data on that.) Note that in the second test the split has to stay in the loop, since different files are compared every time.
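To illustrate what moving the split out of the timing loop means, here is a small sketch. It is not the benchmark itself: the comparison is replaced by a trivial stand-in, the input is synthetic, and the helper name time_loop is made up for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Hypothetical helper: run $code $iterations times, return elapsed seconds.
sub time_loop {
    my ($iterations, $code) = @_;
    my $start = time();
    $code->() for 1 .. $iterations;
    return time() - $start;
}

# Synthetic stand-in for a preprocessed file: one word per line.
my $filestring_1 = join "\n", map { "word$_" } 1 .. 5000;

# Split inside the loop: the split is timed along with the work.
my $inside = time_loop(1000, sub {
    my @seq1  = split(/\n/, $filestring_1);
    my $count = @seq1;             # stand-in for the comparison itself
});

# Split hoisted out of the loop: only the stand-in work is timed.
my @seq1 = split(/\n/, $filestring_1);
my $outside = time_loop(1000, sub {
    my $count = @seq1;
});

printf "split inside loop:  %.3f sec\n", $inside;
printf "split outside loop: %.3f sec\n", $outside;
```

The gap between the two figures is the cost of the split alone, which is the portion I removed to get the 26.92 sec average.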

---------------

Here is the code I used in the tests:

Code used to compare the same two files 1000 times:


## this is the framework:
## One of the two code snippets below is
## substituted for ### DIFF ALGORITHM HERE..

#!/usr/bin/perl

use strict;
use lib "/Users/allasso/AWS/utility/cpan/lib/perl5/site_perl";
require Algorithm::Diff;
use Time::HiRes qw( time );

my($source_path_1, $source_path_2) = @ARGV;

my $holdRS = $/;
local $/;
if (! open(FH, $source_path_1)) {
    print "unable to open source file 1: $source_path_1\n";
}
my $filestring_1 = <FH>;
$/ = $holdRS;
close(FH);

$holdRS = $/;
local $/;
if (! open(FH, $source_path_2)) {
    print "unable to open source file 2: $source_path_2\n";
}
my $filestring_2 = <FH>;
$/ = $holdRS;
close(FH);

$filestring_1 =~ s@\s+@\n@g;
$filestring_2 =~ s@\s+@\n@g;

my $time = time();

for my $count (0..999) {
    
   ### DIFF ALGORITHM HERE..

}

my $time_4sig = time() - $time + .005;
$time_4sig =~ s@^(.....).*@$1@;
print STDERR "\n\net:  ".$time_4sig."\n";

exit;
    

## this is the CPAN Algorithm::Diff code:

    my @seq1 = split(/\n/, $filestring_1);
    my @seq2 = split(/\n/, $filestring_2);

    my $diff = Algorithm::Diff->new( \@seq1, \@seq2 );
    
    $diff->Base( 1 );   # Return line numbers, not indices
    while(  $diff->Next()  ) {
        next   if  $diff->Same();
        my $sep = '';
        if(  ! $diff->Items(2)  ) {
            printf "%d,%dd%d\n",
                $diff->Get(qw( Min1 Max1 Max2 ));
        } elsif(  ! $diff->Items(1)  ) {
            printf "%da%d,%d\n",
                $diff->Get(qw( Max1 Min2 Max2 ));
        } else {
            $sep = "\n---\n";
            printf "%d,%dc%d,%d\n",
                $diff->Get(qw( Min1 Max1 Min2 Max2 ));
        }
        print "< $_"   for  $diff->Items(1);
        print $sep;
        print "> $_\n"   for  $diff->Items(2);
    }


## this is the diffutils code:

    if (! open(FH, ">/tmp/diff_774885959483_1")) {
        print "unable to open temporary file\n";
    }
    print FH "$filestring_1";
    close (FH);
    
    if (! open(FH, ">/tmp/diff_774885959483_2")) {
        print "unable to open temporary file\n";
    }
    print FH "$filestring_2";
    close (FH);
    
    print "$source_path_1  :::  $source_path_2\n";

    print `diff --suppress-common-lines -y /tmp/diff_774885959483_1 /tmp/diff_774885959483_2`;

This is the framework for the recursive comparison of 105 files, in which one of the two code snippets posted directly above was substituted for ### DIFF ALGORITHM HERE..


#!/usr/bin/perl

use strict;
use lib "/Users/allasso/AWS/utility/cpan/lib/perl5/site_perl";
require Algorithm::Diff;
use Time::HiRes qw( time );

my($source_path_1, $source_path_2) = @ARGV;

$source_path_1 =~ s@\x2f*$@@;
$source_path_2 =~ s@\x2f*$@@;

my @src_list_1 = `find $source_path_1 -name "*.htm*"`;
my @src_list_2 = `find $source_path_2 -name "*.htm*"`;

my $time = time();

for my $count (0..9) {
    
    my $list_cnt = 0;
    
    for my $file_src_1 (@src_list_1) {
    
        my $file_src_2 = $src_list_2[$list_cnt++];
       
        chomp $file_src_1;
        chomp $file_src_2;
    
        my $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_1)) {
            print "unable to open source file 1: $file_src_1\n";
        }
        my $filestring_1 = <FH>;
        $/ = $holdRS;
        close(FH);
        
        $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_2)) {
            print "unable to open source file 2: $file_src_2\n";
        }
        my $filestring_2 = <FH>;
        $/ = $holdRS;
        close(FH);
        
        $filestring_1 =~ s@\s+@\n@g;
        $filestring_2 =~ s@\s+@\n@g;
    
        ### DIFF ALGORITHM HERE..

    }
}

my $time_4sig = time() - $time + .005;
$time_4sig =~ s@^(.....).*@$1@;
print STDERR "\n\net:  ".$time_4sig."\n";

exit;

I am also posting the full script for each method of the recursive comparison (the test that produced the curiously slow times with the CPAN module), copied and pasted directly after running the tests. I am doing this to eliminate any question about whether the posted code reflects the actual test:


## full recursive script using CPAN Algorithm::Diff :

#!/usr/bin/perl

use strict;
use lib "/Users/allasso/AWS/utility/cpan/lib/perl5/site_perl";
require Algorithm::Diff;
use Time::HiRes qw( time );

my($source_path_1, $source_path_2) = @ARGV;

$source_path_1 =~ s@\x2f*$@@;
$source_path_2 =~ s@\x2f*$@@;

my @src_list_1 = `find $source_path_1 -name "*.htm*"`;
my @src_list_2 = `find $source_path_2 -name "*.htm*"`;

my $time = time();

for my $count (0..9) {
    
    my $list_cnt = 0;
    
    for my $file_src_1 (@src_list_1) {
    
        my $file_src_2 = $src_list_2[$list_cnt++];
       
        chomp $file_src_1;
        chomp $file_src_2;
    
        my $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_1)) {
            print "unable to open source file 1: $file_src_1\n";
        }
        my $filestring_1 = <FH>;
        $/ = $holdRS;
        close(FH);
        
        $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_2)) {
            print "unable to open source file 2: $file_src_2\n";
        }
        my $filestring_2 = <FH>;
        $/ = $holdRS;
        close(FH);
        
        $filestring_1 =~ s@\s+@\n@g;
        $filestring_2 =~ s@\s+@\n@g;
        
            ## begin CPAN algorithm:
    
            my @seq1 = split(/\n/, $filestring_1);
            my @seq2 = split(/\n/, $filestring_2);
        
            my $diff = Algorithm::Diff->new( \@seq1, \@seq2 );
            
            $diff->Base( 1 );   # Return line numbers, not indices
            while(  $diff->Next()  ) {
                next   if  $diff->Same();
                my $sep = '';
                if(  ! $diff->Items(2)  ) {
                    printf "%d,%dd%d\n",
                        $diff->Get(qw( Min1 Max1 Max2 ));
                } elsif(  ! $diff->Items(1)  ) {
                    printf "%da%d,%d\n",
                        $diff->Get(qw( Max1 Min2 Max2 ));
                } else {
                    $sep = "\n---\n";
                    printf "%d,%dc%d,%d\n",
                        $diff->Get(qw( Min1 Max1 Min2 Max2 ));
                }
                print "< $_"   for  $diff->Items(1);
                print $sep;
                print "> $_\n"   for  $diff->Items(2);
            }
            
            ## end CPAN algorithm
    }
}

my $time_4sig = time() - $time + .005;
$time_4sig =~ s@^(.....).*@$1@;
print STDERR "\n\net:  ".$time_4sig."\n";

exit;
        

## full recursive script using diffutils :

#!/usr/bin/perl

use strict;
use lib "/Users/allasso/AWS/utility/cpan/lib/perl5/site_perl";
require Algorithm::Diff;
use Time::HiRes qw( time );

my($source_path_1, $source_path_2) = @ARGV;

$source_path_1 =~ s@\x2f*$@@;
$source_path_2 =~ s@\x2f*$@@;

my @src_list_1 = `find $source_path_1 -name "*.htm*"`;
my @src_list_2 = `find $source_path_2 -name "*.htm*"`;

my $time = time();

for my $count (0..9) {
    
    my $list_cnt = 0;
    
    for my $file_src_1 (@src_list_1) {
    
        my $file_src_2 = $src_list_2[$list_cnt++];
       
        chomp $file_src_1;
        chomp $file_src_2;
    
        my $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_1)) {
            print "unable to open source file 1: $file_src_1\n";
        }
        my $filestring_1 = <FH>;
        $/ = $holdRS;
        close(FH);
        
        $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_2)) {
            print "unable to open source file 2: $file_src_2\n";
        }
        my $filestring_2 = <FH>;
        $/ = $holdRS;
        close(FH);
        
        $filestring_1 =~ s@\s+@\n@g;
        $filestring_2 =~ s@\s+@\n@g;
    
            ## begin diffutils algorithm:
    
            if (! open(FH, ">/tmp/diff_774885959483_1")) {
                print "unable to open temporary file\n";
            }
            print FH "$filestring_1";
            close (FH);
            
            if (! open(FH, ">/tmp/diff_774885959483_2")) {
                print "unable to open temporary file\n";
            }
            print FH "$filestring_2";
            close (FH);
            
            #print "$file_src_1  :::  $file_src_1\n";
        
            print `diff --suppress-common-lines -y /tmp/diff_774885959483_1 /tmp/diff_774885959483_2`;
            
            ## end diffutils algorithm
    }
}

my $time_4sig = time() - $time + .005;
$time_4sig =~ s@^(.....).*@$1@;
print STDERR "\n\net:  ".$time_4sig."\n";

exit;
