Why does Perl get slower when building a larger hash? (Not due to the memory swapping)

chialingh has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have 10,000 input files and each of them contains a list of numbers. What I want to do is comparing those files pairwisely and find common items between any two files. At the end, I'll store my results in three hashes.

1. common items/total number of items in file 1 and file 2

2. common items/number of items in file 1

3. common items/number of items in file 2

However, i did a test and found that the time needed to calculate the overlapping between two files increases with the number of the total files.

Here is my time log:

300 files: 0.15 ms (calculation time for each file pair)

500 files: 0.19 ms

700 files: 0.24 ms

900 files: 0.28 ms

1100 files: 0.33 ms

1300 files: 0.37 ms

2500 files: 0.55 ms

4500 files: 0.9 ms

My computer has sufficient memory and I'm pretty sure it didn't do memory swapping when running the script. Can anyone kindly tell me why the unit time increases with the number of input files?

Thanks!

#!/usr/bin/perl

use strict;

for(my $r = 200; $r<=10000; $r = $r + 200){
        my %file_list; # list of files
        my %file_gene;
        for(my $i = 1; $i <=$r; $i++){
                my $file = "$i.txt";
                open(INF, "$file");
                while(my $g=<INF>){
                        chomp $g;
                        $file_gene{$i}{$g} = ();
                }
                close INF;
                $file_list{$i} = ();
        }
        close IN;

        my %hash3; # overlapping percentage of file pairs
        my %hash4; # percentage of common items in file1
        my %hash5; # percentage of common items in file2

        my @file_list_array = keys %file_list; # list of file names
        my $file_number = $#file_list_array; # number of files - 1

        my @time = localtime(time);
        my $hr1 = $time[2];
        my $min1 = $time[1];
        my $sec1 = $time[0];

        my $x = 0;
                for(my $i = 0; $i<= $file_number - 1; $i++){
                        my $m1 = $file_list_array[$i];
                        my $value3 = 0;
                        my $value4 = 0;
                        my $value5 = 0;
                        my $pair;
                        for(my $j = $i + 1; $j <= $file_number; $j++){
                                $x = $x + 1;
                                my $m2 = $file_list_array[$j];
                                my ($Nvalue3, $Nvalue4, $Nvalue5) = fi
+nd_common_items($m1, $m2, \%file_gene, 0.1);
                                if($Nvalue3 > $value3){
                                        $pair = $m1."_".$m2; # file pa
+ir name
                                        $value3 = $Nvalue3;
                                        $value4 = $Nvalue4;
                                        $value5 = $Nvalue5;
                                }
                        }
                        if($pair){
                                $hash3{"$pair"} = $value3;
                                $hash4{"$pair"} = $value4;
                                $hash5{"$pair"} = $value5;
                        }
                }

        my @time = localtime(time);

        my $hr2 = $time[2];
        my $min2 = $time[1];
        my $sec2 = $time[0];

        my $hr = $hr2 - $hr1;
        my $min = $min2 - $min1;
        my $sec = $sec2 - $sec1;

        my $time = $hr*3600 + $min*60 + $sec;

        my $unit_time = $time/$x;
        print "$r files\t$unit_time sec\n";
}

sub find_common_items{

        my ($m1, $m2, $file_gene_ref, $cutoff) = @_;

        my %file_genes = %$file_gene_ref;

        my %hash1;
        my %hash2;

        my %hash1 = %{$file_genes{$m1}}; # genes in file m1
        my %hash2 = %{$file_genes{$m2}}; # genes in file m2

        my %intersection; # intersection items
        my %union;        # union items
        my $value3 = 0;
        my $value4 = 0;
        my $value5 = 0;

        # find intersection items
        foreach(keys %hash1){
                $intersection{$_} = $hash1{$_} if exists $hash2{$_};
        }
        my $isn = scalar keys %intersection; # number of intersection 
+items

        # find union items
        @union{keys %hash1, keys %hash2} = ();
        my $un = scalar keys %union; # number of union items

        # only store qualified file pairs
        if($isn/$un > $cutoff){ 
                my $s1 = scalar keys %hash1; # number of items in file
+ m1, size of file m1
                my $s2 = scalar keys %hash2; # number of items in file
+ m2, size of file m2
                $value3 = $isn/$un; # For file pair m1_m2, overlapping
+ percentage of file pairs = intersection/union
                $value4 = $isn/$s1;  # For file pair m1_m2, percentage
+ of common genes in file m1 = intersection/size of file m1
                $value5 = $isn/$s2;  # For file pair m1_m2, percentage
+ of common genes in file m2 = intersection/size of file m2
        }
        return($value3, $value4, $value5);
}
[download]

Comment on Why does Perl get slower when building a larger hash? (Not due to the memory swapping) Download Code

Replies are listed 'Best First'.
Re: Why does Perl get slower when building a larger hash? (Not due to the memory swapping) by kcott (Archbishop) on Mar 01, 2013 at 02:37 UTC
G'day chialingh, Welcome to the monastery. The first thing I'd suggest doing is adding `use warnings;` (warnings) after `use strict;` and fix up problems that are reported. There may be others, but a couple that grabbed my attention were `close IN;` (a filehandle that is never opened) and `my %hash1; ... my %hash1 ...` (declared twice in same scope - ditto for `%hash2`). One place where I believe you're doing unnecessary processing is: `my %file_genes = %$file_gene_ref; ... %hash1 = %{$file_genes{$m1}}; %hash2 = %{$file_genes{$m2}}; ... $intersection{$_} = $hash1{$_} if exists $hash2{$_}; ...` [download] Rather than performing all those hash dereferences, I believe you could simply write something like: `$intersection{$_} = $file_gene_ref->{$m1}{$_} if exists $file_gene_ref->{$m2}{$_};` [download] You'd need to handle other references to `%hash1` and `%hash2` in a similar fashion. At the very least, I don't believe you need to create the intermediate hash `%file_genes`: you could just create `%hash1` and `%hash2` directly, like this: `%hash1 = %{$file_gene_ref->{$m1}}; %hash2 = %{$file_gene_ref->{$m2}};` [download] If that doesn't help, you'll need to profile your code. Devel::NYTProf may be a good place to start; other modules listed under CPAN - Development Support might be useful. -- Ken	[reply] [d/l] [select]
Re^2: Why does Perl get slower when building a larger hash? (Not due to the memory swapping) by chialingh (Initiate) on Mar 03, 2013 at 20:44 UTC
Hi Ken, Thank you so much! Your solution works! :D I'm very curious about the reason behind this. Is it due to the memory usage? Chialingh	[reply]
Re: Why does Perl get slower when building a larger hash? (Not due to the memory swapping) by tmharish (Friar) on Mar 01, 2013 at 06:46 UTC
You are measuring time across a loop: `my @time = localtime(time); ... for(my $i = 0; $i<= $file_number - 1; $i++){ ... for(my $j = $i + 1; $j <= $file_number; $j++){ $x = $x + 1; my $m2 = $file_list_array[$j]; my ($Nvalue3, $Nvalue4, $Nvalue5) = fi +nd_common_items($m1, $m2, \%file_gene, 0.1); ... } ... } my @time = localtime(time);` [download] However, i did a test and found that the time needed to calculate the overlapping between two files increases with the number of the total files. Thats actually not what you are measuring - you are measuring the time taken to compare every file with every other file, and of course that time will go up with the number of files! -- Your loop should be close to O(Nlog(N))	[reply] [d/l]
Re^2: Why does Perl get slower when building a larger hash? (Not due to the memory swapping) by chialingh (Initiate) on Mar 01, 2013 at 16:21 UTC
Yes, what I did in the code is measuring the time taken to compare every file with every other file. However, the time log I showed is the running time divided by the number of file pairs, i.e. $time/$x In that case, the time is the unit time for each comparison and it shouldn't go up with the number of files, right? Sorry for the misleading code, I'll revise it. :)	[reply]
Re^3: Why does Perl get slower when building a larger hash? (Not due to the memory swapping) by tmharish (Friar) on Mar 04, 2013 at 08:56 UTC
Dividing by $x ( number of files ) does not help cause you still have a O( log(N) ) in there ( the inner loop ) As a demonstration, consider the following code that benchmarks hash access times: use strict ; use warnings ; use Time::HiRes qw( time ) ; my @bench_mark_points = ( 1_000, 10_000, 100_000, 1_000_000, 10_000_00 +0, 20_000_000, 30_000_000, 40_000_000, 50_000_000 ) ; my $bench_mark_point = 0 ; my %large_hash ; my @keys ; for( my $elem = 0 ; $elem < 50_000_001; $elem++ ) { my $key = rand( $elem ) . "hello" . rand( $elem ); $large_hash{ $key } = 1 ; push @keys, $key ; if( $elem == $bench_mark_points[ $bench_mark_point ] ) { _bench_mark( $bench_mark_points[ $bench_mark_point ] ) ; $bench_mark_point++ ; } } sub _bench_mark { my $benchmark_point = shift ; my @benchmark_keys = map( { $keys[ int( rand( $benchmark_point ) +) ] } ( 0 .. 1_000_000 ) ) ; my $total_time = 0 ; foreach my $key ( @benchmark_keys ) { my $start_time = time; my $val = $large_hash{ $key } ; my $end_time = time ; $total_time += ( $end_time - $start_time ) ; } print "Benchmarked Hash access of size $benchmark_point \t -> Acce +ss time for 1_000_000 keys: " . $total_time . "\n"; return ; } [download] Output Benchmarked Hash access of size 1000 -> Access time for 1_000_000 keys: 0.11689305305481 Benchmarked Hash access of size 10000 -> Access time for 1_000_000 keys: 0.121062278747559 Benchmarked Hash access of size 100000 -> Access time for 1_000_000 keys: 0.125393152236938 Benchmarked Hash access of size 1000000 -> Access time for 1_000_000 keys: 0.116819381713867 Benchmarked Hash access of size 10000000 -> Access time for 1_000_000 keys: 0.118601083755493 Benchmarked Hash access of size 20000000 -> Access time for 1_000_000 keys: 0.117170572280884 You will notice that the time to access a million keys is nearly always the same - The memory on the other hand shoots up dramatically reaching 7.8 GB at around 25 Million entries ( which is why my bench-marking stops there ) Perl is working really hard in the background to ensure that it builds structures that can ensure that finding an element ( which should be O(N) ) is closing constant time here. UPDATE: Here is the output again with the CPU throttled so the difference can actually be seen: Benchmarked Hash access of size 1000 -> Access time for 1_000_000 keys: 0.438439846038818 Benchmarked Hash access of size 10000 -> Access time for 1_000_000 keys: 0.467595815658569 Benchmarked Hash access of size 100000 -> Access time for 1_000_000 keys: 0.615071773529053 Benchmarked Hash access of size 1000000 -> Access time for 1_000_000 keys: 0.804181814193726 Benchmarked Hash access of size 10000000 -> Access time for 1_000_000 keys: 0.873048782348633 Benchmarked Hash access of size 20000000 -> Access time for 1_000_000 keys: 0.910530567169189 So the time does go up - and thats expected. You cannot find an element in less than O( logN ); The speed here is because Perl optimizes this at a low level, but eventually the effect adds up.	[reply] [d/l]
Re: Why does Perl get slower when building a larger hash? (Not due to the memory swapping) by sundialsvc4 (Abbot) on Mar 01, 2013 at 01:51 UTC
Superficially, I see only a straight-linear (more or less...) progression here. Could you please elaborate as to what exactly you expect to see (differently), and why? What is it about what you see, that triggers an alarm?
Re: Why does Perl get slower when building a larger hash? (Not due to the memory swapping) by jwkrahn (Abbot) on Mar 09, 2013 at 02:52 UTC
You have some problems with your code. For example, you have created 55 variables, some of them duplicates, while I was able to write the same code with only 21 variables. You use a hash to collect the file list and then copy its keys to an array when you only need the array. You use lists of scalars when you could just use a single array. You use three separate hashes when you could use a hash of arrays. Here is my version of your code: #!/usr/bin/perl use warnings; use strict; for my $files ( 1 .. 50 ) { $files *= 200; my @file_list; # list of files my %file_gene; for my $i ( 1 .. $files ) { open my $INF, '<', "$i.txt" or die "Cannot open '$i.txt' becau +se: $!"; chomp( my @g = <$INF> ); @{ $file_gene{ $i } }{ @g } = (); push @file_list, $i; } # overlapping percentage of file pairs # percentage of common items in file1 # percentage of common items in file2 my %file_pairs; my $time = time; my $pairs = 0; for my $i ( 0 .. $#file_list ) { my $m1 = $file_list[ $i ]; my ( @values, $pair ) = ( 0, 0, 0 ); for my $m2 ( @file_list[ $i + 1 .. $#file_list ] ) { ++$pairs; my @Nvalues = find_common_items( @file_gene{ $m1, $m2 }, 0 +.1 ); if ( $Nvalues[ 0 ] > $values[ 0 ] ) { $pair = $m1 . '_' . $m2; # file pair name @values = @Nvalues; } } if ( $pair ) { $file_pairs{ $pair } = \@values; } } $time = time - $time; print "$files files\t$pairs pairs\t$time sec\n"; } sub find_common_items { my ( $m1, $m2, $cutoff ) = @_; # find number of intersection items my $isn = grep exists $m2->{ $_ }, keys %$m1; # number of union items my $un = keys %{ { %$m1, %$m2 } }; # only store qualified file pairs my @values = ( 0, 0, 0 ); if ( $isn / $un > $cutoff ) { # For file pair m1_m2, overlapping percentage of file pairs = +intersection/union # For file pair m1_m2, percentage of common genes in file m1 = + intersection/size of file m1 # For file pair m1_m2, percentage of common genes in file m2 = + intersection/size of file m2 @values = ( $isn / $un, $isn / keys %$m1, $isn / keys %$m2 ) } return @values; } [download]	[reply] [d/l]


more useful options
	PerlMonks