Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Why does Perl get slower when building a larger hash? (Not due to the memory swapping)

by chialingh (Initiate)
on Feb 28, 2013 at 22:57 UTC ( #1021121=perlquestion: print w/ replies, xml ) Need Help??
chialingh has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have 10,000 input files and each of them contains a list of numbers. What I want to do is comparing those files pairwisely and find common items between any two files. At the end, I'll store my results in three hashes.

1. common items/total number of items in file 1 and file 2

2. common items/number of items in file 1

3. common items/number of items in file 2

However, i did a test and found that the time needed to calculate the overlapping between two files increases with the number of the total files.

Here is my time log:

300 files: 0.15 ms (calculation time for each file pair)

500 files: 0.19 ms

700 files: 0.24 ms

900 files: 0.28 ms

1100 files: 0.33 ms

1300 files: 0.37 ms

2500 files: 0.55 ms

4500 files: 0.9 ms

My computer has sufficient memory and I'm pretty sure it didn't do memory swapping when running the script. Can anyone kindly tell me why the unit time increases with the number of input files?

Thanks!
#!/usr/bin/perl use strict; for(my $r = 200; $r<=10000; $r = $r + 200){ my %file_list; # list of files my %file_gene; for(my $i = 1; $i <=$r; $i++){ my $file = "$i.txt"; open(INF, "$file"); while(my $g=<INF>){ chomp $g; $file_gene{$i}{$g} = (); } close INF; $file_list{$i} = (); } close IN; my %hash3; # overlapping percentage of file pairs my %hash4; # percentage of common items in file1 my %hash5; # percentage of common items in file2 my @file_list_array = keys %file_list; # list of file names my $file_number = $#file_list_array; # number of files - 1 my @time = localtime(time); my $hr1 = $time[2]; my $min1 = $time[1]; my $sec1 = $time[0]; my $x = 0; for(my $i = 0; $i<= $file_number - 1; $i++){ my $m1 = $file_list_array[$i]; my $value3 = 0; my $value4 = 0; my $value5 = 0; my $pair; for(my $j = $i + 1; $j <= $file_number; $j++){ $x = $x + 1; my $m2 = $file_list_array[$j]; my ($Nvalue3, $Nvalue4, $Nvalue5) = fi +nd_common_items($m1, $m2, \%file_gene, 0.1); if($Nvalue3 > $value3){ $pair = $m1."_".$m2; # file pa +ir name $value3 = $Nvalue3; $value4 = $Nvalue4; $value5 = $Nvalue5; } } if($pair){ $hash3{"$pair"} = $value3; $hash4{"$pair"} = $value4; $hash5{"$pair"} = $value5; } } my @time = localtime(time); my $hr2 = $time[2]; my $min2 = $time[1]; my $sec2 = $time[0]; my $hr = $hr2 - $hr1; my $min = $min2 - $min1; my $sec = $sec2 - $sec1; my $time = $hr*3600 + $min*60 + $sec; my $unit_time = $time/$x; print "$r files\t$unit_time sec\n"; } sub find_common_items{ my ($m1, $m2, $file_gene_ref, $cutoff) = @_; my %file_genes = %$file_gene_ref; my %hash1; my %hash2; my %hash1 = %{$file_genes{$m1}}; # genes in file m1 my %hash2 = %{$file_genes{$m2}}; # genes in file m2 my %intersection; # intersection items my %union; # union items my $value3 = 0; my $value4 = 0; my $value5 = 0; # find intersection items foreach(keys %hash1){ $intersection{$_} = $hash1{$_} if exists $hash2{$_}; } my $isn = scalar keys %intersection; # number of intersection +items # find union items @union{keys %hash1, keys %hash2} = (); my $un = scalar keys %union; # number of union items # only store qualified file pairs if($isn/$un > $cutoff){ my $s1 = scalar keys %hash1; # number of items in file + m1, size of file m1 my $s2 = scalar keys %hash2; # number of items in file + m2, size of file m2 $value3 = $isn/$un; # For file pair m1_m2, overlapping + percentage of file pairs = intersection/union $value4 = $isn/$s1; # For file pair m1_m2, percentage + of common genes in file m1 = intersection/size of file m1 $value5 = $isn/$s2; # For file pair m1_m2, percentage + of common genes in file m2 = intersection/size of file m2 } return($value3, $value4, $value5); }

Comment on Why does Perl get slower when building a larger hash? (Not due to the memory swapping)
Download Code
Re: Why does Perl get slower when building a larger hash? (Not due to the memory swapping)
by sundialsvc4 (Abbot) on Mar 01, 2013 at 01:51 UTC

    Superficially, I see only a straight-linear (more or less...) progression here.   Could you please elaborate as to what exactly you expect to see (differently), and why?   What is it about what you see, that triggers an alarm?

Re: Why does Perl get slower when building a larger hash? (Not due to the memory swapping)
by kcott (Abbot) on Mar 01, 2013 at 02:37 UTC

    G'day chialingh,

    Welcome to the monastery.

    The first thing I'd suggest doing is adding use warnings; (warnings) after use strict; and fix up problems that are reported. There may be others, but a couple that grabbed my attention were close IN; (a filehandle that is never opened) and my %hash1; ... my %hash1 ... (declared twice in same scope - ditto for %hash2).

    One place where I believe you're doing unnecessary processing is:

    my %file_genes = %$file_gene_ref; ... %hash1 = %{$file_genes{$m1}}; %hash2 = %{$file_genes{$m2}}; ... $intersection{$_} = $hash1{$_} if exists $hash2{$_}; ...

    Rather than performing all those hash dereferences, I believe you could simply write something like:

    $intersection{$_} = $file_gene_ref->{$m1}{$_} if exists $file_gene_ref->{$m2}{$_};

    You'd need to handle other references to %hash1 and %hash2 in a similar fashion. At the very least, I don't believe you need to create the intermediate hash %file_genes: you could just create %hash1 and %hash2 directly, like this:

    %hash1 = %{$file_gene_ref->{$m1}}; %hash2 = %{$file_gene_ref->{$m2}};

    If that doesn't help, you'll need to profile your code. Devel::NYTProf may be a good place to start; other modules listed under CPAN - Development Support might be useful.

    -- Ken

      Hi Ken,

      Thank you so much! Your solution works! :D

      I'm very curious about the reason behind this. Is it due to the memory usage?

      Chialingh
Re: Why does Perl get slower when building a larger hash? (Not due to the memory swapping)
by tmharish (Friar) on Mar 01, 2013 at 06:46 UTC

    You are measuring time across a loop:

    my @time = localtime(time); ... for(my $i = 0; $i<= $file_number - 1; $i++){ ... for(my $j = $i + 1; $j <= $file_number; $j++){ $x = $x + 1; my $m2 = $file_list_array[$j]; my ($Nvalue3, $Nvalue4, $Nvalue5) = fi +nd_common_items($m1, $m2, \%file_gene, 0.1); ... } ... } my @time = localtime(time);
    However, i did a test and found that the time needed to calculate the overlapping between two files increases with the number of the total files.

    Thats actually not what you are measuring - you are measuring the time taken to compare every file with every other file, and of course that time will go up with the number of files! -- Your loop should be close to O(Nlog(N))

      Yes, what I did in the code is measuring the time taken to compare every file with every other file. However, the time log I showed is the running time divided by the number of file pairs, i.e. $time/$x

      In that case, the time is the unit time for each comparison and it shouldn't go up with the number of files, right?

      Sorry for the misleading code, I'll revise it. :)

        Dividing by $x ( number of files ) does not help cause you still have a O( log(N) ) in there ( the inner loop )

        As a demonstration, consider the following code that benchmarks hash access times:

        use strict ; use warnings ; use Time::HiRes qw( time ) ; my @bench_mark_points = ( 1_000, 10_000, 100_000, 1_000_000, 10_000_00 +0, 20_000_000, 30_000_000, 40_000_000, 50_000_000 ) ; my $bench_mark_point = 0 ; my %large_hash ; my @keys ; for( my $elem = 0 ; $elem < 50_000_001; $elem++ ) { my $key = rand( $elem ) . "hello" . rand( $elem ); $large_hash{ $key } = 1 ; push @keys, $key ; if( $elem == $bench_mark_points[ $bench_mark_point ] ) { _bench_mark( $bench_mark_points[ $bench_mark_point ] ) ; $bench_mark_point++ ; } } sub _bench_mark { my $benchmark_point = shift ; my @benchmark_keys = map( { $keys[ int( rand( $benchmark_point ) +) ] } ( 0 .. 1_000_000 ) ) ; my $total_time = 0 ; foreach my $key ( @benchmark_keys ) { my $start_time = time; my $val = $large_hash{ $key } ; my $end_time = time ; $total_time += ( $end_time - $start_time ) ; } print "Benchmarked Hash access of size $benchmark_point \t -> Acce +ss time for 1_000_000 keys: " . $total_time . "\n"; return ; }

        Output

        Benchmarked Hash access of size 1000    	 -> Access time for 1_000_000 keys: 0.11689305305481
        Benchmarked Hash access of size 10000   	 -> Access time for 1_000_000 keys: 0.121062278747559
        Benchmarked Hash access of size 100000   	 -> Access time for 1_000_000 keys: 0.125393152236938
        Benchmarked Hash access of size 1000000 	 -> Access time for 1_000_000 keys: 0.116819381713867
        Benchmarked Hash access of size 10000000 	 -> Access time for 1_000_000 keys: 0.118601083755493
        Benchmarked Hash access of size 20000000 	 -> Access time for 1_000_000 keys: 0.117170572280884
        

        You will notice that the time to access a million keys is nearly always the same - The memory on the other hand shoots up dramatically reaching 7.8 GB at around 25 Million entries ( which is why my bench-marking stops there )

        Perl is working really hard in the background to ensure that it builds structures that can ensure that finding an element ( which should be O(N) ) is closing constant time here.

        UPDATE: Here is the output again with the CPU throttled so the difference can actually be seen:

        Benchmarked Hash access of size 1000    	 -> Access time for 1_000_000 keys: 0.438439846038818
        Benchmarked Hash access of size 10000   	 -> Access time for 1_000_000 keys: 0.467595815658569
        Benchmarked Hash access of size 100000  	 -> Access time for 1_000_000 keys: 0.615071773529053
        Benchmarked Hash access of size 1000000 	 -> Access time for 1_000_000 keys: 0.804181814193726
        Benchmarked Hash access of size 10000000 	 -> Access time for 1_000_000 keys: 0.873048782348633
        Benchmarked Hash access of size 20000000 	 -> Access time for 1_000_000 keys: 0.910530567169189
        

        So the time does go up - and thats expected. You cannot find an element in less than O( logN ); The speed here is because Perl optimizes this at a low level, but eventually the effect adds up.

Re: Why does Perl get slower when building a larger hash? (Not due to the memory swapping)
by jwkrahn (Monsignor) on Mar 09, 2013 at 02:52 UTC

    You have some problems with your code.    For example, you have created 55 variables, some of them duplicates, while I was able to write the same code with only 21 variables.    You use a hash to collect the file list and then copy its keys to an array when you only need the array.    You use lists of scalars when you could just use a single array.    You use three separate hashes when you could use a hash of arrays.

    Here is my version of your code:

    #!/usr/bin/perl use warnings; use strict; for my $files ( 1 .. 50 ) { $files *= 200; my @file_list; # list of files my %file_gene; for my $i ( 1 .. $files ) { open my $INF, '<', "$i.txt" or die "Cannot open '$i.txt' becau +se: $!"; chomp( my @g = <$INF> ); @{ $file_gene{ $i } }{ @g } = (); push @file_list, $i; } # overlapping percentage of file pairs # percentage of common items in file1 # percentage of common items in file2 my %file_pairs; my $time = time; my $pairs = 0; for my $i ( 0 .. $#file_list ) { my $m1 = $file_list[ $i ]; my ( @values, $pair ) = ( 0, 0, 0 ); for my $m2 ( @file_list[ $i + 1 .. $#file_list ] ) { ++$pairs; my @Nvalues = find_common_items( @file_gene{ $m1, $m2 }, 0 +.1 ); if ( $Nvalues[ 0 ] > $values[ 0 ] ) { $pair = $m1 . '_' . $m2; # file pair name @values = @Nvalues; } } if ( $pair ) { $file_pairs{ $pair } = \@values; } } $time = time - $time; print "$files files\t$pairs pairs\t$time sec\n"; } sub find_common_items { my ( $m1, $m2, $cutoff ) = @_; # find number of intersection items my $isn = grep exists $m2->{ $_ }, keys %$m1; # number of union items my $un = keys %{ { %$m1, %$m2 } }; # only store qualified file pairs my @values = ( 0, 0, 0 ); if ( $isn / $un > $cutoff ) { # For file pair m1_m2, overlapping percentage of file pairs = +intersection/union # For file pair m1_m2, percentage of common genes in file m1 = + intersection/size of file m1 # For file pair m1_m2, percentage of common genes in file m2 = + intersection/size of file m2 @values = ( $isn / $un, $isn / keys %$m1, $isn / keys %$m2 ) } return @values; }

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1021121]
Approved by davies
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (11)
As of 2014-10-01 18:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    What is your favourite meta-syntactic variable name?














    Results (32 votes), past polls