Perl Code Runs Extremely Slow

by garbage777 (Acolyte)
on Jun 13, 2006 at 20:00 UTC
garbage777 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have written a small script to process two files (each of a different size) in the following way:

1. Open file 1 and create a hash of (key, value) pairs.
2. Open file 2, create a hash of (key, value) pairs.
3. Using the keys from Step 1, print the corresponding values from the hash created in Step 2.

To perform these three steps, I am using the following code (please excuse any inadvertent mistake, this is only my second Perl program).
#!/usr/bin/perl
use strict;
use warnings;

open(IHF, "<", "File1");
print "\n\n";
while( <IHF> ) {
    my %fets;
    my ($k, $v) = split;
    $fets{$k} = $v if defined ($k) and defined ($v);
    my @keys = sort keys %fets;
    my $trgt = getval(@keys);
    print "$k $fets{$k} $trgt\n";
}

sub getval {
    my @k = @_;
    my %vals;
    open (JHF, "<", "File2");
    while(<JHF>) {
        my ($key, $val) = split;
        $vals{$key} = $val if defined $val and defined $key;
    }
    for my $i (@k) {
        return "$vals{$i}";
    }
}
When I execute this code on File1 (3.4 GB) and File2 (1.4 GB), it runs extremely slowly.
Can any Monks kindly suggest any way to improve the runtime performance?
I appreciate it very much.

Re: Perl Code Runs Extremely Slow
by samtregar (Abbot) on Jun 13, 2006 at 20:06 UTC
    There are so many performance problems in this code that it's kind of hard to know where to begin! Here are a few that jump out right away:

    • Don't open file 2 for each line of file 1 and read through every line! If there are 1 million lines in file 1 and 500 thousand lines in file 2, then you'll read 500 billion lines from file 2! Instead, read file 2 once and re-use the hash for each lookup.
    • Don't re-sort all the keys from file 1 every time you read a line from file 1. (UPDATE: Looking again I see that %fets is actually local to the while(). Why are you using a hash at all here? Why are you calling sort() when only one key is present?)
    • You may not have enough memory to actually hold all of file 2 in memory at once. If you don't, you'll run into swap, which will be slow no matter what you do. You can fix this by storing the hash in a database file via DB_File or something similar (see the sketch after this list).
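
    For the DB_File route, a minimal sketch of how the File2 hash could be tied to an on-disk database instead of kept in memory (the cache file name and flags here are only an example, not from the original code):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie %vals to an on-disk hash so it does not have to fit in RAM.
    # 'file2.db' is a hypothetical cache file name.
    my %vals;
    tie %vals, 'DB_File', 'file2.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot tie file2.db: $!";

    open JHF, '<', 'File2' or die "Cannot open 'File2': $!";
    while (<JHF>) {
        my ($key, $val) = split;
        $vals{$key} = $val if defined $key and defined $val;
    }
    close JHF;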

    -sam

Re: Perl Code Runs Extremely Slow
by jwkrahn (Monsignor) on Jun 13, 2006 at 20:27 UTC
    You don't really need two hashes for that. Your code can be simplified to:
    #!/usr/bin/perl
    use strict;
    use warnings;

    open JHF, '<', 'File2' or die "Cannot open 'File2' $!";
    my %vals;
    while ( <JHF> ) {
        my ( $key, $val ) = split;
        $vals{ $key } = $val if defined $val and defined $key;
    }
    close JHF;

    open IHF, '<', 'File1' or die "Cannot open 'File1' $!";
    print "\n\n";
    while ( <IHF> ) {
        my ( $k, $v ) = split;
        next unless defined $k and defined $v;
        print "$k $v $vals{$k}\n" if exists $vals{ $k };
    }
Re: Perl Code Runs Extremely Slow
by Mandrake (Chaplain) on Jun 14, 2006 at 06:07 UTC
    Another approach would be to use a hash for the first file and use the exists operator to check for the existence of each key in the second. Example:
    #!/usr/bin/perl -w
    use strict;

    my %fets;
    print "\n\n";
    open IHF, "<File1" || die "Could not open File1".$!;
    while( <IHF> ) {
        my ($k, $v) = split;
        $fets{$k} = $v if defined ($k) and defined ($v);
    }
    close IHF ;

    open JHF, "<File2" || die "Could not open File2".$!;
    while(<JHF>) {
        my ($key, $val) = split;
        print $key." ".$fets{$key}." ".$val."\n" if exists $fets{$key} ;
    }
    close JHF ;
    Thanks..

      A few minor problems with your code. Here's my idea of a better version (update: starting from the second verse).

      open IHF, "<", "File1" or die "Could not open File1: $!";
      while( <IHF> ) {
          chomp;
          my ($k, $v) = split;
          $fets{$k} = $v if defined ($k) and defined ($v);
      }
      close IHF;

      open JHF, "<", "File2" or die "Could not open File2: $!";
      while(<JHF>) {
          chomp;
          my ($key, $val) = split;
          print "$key $fets{$key} $val\n" if exists $fets{$key};
      }
      close JHF;

      • another intruder with the mooring in the heart of the Perl

        I see you chomp the lines. I missed it probably because the original post does not seem to do it.
        Many thanks for the correction.
      That's unlikely to work very well. File 1 is gigantic, so reading it all into a single hash is likely to cause either an out-of-memory error or at least run into swap and run really slowly. There's a good summary of how much memory a hash will use in Devel::Size - it's likely to be much more than the size of the file, which just about guarantees that it can't fit in memory in this case.
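
      A rough way to see that overhead for yourself (the sample hash below is made up purely for illustration):

      use Devel::Size qw(total_size);

      # Measure the total memory footprint of a small sample hash,
      # including keys, values and the hash structure itself.
      my %sample = map { ( "key$_" => "value$_" ) } 1 .. 10_000;
      print total_size( \%sample ), " bytes for 10,000 pairs\n";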

      -sam

Re: Perl Code Runs Extremely Slow
by Moron (Curate) on Jun 14, 2006 at 13:12 UTC
    My recommendations would include:

    - open and pass through each file only once for the entire run

    - explicitly close files

    - don't use sort, generally unnecessary except when sorting output

    - if either of the files is reused for multiple runs, cache its hash with Storable instead of re-parsing it each time (see the sketch below).

    The other replies already supply examples for some of these points.
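
    A minimal sketch of the Storable idea, assuming a made-up cache file name:

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    # Cache the parsed hash from File2 so later runs skip the parsing pass.
    # 'file2.cache' is a hypothetical cache file name.
    my %vals;
    if ( -e 'file2.cache' ) {
        %vals = %{ retrieve('file2.cache') };
    }
    else {
        open JHF, '<', 'File2' or die "Cannot open 'File2': $!";
        while (<JHF>) {
            my ($key, $val) = split;
            $vals{$key} = $val if defined $key and defined $val;
        }
        close JHF;
        store \%vals, 'file2.cache';
    }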

    -M

    Free your mind

Re: Perl Code Runs Extremely Slow
by TGI (Vicar) on Jun 14, 2006 at 18:57 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %file1_data;
    my %file2_data;

    print "\n\n";

    # Process data files
    ProcessFile( "File1", \%file1_data );
    ProcessFile( "File2", \%file2_data );

    # Format output
    my @keys = sort keys %file1_data;
    foreach my $key ( @keys ) {
        my $target = exists $file2_data{$key} ? $file2_data{$key} : '';
        print "$key $file1_data{$key} $target\n";
    }

    # Process a file, write data into supplied hash ref.
    sub ProcessFile {
        my $filename = shift;
        my $data     = shift;   # Pass data by reference - big hashes used here.

        open( DATAFILE, '<', $filename )
            or die "Unable to open $filename - $!\n";

        # Store data in hash.
        # Only the last instance of any key is stored.
        while ( my $line = <DATAFILE> ) {
            my ($key, $value) = split /\s+/, $line, 2;   # split into max of 2 fields.
            $data->{$key} = $value;
        }

        close( DATAFILE )
            or die "Unable to close $filename - $!\n";
    }

    As others have said, we only open and read each file once. That is a huge savings.

    I put the identical file processing code into a subroutine. I passed a hash reference to the subroutine to speed up data exchange--returning a hash would force a copy of each hash to be made as data is passed back from the function. If you prefer to avoid the non-obvious munging of an argument, the sub could look like:

    my $file1_data = ProcessFile("File1");
    my $file2_data = ProcessFile("File2");

    # ...

    sub ProcessFile {
        my $filename = shift;
        my %data;
        #...
        return \%data;
    }

    Both are about equally efficient, and which you use is a stylistic issue.

    Keep using strict and the 3 argument version of open. Use the warnings pragma rather than the '-w' switch.

    Sorts are expensive; you want to do them as few times as possible.

    You may want to look into how to measure the computational complexity of an algorithm and 'big O notation'. In practice, I have found that the need to do this type of analysis is limited, but learning it will give you an intuitive sense of what types of things are costly--which will improve your algorithms.

    Keep up the good work!


    TGI says moo

      This is the worst one yet! How much memory do you think he has? He'd have to be on a 64-bit platform with 8GB of RAM for this script to have a chance of running to completion. Check out Devel::Size and think about what's going to happen when you put nearly 5GB of data into hashes!

      -sam

        Well, doesn't everyone have an S390?

        Assuming that all his keys are unique, you are right, the OP will need to use some sort of disk-based storage--he probably doesn't have enough RAM. Perhaps an SQL database (SQLite, anyone?) or a dbm file would do the trick.

        Any reason why tying each hash to a DB_File would be a bad idea? The largest datasets I've had to deal with have only been a few tens of megabytes in size. So, is DB_File up to the task?

        Would you care to offer a suggestion?

        Edit: Upon reviewing the OP's code, it looks like he is keeping both hashes in RAM at the same time (at least the worst-case memory consumption for my code and the OP's is the same). If his code runs to completion as posted, so will mine. I also noted your suggestion to try DB_File, and so struck my request for suggestions.
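
        A rough sketch of the SQLite route, assuming DBI and DBD::SQLite are installed (the database and table names below are made up):

        use strict;
        use warnings;
        use DBI;

        # Load File2 into an on-disk SQLite table so lookups need little RAM.
        my $dbh = DBI->connect( "dbi:SQLite:dbname=lookup.db", "", "",
                                { RaiseError => 1, AutoCommit => 0 } );
        $dbh->do("CREATE TABLE IF NOT EXISTS file2 (k TEXT PRIMARY KEY, v TEXT)");

        my $ins = $dbh->prepare("INSERT OR REPLACE INTO file2 (k, v) VALUES (?, ?)");
        open JHF, '<', 'File2' or die "Cannot open 'File2': $!";
        while (<JHF>) {
            my ($key, $val) = split;
            $ins->execute( $key, $val ) if defined $key and defined $val;
        }
        close JHF;
        $dbh->commit;

        # Later: look up a single key without holding the whole file in memory.
        my ($v) = $dbh->selectrow_array( "SELECT v FROM file2 WHERE k = ?", undef, "somekey" );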


        TGI says moo

Re: Perl Code Runs Extremely Slow
by TGI (Vicar) on Jun 15, 2006 at 20:46 UTC

    Upon reviewing this thread, a new concern occurred to me. You may not be getting the results you want with this code. A hash only stores one value for each key.

    For example, if your data looks like:

    File 1:
    a  1
    b  2
    c  3
    d  4
    a  5
    
    File2:
    a  10
    b  11
    c  20
    c  12
    

    Your output would be something like:

    a  5  10
    b  2  11
    c  3  12
    d  4
    

    If you were looking for something like:

    a  1  10
    a  5  10
    b  2  11
    c  3  20
    c  3  12
    d  4
    
    you'll need to make some decisions about how you handle duplicate keys in your files.

    If you need to keep all your data, I recommend reading the data structures cookbook.
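
    For example, one way to keep every value is to push each one onto an array reference per key (a sketch only -- this still keeps everything in RAM; see the next paragraph for moving it to disk):

    my %fets;
    open IHF, '<', 'File1' or die "Could not open File1: $!";
    while (<IHF>) {
        my ($k, $v) = split;
        # Keep every value seen for this key, not just the last one.
        push @{ $fets{$k} }, $v if defined $k and defined $v;
    }
    close IHF;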

    You will also need to move your data out of RAM and onto a disk (DB_File is probably the easiest way), since you won't be overwriting your data. The RAM issues samtregar mentioned will become particularly acute. If you are going to be building complex data structures, MLDBM will work better for you than DB_File.

    BTW, to create a hash tied to a DBM file, you can use:

    my %file1;
    tie %file1, "DB_File", "file1.data";
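
    And if you need to store per-key lists (rather than single scalars) on disk, a rough MLDBM sketch along those lines (file name and serializer choice are just examples):

    use Fcntl;
    use MLDBM qw(DB_File Storable);   # DB_File as the DBM back end, Storable as the serializer

    my %file1;
    tie %file1, 'MLDBM', 'file1.mldbm', O_CREAT|O_RDWR, 0644
        or die "Cannot tie file1.mldbm: $!";

    # MLDBM caveat: nested structures cannot be modified in place.
    # Fetch the whole value, change the copy, then store it back.
    my $list = $file1{a} || [];
    push @$list, 5;
    $file1{a} = $list;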


    TGI says moo
