http://www.perlmonks.org?node_id=1015925

austinj has asked for the wisdom of the Perl Monks concerning the following question:

I have the following code, where I loop through several files (approx. 300) and get the corresponding values from them. This builds a large hash where the first key is the profile name, followed by the info you want. Unfortunately, it slows down significantly when I run a large number of files. It takes approx. 0.05 seconds per profile if I do < 30 files, but if I do a large number (300+) it takes up to 1.0 seconds per profile. Is there a way I can initialize the hash, or some other way to speed this up when it is running a large number? Thanks

foreach my $customProf (@{scalar2array($Settings->{custom_plots}[$i][$t]{profiles})}){
    if(exists $profEvntVar->{$customProf}){next;}
    my ($evnt_nums_ref,$var_names_ref,$variables_ref,$last_evnt_name) = getvars($customProf);
    $profEvntVar->{$customProf}{events} = $evnt_nums_ref; #Link profiles with their respective events
    $profEvntVar->{$customProf}{times} = $evnt_times_ref; #Link profiles with their respective events
    $profEvntVar->{$customProf}{vars} = $var_names_ref; #Link profiles with their respective variables
    $profEvntVar->{$customProf}{varsref} = $variables_ref; #Link profiles with their respective variables
    $profEvntVar->{$customProf}{lastEvent} = $last_evnt_name; #Link profiles with their respective variables
}

sub getvars{
    my $profile = shift(@_);
    my @ptk_info = `<system_command_here>`;
    my $evnt_nums_ref;
    my $events_found = 0;
    my $vars;
    my $prof_var_names;
    my $last_evnt_name;
    foreach (@ptk_info){
        if(/^\s*(\d*\.\d*)0\s-\s(.*?)\s*$/){
            $evnt_nums_ref->{$2} = $1;
            $last_evnt_name = $2;
        }
        if($events_found == 1 && /^\b/){
            push( @{$vars}, split(" ") );
        }
        elsif($events_found == 0 && /^Events$/){
            $events_found = 1;
        }
    }
    foreach (@{$vars}){
        if(/(.{1,8})\S*:\S*/){
            $prof_var_names->{$1} = $_;
        }
        else{
            $prof_var_names->{$_} = $_;
        }
    }
    my @sorted_vars = sort { lc($a) cmp lc($b) } @{$vars};
    return $evnt_nums_ref,$prof_var_names,\@sorted_vars,$last_evnt_name;
}

Replies are listed 'Best First'.
Re: Speed up hash initialization loop
by graff (Chancellor) on Jan 30, 2013 at 05:15 UTC
    How many iterations of the initial "foreach" loop are you doing (i.e. how many "profiles" are there)? Is it the same set of 300+ files that the "getvars" sub is loading on each iteration, or does each "profile" bring in its own set of distinct files?

    If it's the same 300 files each time, you might see a big difference if you figure out how to restructure the looping so that you read each file exactly once, and populate all the profiles in that one pass over each file. But I'm only guessing, because you haven't provided enough info about the problem (number of profiles, total amount of data in the files, what manner of "system_command" are you running for each file).

    Apart from that, anything you do to simplify the "getvars" code will help some; e.g.:

    - don't use references to hashes and arrays when you don't need to ("prof_var_names" and "evnt_nums_ref" should just be plain hashes; you can return them as refs the same way you do "sorted_vars", and "vars" should just be @vars).

    - use a "pipe open" to run your system command, read from the pipe until you see /^Events$/, then read the data of interest - i.e.:

    sub getvars {
        my $profile = shift;
        my ( @vars, %evnt_nums, %prof_var_names, $last_evnt_name );
        open( my $ptk_info, '-|', "system command here" ) or die "$profile: $!\n";
        while (<$ptk_info>) {
            last if ( /^Events$/ );  # skip lines till this line is found
        }
        while (<$ptk_info>) {
            my @tkns = split;
            if ( $tkns[0] =~ /^(\d*\.\d*)0/ ) {
                $last_evnt_name = $tkns[2];
                $evnt_nums{$last_evnt_name} = $1;
            }
            push @vars, @tkns;
        }
        ... # (do other for loop, sort @vars)
        return \%evnt_nums, \%prof_var_names, \@sorted_vars, $last_evnt_name;
    }

      There are 300 profiles, and getvars only sees each profile once (it takes a profile location as an argument). I check that I haven't already run the profile, so as not to run it twice. I switched to the pipe open as suggested (no significant change in runtime). I also changed all hash/array refs to standard hashes and returned the refs as suggested (again, no significant runtime change). The files themselves are relatively small, and the system command returns approx. 30 lines of text, which I use in the regex. Thanks for the help

        How large will @$vars be? If larger than a hundred elements or so, you will benefit from a Schwartzian transform:

        my @sorted_vars = map { $_->[1] } sort { $a->[0] cmp $b->[0] } map { [ lc $_, $_ ] } @{$vars};

        I'm afraid I'm as out of ideas as the other posters here -- your only recourse is to use a profiler and find the bottlenecks that way.
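Short of a full profiler (Devel::NYTProf from CPAN is one option), a rough first step is to time the external command and the parsing separately. The following is only a sketch under stated assumptions: `timed_getvars`, `%elapsed`, and the canned `$fake_command` are hypothetical names that do not appear in the original code, and the canned lines merely imitate the "number - name" format that the original regex expects.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Accumulated wall-clock time per phase (hypothetical bookkeeping hash).
my %elapsed = ( command => 0, parse => 0 );

# Hypothetical stand-in for getvars(): it takes the command as a code ref
# so that the two phases can be timed independently.
sub timed_getvars {
    my ( $profile, $run_command ) = @_;

    my $t0    = [gettimeofday];
    my @lines = $run_command->($profile);
    $elapsed{command} += tv_interval($t0);

    $t0 = [gettimeofday];
    my %evnt_nums;
    for (@lines) {
        # Same event regex as in the original post.
        $evnt_nums{$2} = $1 if /^\s*(\d*\.\d*)0\s-\s(.*?)\s*$/;
    }
    $elapsed{parse} += tv_interval($t0);

    return \%evnt_nums;
}

# Assumption: canned output instead of the real system command.
my $fake_command = sub { return ( "1.50 - start\n", "2.750 - stop\n" ) };

my $events = timed_getvars( 'profile_a', $fake_command );
printf "command: %.6fs, parse: %.6fs\n", @elapsed{qw(command parse)};
```

If the `command` total dominates and grows with the number of profiles, the slowdown is in the external command or the I/O behind it rather than in building the hash.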

Re: Speed up hash initialization loop
by Anonymous Monk on Jan 29, 2013 at 20:27 UTC

    How much time is spent building those hashes, and how much time is spent on disk IO?

    I also see a lot of stars in your regexes (at least they're not deathstars .*). I'm sure they're not all needed. For example, in: if(/(.{1,8})\S*:\S*/){ the last \S* accomplishes nothing since it isn't captured.

      Thanks for the advice, I'm pretty sure the regex runs pretty quick - the reason being if I only run a couple profiles (20) the whole routine runs at 0.07 seconds per profile. However if I run a large number (300+) it slows down to ~1.0 seconds per profile (average). I assume this means that something with initializing/re-allocating memory to the hash is what is slowing me down, not the regex.

      Either way I took your advice and removed some unnecessary parts of the regex. But it still runs at approx the same speed.

      I just ran one more test: I ran 100 profiles, averaging approx. 1 second per profile. I then ran 3 profiles (that were in the 100 set), specifically 3 that had each taken 2+ seconds to run. Now, with only 3, they each ran in 0.07 seconds or less. I'm not sure why it seems to already know it has a lot of profiles ... unless it's because I'm passing the profiles in on the command line, for example /home/profile_*. Maybe it has to run this "ls"-type command every time? It seems it should only run it once, but I'm not sure I set it up that way. I'm pulling them in like this:

      my $arg_profs  =   \@ARGV; # set the remaining arguments AFTER you have read the template
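On the glob question: with a typical Unix shell, a pattern such as /home/profile_* is expanded by the shell before perl starts, so the script only ever sees a plain list in @ARGV, and taking a reference to it runs nothing. A minimal sketch, assuming three hypothetical file names in place of the real expansion:

```perl
use strict;
use warnings;

# Assumption: these names stand in for whatever the shell produced when it
# expanded /home/profile_* before the script was started.
local @ARGV = ( '/home/profile_a', '/home/profile_b', '/home/profile_c' );

# Taking a reference does not copy the array or re-run any glob.
my $arg_profs = \@ARGV;

my $count = scalar @{$arg_profs};
print "$count profiles passed on the command line\n";
```

So the expansion itself should not account for a per-profile cost that rises with the number of arguments.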
Re: Speed up hash initialization loop
by bulk88 (Priest) on Jan 30, 2013 at 03:12 UTC
    Write $profEvntVar->{$customProf} only once: assign the ref to a lexical, then deref the lexical each time. That cuts 2 lookups to 1 on each line.

      bulk88:

      So the compiler doesn't do a 'common subexpression elimination' optimization? Or does it do such a thing, but it can't optimize that due to the possibility of too much "magic" going on?

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        Correct. $profEvntVar may be magical and return a different hash every time it's read. In that hash, slice {$customProf} might be magical and different every time. If you write the var X many times in source, it will be called/read X many times in source. There is no caching.
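The "magic" point can be made visible with a tied hash that counts its reads. This is only an illustrative sketch: `CountingHash` is a hypothetical class, and it ties a plain hash `%profEvntVar` rather than the hash ref used in the thread. It relies on `Tie::StdHash`, which ships with Perl's core `Tie::Hash` module.

```perl
use strict;
use warnings;

package CountingHash;    # hypothetical class, for illustration only
require Tie::Hash;
our @ISA     = ('Tie::StdHash');
our $fetches = 0;

# Count every read of a top-level element, then behave like a normal hash.
sub FETCH {
    my ( $self, $key ) = @_;
    $fetches++;
    return $self->{$key};
}

package main;

tie my %profEvntVar, 'CountingHash';
$profEvntVar{profile_a} = {};

# The top-level lookup is written twice in the source, so it is performed
# again for each statement.
$profEvntVar{profile_a}{events} = 1;
$profEvntVar{profile_a}{times}  = 2;
my $repeated = $CountingHash::fetches;

# Look the element up once, keep the inner hash ref in a lexical, reuse it.
$CountingHash::fetches = 0;
my $cprof = $profEvntVar{profile_a};
$cprof->{vars}      = 3;
$cprof->{lastEvent} = 4;
my $cached = $CountingHash::fetches;

print "repeated lookups: $repeated FETCH calls; cached ref: $cached FETCH call(s)\n";
```

Because perl cannot know in advance whether a lookup has such side effects, it performs every lookup that appears in the source; holding the inner ref in a lexical is the manual equivalent of the optimization.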
        I'm not using a compiler... actually I didn't even know there were perl compilers, but if you point me in the right direction I'd be happy to learn

      This sounds like what I need to do but I don't understand where to deref. Should this be within the subroutine? If it is outside of the subroutine, I think I'm already doing this, my %profEvntVar is defined above, and in the end I need it to contain all of the info about each profile in @profiles

      Sorry if I'm missing something

        if(exists $profEvntVar->{$customProf}){next;}
        my $cprof = $profEvntVar->{$customProf} = {};
        my ($evnt_nums_ref,$var_names_ref,$variables_ref,$last_evnt_name) = getvars($customProf);
        $cprof->{events} = $evnt_nums_ref; #Link profiles with their respective events
        $cprof->{times} = $evnt_times_ref; #Link profiles with their respective events
        # etc

        FWIW, I don't think there'll be much gain (it should be in the order of microseconds), but since they're in a loop it might add up. This is not really an optimisation technique, but a code cleanup one.

        (Do change the naming of $cprof if you can think of a better name.)

Re: Speed up hash initialization loop
by clueless newbie (Curate) on Jan 30, 2013 at 17:42 UTC