 
PerlMonks  

Optimization Help on Perl Hash Traversal (eval use)

by mwb613 (Novice)
on Feb 06, 2013 at 21:56 UTC ( #1017510=perlquestion )
mwb613 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

Thanks for looking.

I am hoping to optimize (read: speed up) a script that produces statistics from call log files. The files are flat files in which each line describes a leg of a SIP phone call. Each call has a unique ID, but a call can have multiple legs (and therefore multiple occurrences of the ID in each file). A short summary of the method would be this:

1: Loop through log file and create a hashref keyed on the unique ID of the call that captures certain columns from the file. e.g.:

    @this_line = split(/;/,$_);
    $session_id = $this_line[0];
    $call_leg_index = $this_line[1];
    $dur = $this_line[2];
    $pdd = $this_line[3];
    $call_ids->{$session_id}->{$call_leg_index}->{duration} = $dur;
    $call_ids->{$session_id}->{$call_leg_index}->{post_dial_delay} = $pdd;
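A minimal sketch of step 1, assuming each row has the layout `session_id;call_leg_index;duration;post_dial_delay`. The sample rows and the in-memory `@lines` array are invented for illustration; the real script would read from the log filehandle instead.

```perl
use strict;
use warnings;

# Hypothetical sample rows: session_id;call_leg_index;duration;post_dial_delay
my @lines = (
    "abc123;1;42;3\n",
    "abc123;2;17;5\n",
    "def456;1;60;2\n",
);

my $call_ids = {};
for my $line (@lines) {
    chomp $line;
    my ( $session_id, $call_leg_index, $dur, $pdd ) = split /;/, $line;

    # Skip rows without a usable session ID.
    next unless defined $session_id && length $session_id;

    # Autovivification creates the intermediate hashes on first use.
    $call_ids->{$session_id}{$call_leg_index} = {
        duration        => $dur,
        post_dial_delay => $pdd,
    };
}

# Report how many legs were seen for each call.
for my $id ( sort keys %$call_ids ) {
    my $leg_count = scalar( keys %{ $call_ids->{$id} } );
    print "$id has $leg_count leg(s)\n";
}
```

Splitting straight into named lexicals avoids the intermediate `@this_line` array, and building the per-leg hash in one assignment keeps the column names in a single place.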

2: Here's where the slowness is coming in. I loop through the resulting hash (based on session_id) and use an inner loop to run a few eval blocks in order to calculate my statistics. The strings that I put in the eval block are pulled from a hashref I created from a "config table" in mySQL (uses a fetchall_hashref so I'm only doing it once).

    for my $this_call_id ( sort keys %$call_ids ) {
        $count++;
        next if !$this_call_id;
        my $route_attempts = 0;
        for my $this_index ( sort keys %{$call_ids->{ $this_call_id }} ) {
            $route_attempts++;
            foreach $aggregate_name ( keys(%{$agg_snippets})){
                my $this_group_data = eval $grouping_data_eval;
                my $snippet = $agg_snippets->{$aggregate_name}->{'snippet'};
                $summary_data->{$this_group_data}->{$aggregate_name} = 0 if !$summary_data->{$this_group_data}->{$aggregate_name};
                $summary_data->{$this_group_data}->{$aggregate_name} += eval $snippet;
            }
        }
    }

3: Loop through newly created hash and push stats to a mySQL database. This is working very quickly.

The idea behind the eval blocks is to add a layer of abstraction so that when I need to add additional statistical analyses I can add entries to mysql with the proper eval string. I know this could be done via flat file or XML but I don't believe it is costing any extra time to dip the DB once to get my eval strings.

I am looking to have statistics as near to real time as I can, but the above process is really bogging down during the loop that groups data into the second hashref. My purpose is to push this data into a database that can be queried and graphed. The logs cut off every 5 minutes, and the process is taking about that long on a large file (100k rows), so I don't believe I'll be able to catch up. When I look at top, it shows Perl using about a full processor but not a whole lot of memory (~4%).

I'm hoping that someone might have an idea that could help speed the process up. I'm not looking for anyone to write code for me. Just point me in the right directions or drop a few cryptic terms that I can research.

This is done in Perl 5.10.1 on Centos 6, fyi.

Re: Optimization Help on Perl Hash Traversal (eval use)
by BrowserUk (Pope) on Feb 06, 2013 at 22:07 UTC
    The idea behind the eval blocks is to add a layer of abstraction so that when I need to add additional statistical analyses I can add entries to mysql with the proper eval string.

    The most effective optimisation would be to avoid (re)-evaling your snippets for every id.

    And the easiest way to do that would be to construct your snippets so that they can be eval'd into subroutines once each, and then you can call the appropriate subroutine for each ID instead.

    As you haven't posted your snippets, I can't offer a realistic example, but by way of giving you an idea, something like:

    $_ = sub{ $_ } for keys %{$agg_snippets};

    Would (assuming the snippets are correctly defined) turn the snippets into subroutines.

    You then just invoke the appropriate subroutine passing the data as arguments; and your code should run substantially faster.
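One way this idea could look in full is sketched below. It assumes each stored snippet is rewritten as the source text of an anonymous subroutine that takes a single call leg as a hashref. The `%compiled` hash, the `$leg` argument convention, and the two snippet names are hypothetical and not taken from the original script.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical snippet table, shaped like $agg_snippets, but with each
# snippet written as the source of an anonymous sub taking one leg hashref.
my $agg_snippets = {
    billed_minutes => {
        snippet => 'sub { my ($leg) = @_; ceil( $leg->{duration_milliseconds} / 1000 / 6 ) * 6 / 60 }',
    },
    is_503 => {
        snippet => 'sub { my ($leg) = @_; $leg->{release_code} == 503 ? 1 : 0 }',
    },
};

# Compile each snippet exactly once and keep the resulting coderef.
my %compiled;
for my $name ( keys %$agg_snippets ) {
    my $code = eval $agg_snippets->{$name}{snippet};
    die "Snippet '$name' failed to compile: $@" unless ref($code) eq 'CODE';
    $compiled{$name} = $code;
}

# Per-leg work is now an ordinary subroutine call, with no string eval.
my $sample_leg = { duration_milliseconds => 61_000, release_code => 503 };
my %result = map { $_ => $compiled{$_}->($sample_leg) } keys %compiled;

print "$_ = $result{$_}\n" for sort keys %result;
```

Checking the result of the string eval at load time also means a malformed snippet fails once, up front, rather than silently contributing `undef` on every row.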


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thanks so much!

      Here are two examples of the snippets:

      '{ceil(($call_ids->{$this_call_id}->{$this_index}->{\"duration_milliseconds\"} / 1000.) / 6)*6 / 60;}'

      and

      '{if($call_ids->{$this_call_id}->{$this_index}->{"release_code"} == 503){1;} else{0;}}'

      and this bit of ugliness used to create the grouping values in the hash

      '{my $route = $call_ids->{$this_call_id}->{$this_index}->{''route''};my $src_state = $call_ids->{$this_call_id}->{$this_index}->{''o_state''}; my $dest_state = $call_ids->{$this_call_id}->{$this_index}->{''t_state''}; my $juris_indicator = "f";if(!$src_state){$juris_indicator = "c";}elsif($src_state eq $dest_state){$juris_indicator = "a";}else {$juris_indicator = "b";};$route =~ /^[a|b|c](1[2-9][0-9]{2}[2-9][0-9]{2})/;$corrected_route = $juris_indicator .  $1;$corrected_route = $route if $route =~ /loop|none|lnp_error|no_juris_digits/;$corrected_route = $route if !$corrected_route;$ret_val = "$call_ids->{$this_call_id}->{$this_index}->{''day''},$call_ids->{$this_call_id}->{$this_index}->{''day_chunk''},$call_ids->{$this_call_id}->{$this_index}->{''o_trunk''},$call_ids->{$this_call_id}->{$this_index}->{''t_trunk''},$call_ids->{$this_call_id}->{$this_index}->{''route''},$corrected_route";}'

      I'm a little fuzzy on what you're describing as I've never attempted it before but are you creating a dynamic, anonymous function? If there is a name for what you're describing let me know and I'll do some research. I'm sure it would go a long way in clarifying that last block of code.

      Thanks!

        Your 3 snippets can be easily converted to (far more readable) subroutines thus:

            'sub {
                my( $call_ids, $call_id, $index ) = @_;
                ceil( ( $call_ids->{$call_id}->{$index}->{duration_milliseconds} / 1000. ) / 6 ) * 6 / 60;
            }'

            'sub {
                my( $call_ids, $call_id, $index ) = @_;
                if( $call_ids->{$call_id}->{$index}->{ release_code } == 503 ){
                    1;
                }
                else{
                    0;
                }
            }'

            'sub {
                my( $call_ids, $call_id, $index ) = @_;
                my $route      = $call_ids->{$call_id}{$index}{ route };
                my $src_state  = $call_ids->{$call_id}{$index}{ o_state };
                my $dest_state = $call_ids->{$call_id}{$index}{ t_state };
                my $juris_indicator = 'f';
                if( !$src_state ){
                    $juris_indicator = 'c';
                }
                elsif( $src_state eq $dest_state ){
                    $juris_indicator = 'a';
                }
                else {
                    $juris_indicator = 'b';
                };
                $route =~ /^[a|b|c](1[2-9][0-9]{2}[2-9][0-9]{2})/;
                $corrected_route = $juris_indicator . $1;
                $corrected_route = $route if $route =~ /loop|none|lnp_error|no_juris_digits/;
                $corrected_route = $route if !$corrected_route;
                join ',',
                    $call_ids->{$call_id}{$index}{ day },
                    $call_ids->{$call_id}{$index}{ day_chunk },
                    $call_ids->{$call_id}{$index}{ o_trunk },
                    $call_ids->{$call_id}{$index}{ t_trunk },
                    $call_ids->{$call_id}{$index}{ route },
                    $corrected_route;
            }'

        Once you have loaded them into your $agg_snippets hash, those text snippets can be replaced by instantiated subroutines in one pass using eval like this:

            $agg_snippets->{ $_ }{snippet} = eval $agg_snippets->{ $_ }{snippet} for keys %{ $agg_snippets };

        Then later, when you are processing the %$call_ids hash, you can invoke them like this:

            for my $this_call_id ( sort keys %$call_ids ) {
                $count++;
                next if !$this_call_id;
                my $route_attempts = 0;
                for my $this_index ( sort keys %{ $call_ids->{ $this_call_id } } ) {
                    $route_attempts++;
                    foreach $aggregate_name ( keys %{ $agg_snippets } ){
                        my $this_group_data = eval $grouping_data_eval;
                        $summary_data->{$this_group_data}{$aggregate_name} = 0
                            if !$summary_data->{$this_group_data}->{$aggregate_name};
                        $summary_data->{$this_group_data}{$aggregate_name} +=
                            $agg_snippets->{$aggregate_name}{snippet}->( $call_ids, $this_call_id, $this_index );
                        #   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    }
                }
            }

        That still leaves the eval of the $grouping_data_eval, which you give no information for, but it should also be possible to eliminate that eval by instantiating it into a subroutine once near the top of the code.
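A rough sketch of how the grouping eval might be handled the same way, assuming the text in `$grouping_data_eval` is rewritten as an anonymous sub that takes one leg hashref. The simplified key (five fields joined with commas), the `$group_key_for` name, and the sample data are all invented for illustration. Since the grouping snippet as posted appears to depend only on the call ID and leg index, the sketch also computes the key once per leg rather than once per aggregate.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the text held in $grouping_data_eval, written
# as the source of an anonymous sub that receives one call leg hashref.
my $grouping_data_eval = q{
    sub {
        my ($leg) = @_;
        join ',', map { defined $leg->{$_} ? $leg->{$_} : '' }
            qw(day day_chunk o_trunk t_trunk route);
    }
};

# Compile once, near the top of the program.
my $group_key_for = eval $grouping_data_eval
    or die "Grouping snippet failed to compile: $@";

# Invented sample data: one call with two legs on the same route.
my $call_ids = {
    abc123 => {
        1 => { day => '2013-02-06', day_chunk => 12, o_trunk => 'in1',
               t_trunk => 'out7', route => 'a12125551', release_code => 503 },
        2 => { day => '2013-02-06', day_chunk => 12, o_trunk => 'in1',
               t_trunk => 'out7', route => 'a12125551', release_code => 200 },
    },
};

my %summary;
for my $legs ( values %$call_ids ) {
    for my $leg ( values %$legs ) {
        my $key = $group_key_for->($leg);    # one call per leg, no string eval
        $summary{$key}{attempts}++;
        $summary{$key}{is_503} += $leg->{release_code} == 503 ? 1 : 0;
    }
}

for my $key ( sort keys %summary ) {
    print "$key: attempts=$summary{$key}{attempts} is_503=$summary{$key}{is_503}\n";
}
```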

        The overall effect should be to substantially speed up the processing. (BTW: Note how much clearer things are with: a) a little formatting; b) the omission of unnecessary punctuation; c) a little extra horizontal whitespace.)


      Your example does not seem quite clear since it seems that the value of the subroutine is lost each time. We make a subroutine but where do we put the coderef? Do you mean, say:
      $subs{$_} = sub { $_} for ... ?
Re: Optimization Help on Perl Hash Traversal (eval use)
by CountZero (Bishop) on Feb 06, 2013 at 22:21 UTC
    I wonder if there is any need to sort the keys of your outer and middle loop. If you are looking for any, even small, time-savings, here might be a quick bonus.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      No good reason except that I'm a bit robotic in my code sometimes. I'll pull out the "sort" commands.
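To illustrate why dropping the sorts should be safe here: the aggregates are sums, so the order in which legs are visited does not change the totals. A small sketch with invented data follows. It iterates with `values`, which works only because this toy loop does not need the keys; the original loop would keep `keys`, minus the `sort`.

```perl
use strict;
use warnings;

# Invented sample data: two calls, three legs in total.
my $call_ids = {
    b => { 2 => { duration => 10 }, 1 => { duration => 5 } },
    a => { 1 => { duration => 7 } },
};

my ( $total_duration, $leg_count ) = ( 0, 0 );

# No sort: hash order is arbitrary, but addition does not care about order.
for my $call ( values %$call_ids ) {
    for my $leg ( values %$call ) {
        $leg_count++;
        $total_duration += $leg->{duration};
    }
}

print "legs=$leg_count total=$total_duration\n";
```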

      thanks!

Re: Optimization Help on Perl Hash Traversal (eval use)
by CountZero (Bishop) on Feb 06, 2013 at 22:48 UTC
    Adding to BrowserUk's suggestion, you could store your snippets in a YAML file and have them automatically eval-ed upon loading the YAML file as part of your initialization.

    Here is an example:

        use Modern::Perl;
        use YAML qw/DumpFile LoadFile/;
        local $YAML::UseCode = 1;

        my $code = {
            snippet1 => sub {
                my $counter;
                for ( 1 .. 10 ) { say; $counter++ }
                say $counter;
            }
        };
        DumpFile( './snippet.yml', $code );

        my $snippets = LoadFile('./snippet.yml');  # <- Put this in the initialization part of your program
        &{ $snippets->{snippet1} };                # and run the snippet

    Here is the YAML-file:

        ---
        snippet1: !!perl/code |
          {
              use warnings;
              use strict;
              use feature 'say', 'state', 'switch', 'unicode_strings';
              my $counter;
              foreach $_ (1 .. 10) {
                  say $_;
                  ++$counter;
              }
              say $counter;
          }

    As you can see, that is very readable and easily editable. Probably easier than adding the code to your database.

    CountZero

