Re^2: Optimize my code with Hashes

by sukhicool (Initiate)
on Aug 27, 2008 at 11:05 UTC


in reply to Re: Optimize my code with Hashes
in thread Optimize my code with Hashes

Total code took: 68258 secs.

The LDAP update/add part alone took:

the code took:67358 wallclock secs (29935.47 usr 60.38 sys + 0.00 cusr 0.00 csys = 29995.85 CPU) in considering each entry from PeopleFirst extract ...
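
For reference, timing lines in that format come from the Benchmark module; a minimal sketch of how such a line is produced (the work in the middle is only a placeholder):

    use Benchmark;

    my $t0 = Benchmark->new;
    # ... the loop over the PeopleFirst extract runs here ...
    my $t1 = Benchmark->new;

    print "the code took:", timestr(timediff($t1, $t0)), "\n";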

1. Do older versions of Perl have bad algorithms for handling hashes?
2. Will it help if we use arrays instead of hashes?

If upgrading the Perl version will help, I will try to convince management; it would help if you could point me to a URL that says so.


Re^3: Optimize my code with Hashes
by jethro (Monsignor) on Aug 27, 2008 at 11:22 UTC

    You are asking the wrong questions. You already blame the hashes without really knowing what is actually eating all this time. Everyone above told you to do some profiling first, and that is a really good idea. Often it is the algorithm used that kills the time.

    For example: If your program often makes a copy of your hash, that will really thrash your memory and cost time. But you won't get it faster by changing to an array in that case.
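
    For instance (a minimal illustration, not taken from the original program):

        my %copy = %pfinfo;    # copies every key and value: slow and memory-hungry for a big hash
        my $ref  = \%pfinfo;   # a reference instead: the data is shared, nothing is copied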

    Also, you didn't show us what the code does with the hash you blame. How should we be able to tell you whether an array is better?

    So either post some code here or do some profiling or both.
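
    For what it's worth, a typical profiling run with Devel::DProf looks like this, assuming your script is called sync.pl (the name is just a placeholder):

        perl -d:DProf sync.pl     # writes profile data to tmon.out
        dprofpp tmon.out          # shows which subroutines ate the time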

      Sorry if anyone is hurt by my wrong questions.

      Let me show some code. The following hash is generated by a subroutine:

          %pfinfo = (
              '069836' => '069836|Henion,David|A|Active|010474|HAWKEY,Michael G|SC3798|...',
              '025939' => '025939|Picard, Stephane|A|Active|010101|LEPINE,Thibault|SG8778|...',
              ...
          );

      And this is the loop that consumes it:

          my $timee0 = new Benchmark;
          foreach my $en (keys %pfinfo) {
              logAndSkip(\*LOG, "Considering the entry from PeopleFirst extract: $en...") if ($log);

              # Get the PF information
              my @pfi = ();                      # reset the array
              @pfi = split /\|/, $pfinfo{$en};

              # If the employee number does not exist in ED, it looks like a creation
              if (!exists $ed_en{$en}) {
                  createEDentry(\*LOG, \@pfi, \%used_dn, \%en2dn);
              }
              # Looks like an ED entry update
              else {
                  updateEDentry(\*LOG, $en2dn{$en}, \@pfi, \%en2dn);
              }
          } # End foreach

        Sorry if my answer sounded too harsh; I'm not a native English speaker myself.

        The part of the code you show seems to be quite efficient, and from what I can see the author of this code knows how to program in Perl. I even tried out the program (the bit you posted) on my machine, and it needed just over 1 second (3 seconds on a Sun Blade 100) to init the hash with 50000 bogus entries and run the loop over it. Naturally with empty subroutines, so no surprise really.

        What you don't show is what createEDentry and updateEDentry do. Probably that is where most of the work is done.

        If you want to try it yourself, here is my test code. Run it on your machine; if it takes less than 20 seconds, the problem is not in the code you have shown us.

        my %used_dn = ();
        my %en2dn   = ();
        my %ed_en;
        my %pfinfo  = ();

        # Build 50000 bogus entries
        my $i = 50000;
        while ($i > 0) {
            $pfinfo{$i--} = "$i|Henion,David|A|Active|010474|HAWKEY,Michael G|SC3789";
        }

        my $log = 0;
        $i = 0;

        foreach my $en (keys %pfinfo) {
            logAndSkip(\*LOG, "Considering the entry from PeopleFirst extract: $en...") if ($log);

            # Get the PF information
            my @pfi = ();                      # reset the array
            @pfi = split /\|/, $pfinfo{$en};

            # If the employee number does not exist in ED, it looks like a creation
            if (!exists $ed_en{$en}) {
                createEDentry(\*LOG, \@pfi, \%used_dn, \%en2dn);
            }
            # Looks like an ED entry update
            else {
                updateEDentry(\*LOG, $en2dn{$en}, \@pfi, \%en2dn);
            }
        } # End foreach

        # Empty stubs, so only the loop overhead itself is measured
        sub logAndSkip {}
        sub createEDentry { my ($LOG, $pfi, $used, $en) = @_; }
        sub updateEDentry { my ($LOG, $dn, $pfi, $en) = @_; }
Re^3: Optimize my code with Hashes
by jbert (Priest) on Aug 27, 2008 at 12:00 UTC
    the code took:67358 wallclock secs (29935.47 usr 60.38 sys + 0.00 cusr 0.00 csys = 29995.85 CPU) in considering each entry from PeopleFirst extract ..

    That's interesting. So 29935/67358 = 44% of your time was spent on user CPU. That is significant, and you might want to look into profiling the app's CPU usage (using Devel::Profile or Devel::DProf).

    Of course, it also means that 56% of your time is spent doing other things. If that is network latency, then you'd do well to look at bulk import/export instead.
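
    If bulk loading is an option, one hypothetical approach is to write the changes out as LDIF and feed that to the directory server's own import tool; a minimal sketch with Net::LDAP::LDIF (the DN and attributes here are made up):

        use Net::LDAP::Entry;
        use Net::LDAP::LDIF;

        my $ldif = Net::LDAP::LDIF->new('bulk.ldif', 'w', onerror => 'die');

        my $entry = Net::LDAP::Entry->new;
        $entry->dn('uid=069836,ou=people,dc=example,dc=com');   # made-up DN
        $entry->add(
            objectClass => [qw(top person inetOrgPerson)],
            sn          => 'Henion',
            cn          => 'Henion, David',
        );
        $ldif->write_entry($entry);
        $ldif->done;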

    Your timestamp logging appeared to show ~6 secs for one request; is that right? That can't be representative, since, as noted elsewhere in this thread, you'd never manage 50k updates in 18 hours if each took 6s (50,000 x 6s = 300,000s, over 83 hours).

    Lastly, if you do profile the app, it will probably pay to produce a cut-down version which runs more quickly. This is useful because profilers slow things down and generate large amounts of data; they'll probably break on such a big run.

    Also, having a more quickly repeatable test case (e.g. ~10 mins) will greatly accelerate your ability to test ideas for code and algorithm changes.
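
    For example, a hypothetical cut-down that only processes a fixed slice of the keys:

        my @sample = (keys %pfinfo)[0 .. 4_999];   # roughly 10% of a 50k extract
        foreach my $en (@sample) {
            # ... same body as the full loop ...
        }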

    However, the hard part is knowing if your cut-down test case has the same performance profile as your main job run.

    Another thought: if the 'missing' 56% of your time is overnight, you might be sharing a network with a backup job, or something else which saturates the network and makes your response times very slow.
