PerlMonks  

Re: Optimizing I/O intensive subroutine

by BrowserUk (Pope)
on Oct 26, 2012 at 16:43 UTC (#1001125)


in reply to Optimizing I/O intensive subroutine

Running your routine on 7 files of 200,000 lines apiece (with limit = 1000) takes just 10.5 seconds on my machine; on 100 files of 200,000 lines it takes 145 seconds.

Showing (as expected) that the runtime is pretty much linear in the number of files: roughly 1.5 seconds per file either way.

Which makes your figures (40s for 7 and 1500s for 100 -- roughly 5.7 and 15 seconds per file) suggest that the majority of the time is being spent outside of this routine, doing something non-linear.
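One quick way to localize that non-linear cost is to time each per-file call separately -- a sketch using Time::HiRes (the same module my test script uses); `process_flist` and the file list here are stand-ins for your own:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw[ time ];

# Wrap any per-file call and report its elapsed wall-clock time on STDERR.
# If the per-file times grow as more files are processed, the slowdown is
# cumulative (memory growth, swapping) rather than per-file I/O.
sub timed {
    my( $label, $code ) = @_;
    my $start = time;
    my @ret = $code->();
    printf STDERR "%s: %.3fs\n", $label, time() - $start;
    return @ret;
}

# Hypothetical usage, with process_flist and @files from your program:
# timed( $_, sub { process_flist( $_, 20 ) } ) for @files;
```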


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


Re^2: Optimizing I/O intensive subroutine
by hperange (Beadle) on Oct 26, 2012 at 22:31 UTC

    Thanks for the hints; I will try the split suggestion. The default value of $limit is 20.
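For reference, the split suggestion presumably means passing split's third (LIMIT) argument so Perl stops splitting once the needed fields exist -- a sketch against a line in the shape of the sample data below; the exact limit is an assumption, and the routine's `next if @f > 10` sanity check would then always pass, so it would need reworking too:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A line in the same shape as the sample data (10 space-separated fields).
my $line = '51 opt/src/1.tar 100444 1247464676 9476013183 283320 NA 1 0xbe2d 0x40000006';

# An unlimited split builds a scalar for every field:
my @all = split / /, $line;        # 10 elements

# With a LIMIT, split stops early; the routine only uses $f[1] and $f[4],
# so 6 fields (0..4 plus the unsplit tail) are enough:
my @f = split / /, $line, 6;       # 6 elements
print $f[4] . '/' . $f[1], "\n";   # 9476013183/opt/src/1.tar
```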

    I suppose you were testing on a desktop machine. Pushing the boundaries of my knowledge: is it possible that I get a very different result because I am mainly testing on servers (HP-UX), where heavy I/O load could be slowing down the process?

    I am sure nothing else causes the difference: apart from this routine, the program only contains code to pretty-print the routine's results, plus a few sanity checks (3 stat() syscalls per file to be read).

    Unfortunately I cannot provide a trace from the machines where it takes longer to run, as they are "production" and I am not allowed to play with them :)

      I suppose you were testing on a desktop machine. Pushing the boundaries of my knowledge: is it possible that I get a very different result because I am mainly testing on servers (HP-UX), where heavy I/O load could be slowing down the process?

      Looking at the second profiler image, it isn't I/O that is constraining your process, but rather memory allocation. Which is rather difficult to understand, given that processing 100 files of 200,000 lines requires only 36MB and 147 seconds on my machine.

      The only way I can see that happening is if the server in question is so overloaded memory-wise that it is permanently swapping, so that every split into an array and every push onto an array makes the machine wait for substantial I/O. And given that the profiling shows the actual file I/O done by your program isn't taking long at all, the server must be using a ludicrously over-subscribed separate swap partition.

      Under those circumstances, there is nothing you can do to your script that will improve its performance. The only fix is to get the processing moved to a server with more (SOME!) free resources.

      Were I you, I would run the program on my workstation (or a development server), profile it there, and then take the two profiles to someone in authority to show how badly the server's resources are being managed. Demonstrate not only that your process is being dramatically choked by running on a server totally inadequately provisioned for the tasks given to it, but that every other process on that server is being similarly choked and hampered by the same problem.

      Just for completeness, here is my test setup:

      C:\test>dir junk47.dat.*
      26/10/2012  17:04    16,266,705 junk47.dat.1
      26/10/2012  17:04    16,266,705 junk47.dat.10
      26/10/2012  17:04    16,266,705 junk47.dat.100
      26/10/2012  17:04    16,266,705 junk47.dat.11
      26/10/2012  17:04    16,266,705 junk47.dat.12
      ...
      26/10/2012  17:04    16,266,705 junk47.dat.95
      26/10/2012  17:04    16,266,705 junk47.dat.96
      26/10/2012  17:04    16,266,705 junk47.dat.97
      26/10/2012  17:04    16,266,705 junk47.dat.98
      26/10/2012  17:04    16,266,705 junk47.dat.99
              101 File(s)  1,642,937,205 bytes

      C:\test>head junk47.dat.1
      51 opt/src/1.tar 100444 1247464676 9476013183 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/2.tar 100444 1247464676 9802856445 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/3.tar 100444 1247464676 1116638183 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/4.tar 100444 1247464676 7417297363 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/5.tar 100444 1247464676 1416931152 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/6.tar 100444 1247464676 4827880859 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/7.tar 100444 1247464676 1016540527 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/8.tar 100444 1247464676 232543945 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/9.tar 100444 1247464676 3099975585 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/10.tar 100444 1247464676 6366271972 283320 NA 1 0xbe2d 0x40000006

      C:\test>tail junk47.dat.1
      51 opt/src/199991.tar 100444 1247464676 5569458007 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/199992.tar 100444 1247464676 6560974121 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/199993.tar 100444 1247464676 9388122558 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/199994.tar 100444 1247464676 5976562500 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/199995.tar 100444 1247464676 8576354980 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/199996.tar 100444 1247464676 5962219238 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/199997.tar 100444 1247464676 6407470703 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/199998.tar 100444 1247464676 7785034179 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/199999.tar 100444 1247464676 8945312500 283320 NA 1 0xbe2d 0x40000006
      51 opt/src/200000.tar 100444 1247464676 4286804199 283320 NA 1 0xbe2d 0x40000006

      C:\test>wc -l junk47.dat.1
      200000 junk47.dat.1

      C:\test>type junk47.pl
      #! perl -slw
      use strict;
      use Time::HiRes qw[ time ];
      use List::Util qw[ sum min ];

      sub process_flist {
          my( $name, $limit ) = @_;
          my( $nlines, $total, @lines, @size );
          open( my $fh, '<', $name ) or die( "Error opening file `$name': $!\n" );
          while( <$fh> ) {
              my @f = split / /;
              next if @f > 10;
              push @lines, $f[4] . '/' . $f[1];
          }
          $nlines = scalar @lines;
          {
              no warnings 'numeric';
              $total = sum( @lines );
              $limit = min( $limit, $nlines );
              @lines = ( sort { $b <=> $a } @lines )[ 0 .. ( $limit - 1 ) ];
          }
          return ( $nlines, $total, @lines );
      }

      our $L //= 1000;
      our $N //= 7;

      my $start = time;
      my( $n, $t, @l );
      ( $n, $t, @l ) = process_flist( $_, $L ) for map "junk47.dat.$_", 1 .. $N;
      printf "%.3f\n", time() - $start;

      And a run on the 100 files with limit set to 20:

      C:\test>junk47 -N=100 -L=20
      147.615
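For completeness on the allocation angle: if memory pressure really were the issue, a variant of the routine that never holds more than a couple of $limit entries would keep the working set tiny. This is an illustrative sketch, not the posted code; it assumes the same input format, and that $total is meant to be the sum of field 4:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw[ min ];

# Sketch: keep only the top-$limit "size/name" entries while streaming,
# instead of storing every line and sorting once at the end.
sub process_flist_topn {
    my( $name, $limit ) = @_;
    no warnings 'numeric';    # "size/name" strings compare by their numeric prefix
    my( $nlines, $total, @top ) = ( 0, 0 );
    open my $fh, '<', $name or die "Error opening file `$name': $!\n";
    while( <$fh> ) {
        my @f = split / /;
        next if @f > 10;
        ++$nlines;
        $total += $f[4];
        push @top, $f[4] . '/' . $f[1];
        if( @top > 2 * $limit ) {    # prune occasionally, not on every line
            @top = ( sort { $b <=> $a } @top )[ 0 .. $limit - 1 ];
        }
    }
    close $fh;
    @top = ( sort { $b <=> $a } @top )[ 0 .. min( $limit, scalar @top ) - 1 ]
        if @top;
    return ( $nlines, $total, @top );
}
```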

