Optimizing I/O intensive subroutine

by hperange (Beadle)
on Oct 26, 2012 at 14:01 UTC (#1001087=perlquestion)
hperange has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a program which runs slightly slower than I would like it to run. I profiled it, and it spends almost 100% of the time in 1 subroutine which is not surprising, as this is the only subroutine which is reading text input. Below is the subroutine, and a demonstration of the size of the input.

# sum() and min() come from the core List::Util module
use List::Util qw(sum min);

# returns the $limit largest files from the flist file,
# full path to file in $name
sub process_flist {
    my ($name, $limit) = @_;
    my ($nlines, $total, @lines);

    open(my $fh, '<', $name) or die("Error opening file `$name': $!\n");
    while (<$fh>) {
        my @f = split / /;
        # skip files that have a space or other whitespace
        # characters in their name
        next if @f > 10;
        # store file size in bytes and the full path to the file
        push @lines, $f[4] . '/' . $f[1];
    }
    $nlines = scalar @lines;
    {
        # disable warnings because the array to be sorted has
        # the format "12345/path/to/file"; Perl would complain
        # this is not a number, but the <=> comparison operator
        # handles such input properly. This is needed so the
        # files can be sorted with a single pass through the
        # flist file.
        no warnings 'numeric';
        $total = sum(@lines);
        $limit = min($limit, $nlines);
        @lines = (sort { $b <=> $a } @lines)[0 .. ($limit - 1)];
    }
    # returns the number of files, their cumulative size,
    # and the $limit largest files
    return ($nlines, $total, @lines);
}
The routine opens an input file and returns the n largest files listed in it. Input size demonstration:
find /tgt -type f -name input | xargs wc -l
  197898 .../input
  213267 .../input
  240331 .../input
  194063 .../input
  191862 .../input
  179495 .../input
  218041 .../input
 1434957 total
The format of a single input record is:
51 opt/src/t.tar 100444 1247464676 290119680 283320 NA 1 0xbe2d 0x40000006
The 2nd field is the name of a file, and the 5th is the file size in bytes.

The program runs for around 40 seconds with this input (7 input files), but with 100 input files of similar size it runs for around 25 minutes. I have 2 pictures of the profiler output: http://imgur.com/FUOqb,iWF5z#1

Can the runtime be reduced? I am not proficient at interpreting the profiler's output, so I can't really tell whether the routine can be improved, or whether it is simply so I/O intensive that not much can be done. Each input file represents a list of files backed up from a particular client, and they are not ordered in any meaningful way.
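
For reference, a hypothetical call site (the path here is made up for illustration):

my ($count, $bytes, @largest) = process_flist('/tgt/client1/input', 20);
printf "%d files, %d bytes in total\n", $count, $bytes;
print "$_\n" for @largest;    # entries look like "290119680/opt/src/t.tar"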

EDIT:

The code below is apparently faster (at the cost of some readability) than the split/if approach:

...
while (<$fh>) {
    next unless m/^[0-9]+ ([^ ]+) [0-9]+ [0-9]+ ([0-9]+)/;
    push @lines, $2 . '/' . $1;
}
...
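
To compare the two variants yourself, here is a quick sketch using the core Benchmark module (the sample record is the one shown above):

use Benchmark qw(cmpthese);

my $line = '51 opt/src/t.tar 100444 1247464676 290119680 283320 NA 1 0xbe2d 0x40000006';

cmpthese(-2, {
    split_if => sub {
        my @f = split / /, $line;
        return if @f > 10;
        my $entry = $f[4] . '/' . $f[1];
    },
    regex    => sub {
        return unless $line =~ m/^[0-9]+ ([^ ]+) [0-9]+ [0-9]+ ([0-9]+)/;
        my $entry = $2 . '/' . $1;
    },
});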

Re: Optimizing I/O intensive subroutine
by tobyink (Abbot) on Oct 26, 2012 at 14:49 UTC

    I'm clutching at straws here, but if a lot of the filenames contain multiple whitespace characters, then you might get a very slight improvement doing this:

    my @f = split / /, $_, 11;
    # skip files that have a space or other whitespace
    # characters in their name
    next if @f > 10;

    The third parameter to split tells it not to keep looking for split points once it's found 10.
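
    A quick illustration, using a made-up record whose filename contains spaces:

    my $line = '51 opt/my long file.tar 100444 1247464676 290119680 283320 NA 1 0xbe2d 0x40000006';

    my @all = split / /, $line;        # 12 fields: scans the whole line
    my @cap = split / /, $line, 11;    # 11 fields: stops after 10 split points,
                                       # leaving the tail unsplit in $cap[10]
    # either way @f > 10 holds, so the "skip names with spaces" test
    # behaves identically -- the limit only saves scanning work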

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: Optimizing I/O intensive subroutine
by chromatic (Archbishop) on Oct 26, 2012 at 16:16 UTC

    What happens if you return an array reference instead of an array at the end of this function? (Depending on how large $limit is, you might cause Perl to copy a lot of data.)
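
    A sketch of that change (the caller-side names here are mine):

    # instead of: return ($nlines, $total, @lines);
    return ($nlines, $total, \@lines);    # copies one reference, not $limit strings

    # caller side:
    my ($nlines, $total, $lines) = process_flist($name, $limit);
    print "$_\n" for @$lines;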

    What happens if, instead of reading the whole file into an array and then sorting it, you keep track of the largest n lines you've seen and read the file line by line?
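
    A minimal sketch of that second approach, reusing the OP's "size/path" encoding (it assumes $fh, $limit, $nlines, and $total are in scope as in the original sub):

    no warnings 'numeric';    # entries look like "290119680/opt/src/t.tar"
    my @top;                  # holds at most $limit entries, sorted descending
    while (<$fh>) {
        next unless m/^[0-9]+ ([^ ]+) [0-9]+ [0-9]+ ([0-9]+)/;
        $nlines++;
        $total += $2;
        my $entry = $2 . '/' . $1;
        # skip the sort entirely unless this entry can displace the current minimum
        next if @top == $limit && $entry <= $top[-1];
        @top = sort { $b <=> $a } @top, $entry;
        pop @top if @top > $limit;
    }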


    Improve your skills with Modern Perl: the free book.

      The latter suggestion is more promising.

      Looking at the timings, the bulk of the time is in the split (20.4 s) and the push @lines (16.2 s).

        Ah, I didn't see there was a second image. Good catch.

Re: Optimizing I/O intensive subroutine
by BrowserUk (Pope) on Oct 26, 2012 at 16:43 UTC

    Running your routine on 7 files of 200,000 lines apiece (with limit = 1000) takes just 10.5 seconds, and on 100 files of 200,000 lines it takes 145 seconds on my machine.

    Showing (as expected) that the runtime is pretty linear with respect to the number of files.

    Which makes your figures (of 40 s for 7 and 1500 s for 100) suggest that the majority of the time is being spent outside of this routine, doing something non-linear.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      Thanks for the hints, I will try the split suggestion. The default value of $limit is 20.

      I suppose you were testing on a desktop machine; pushing the boundaries of my knowledge, is it possible that I get a very different result because I am mainly testing on servers (HP-UX), where there is a possibility that the heavy use of I/O is slowing down the process?

      I am sure there is nothing else causing the difference, because apart from this routine the program only contains code to pretty-print its results, plus a few sanity checks (3 stat syscalls per file to be read).

      Unfortunately I cannot provide a trace from the machines where it takes longer to run, as they are "production" and I am not allowed to play with them :)

        I suppose you were testing on a desktop machine; pushing the boundaries of my knowledge, is it possible that I get a very different result because I am mainly testing on servers (HP-UX), where there is a possibility that the heavy use of I/O is slowing down the process?

        Looking at the second profiler image, it isn't IO that is constraining your process, but rather memory allocation. Which is rather difficult to understand, given that processing 100 x 200,000-line files only requires 36 MB and 147 seconds on my machine.

        The only way I can see that happening is if the server in question is so overloaded memory wise that it is permanently swapping, so that every split to an array and push to an array requires the machine to wait for substantial IO. And given that the profiling shows that the actual file IO done by your program isn't taking long at all, then the server must be using a ludicrously heavily over-subscribed separate swap partition.

        Under those circumstances, there is nothing you can do to your script that will improve its performance. The only way will be to get the processing moved to a server with more (SOME!) free resource.

        Were I you, I would run the program on my workstation (or development server), profile it there, and then take the two profiles to someone in authority to show them how badly the server's resources are being managed. Demonstrate not only that your process is being dramatically choked by being run on a server that is totally inadequately provisioned for the tasks given to it, but also that all the other processes running on that same server will be similarly choked and hampered by the same problems.

        Just for completeness, here is my test setup:
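
        (The script itself isn't reproduced here; the sketch below is a reconstruction of a comparable harness. The generated file names, the record generator, and the -N/-L switch handling are my assumptions.)

        #! perl -sw
        use strict;
        use Time::HiRes qw(time);

        # paste the OP's process_flist() here, plus: use List::Util qw(sum min);

        our ($N, $L);                 # -N=number of files, -L=limit (via perl's -s switch)
        $N ||= 100;  $L ||= 20;

        # generate -N flist files of 200,000 records each (reused between runs)
        for my $i (1 .. $N) {
            next if -e "flist$i.txt";
            open my $out, '>', "flist$i.txt" or die $!;
            printf {$out} "51 opt/src/t%d.tar 100444 1247464676 %d 283320 NA 1 0xbe2d 0x40000006\n",
                $_, int rand 2**32
                for 1 .. 200_000;
            close $out;
        }

        my $start = time;
        process_flist("flist$_.txt", $L) for 1 .. $N;
        printf "%.3f\n", time - $start;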

        And a run on the 100 files with limit set to 20:

        C:\test>junk47 -N=100 -L=20
        147.615

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        RIP Neil Armstrong
