Re: Optimizing I/O intensive subroutine

by BrowserUk (Pope)
on Oct 26, 2012 at 16:43 UTC ( #1001125=note )


in reply to Optimizing I/O intensive subroutine

Running your routine on 7 files of 200,000 lines apiece (with limit = 1000) takes just 10.5 seconds on my machine; on 100 files of 200,000 lines it takes 145 seconds.

That shows (as expected) that the runtime is pretty much linear with respect to the number of files.

Your figures (40s for 7 files and 1500s for 100) therefore suggest that the majority of the time is being spent outside of this routine, doing something non-linear.
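The arithmetic behind that can be sketched in a few lines of Perl. This is only an illustration using the timings quoted in this thread; the `per_file` helper and the run labels are made up for the example and are not part of the original routine.

```perl
use strict;
use warnings;

# Per-file cost in seconds: total runtime divided by the number of files.
# (Hypothetical helper, introduced only for this illustration.)
sub per_file {
    my ( $seconds, $files ) = @_;
    return $seconds / $files;
}

# Timings quoted in this thread: [ label, number of files, total seconds ].
my @runs = (
    [ 'my machine',   7,   10.5 ],
    [ 'my machine',   100, 145 ],
    [ 'your figures', 7,   40 ],
    [ 'your figures', 100, 1500 ],
);

for my $run (@runs) {
    my ( $label, $files, $seconds ) = @$run;
    printf "%-12s %3d files: %6.2f s/file\n",
        $label, $files, per_file( $seconds, $files );
}
```

If the routine scales linearly, the per-file cost should stay roughly constant as the file count grows. It does for my two runs (about 1.5 s per file in both), whereas for your figures it rises from under 6 s per file to 15 s per file.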


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


Re^2: Optimizing I/O intensive subroutine
by hperange (Beadle) on Oct 26, 2012 at 22:31 UTC

    Thanks for the hints, I will try the split suggestion. The default value of $limit is 20.

    I suppose you were testing on a desktop machine; pushing the boundaries of my knowledge, is it possible that I get a very different result because I am mainly testing on servers (HP-UX), where there is a possibility that the heavy use of I/O is slowing down the process ?

    I am sure there is nothing else causing the difference, because apart from this routine the program only contains code to pretty-print the results of this routine, and a few sanity checks (3 stat syscalls per file to be read).

    Unfortunately I cannot provide a trace from the machines where it takes longer to run, as they are "production" and I am not allowed to play with them :)

      I suppose you were testing on a desktop machine; pushing the boundaries of my knowledge, is it possible that I get a very different result because I am mainly testing on servers (HP-UX), where there is a possibility that the heavy use of I/O is slowing down the process ?

      Looking at the second profiler image, it isn't IO that is constraining your process, but rather memory allocation. That is rather difficult to understand, given that processing 100 files of 200,000 lines requires only 36MB and 147 seconds on my machine.

      The only way I can see that happening is if the server in question is so overloaded memory-wise that it is permanently swapping, so that every split to an array and every push to an array requires the machine to wait for substantial IO. And given that the profiling shows that the actual file IO done by your program isn't taking long at all, the server must be using a ludicrously over-subscribed separate swap partition.

      Under those circumstances, there is nothing you can do to your script that will improve its performance. The only way will be to get the processing moved to a server with more (SOME!) free resources.

      Were I you, I would run the program on my workstation (or development server) and profile it there, and then take the two profiles to someone in authority and show them how badly the server resources are being managed. Demonstrate not only that your process is being dramatically choked by running on a server that is totally inadequately provisioned for the tasks being given to it, but also that all the other processes running on that same server will be similarly choked and hampered by the same problems.
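      If a full profile on both machines is not practical, a rough comparison can be had by timing the split-and-push work entirely in memory, with no file IO at all. The following is only a sketch under assumptions: the `time_split_push` name, the synthetic whitespace-separated input line, and the field count are invented for the illustration and are not taken from the original routine.

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Hypothetical stand-in for the real routine: split each line on whitespace
# and push the resulting fields onto an array. No files are read, so any
# slowdown measured here cannot be blamed on file IO.
sub time_split_push {
    my ( $nlines, $nfields ) = @_;
    my $line = join ' ', 1 .. $nfields;    # synthetic input line
    my @rows;
    my $start = time;
    for ( 1 .. $nlines ) {
        my @fields = split ' ', $line;     # allocate a fresh array per line
        push @rows, \@fields;              # keep it, so memory use grows
    }
    my $elapsed = time - $start;
    return ( $elapsed, scalar @rows );
}

my ( $elapsed, $rows ) = time_split_push( 200_000, 10 );
printf "%d rows split and pushed in %.3f s\n", $rows, $elapsed;
```

      Running the same script on the workstation and on the server gives two numbers to compare. A large gap for work that touches only memory would be consistent with the swapping explanation above, though it would not prove it on its own.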

      Just for completeness, here is my test setup:

      And a run on the 100 files with limit set to 20:

      C:\test>junk47 -N=100 -L=20
      147.615

