Performance Question

by Jarinn (Acolyte)
on May 08, 2002 at 13:27 UTC

Jarinn has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to manipulate a VERY large file (81GB). The program simply reads in a line from the file, makes any needed changes, and prints the line back out to another file on the same disk. The machine I am running the program on seems adequate (gobs of swap memory remain untouched, IOWait is around 9%, CPU Idle around 70%, and load averages in the 1.25 region). However, monitoring the output, it looks like it will take 166 hours to run. That is a deal breaker. I have been told Perl reads in 8K at a time by default. Is there a way to change that? Or does anyone have any other suggestions, except to quit my pathetic attempts at programming and go back to Everquest... Thanks in advance.

Replies are listed 'Best First'.
Re: Performance Question
by tachyon (Chancellor) on May 08, 2002 at 16:06 UTC

    You can get data in whatever chunk size you want using read(). Here is an example that takes 24 seconds to process a 100MB file on my PIII with slow disks. That gives a throughput of 4MB per second, which would process your 81GB file in under 6 hours. Empirically, the optimal chunk size is around 1MB, with modest benefits from increasing it to 2, 4 and 8MB. With smaller chunks you can hear the heads flipping from one file area to the other - bigger chunks allow the heads to chill. At 64kB the run time was 57 seconds and the disks screamed. At 4MB the runtime was 23 seconds.

    If possible I would suggest reading from one disk and writing to a completely separate one (I did the testing on a single partition of a single disk). You could also roughly double the speed by forking a kid to do the disk write while the parent reads and processes more data (a rough sketch of that idea follows the code below). This will only help if you are reading from one disk and writing to another.

    #!/usr/bin/perl -w
    use strict;

    my $chunk = 2**20;    # try 1MB to start, but it may be faster to go bigger/smaller
    my $infile  = 'c:/test.txt';
    my $outfile = 'c:/out.txt';

    open IN,  $infile     or die "Can't open $infile $!\n";
    open OUT, ">$outfile" or die "Can't open $outfile $!\n";

    my $buffer;
    my $partial_line = '';
    my $start = time;

    while (read(IN, $buffer, $chunk)) {
        # we should only process full lines, so trim off the partial line
        # that we inevitably get at the end of our read and save it for later
        my $tail = ($buffer =~ s/^(.*\n)([^\n]+)\z/$1/s) ? $2 : '';
        # add the last partial line to the front of the buffer
        $buffer = $partial_line . $buffer;
        # save the current partial line for the next loop so we can add it back on
        $partial_line = $tail;
        # make changes
        $buffer =~ s/this/that/g;
        print OUT $buffer;
    }
    # the file may end mid-line, so don't drop the final partial line
    $partial_line =~ s/this/that/g;
    print OUT $partial_line;

    print "Took ", time - $start, " seconds\n";
    close IN;
    close OUT;
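    (A hypothetical sketch of the fork-a-writer idea mentioned above, not tachyon's code: the parent reads and transforms chunks while a child drains a pipe into the output file, so reads and writes can overlap. The file names and the s/this/that/ change are placeholders.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $chunk   = 2**20;
    my $infile  = 'in.txt';        # placeholder paths
    my $outfile = 'out.txt';

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # child: just copy the pipe to the output file
        close $writer;
        open my $out, '>', $outfile or die "Can't open $outfile: $!";
        my $buf;
        print {$out} $buf while read($reader, $buf, $chunk);
        close $out;
        exit 0;
    }

    # parent: read, fix up line boundaries, transform, hand the result to the child
    close $reader;
    open my $in, '<', $infile or die "Can't open $infile: $!";
    my ($buffer, $partial) = ('', '');
    while (read($in, $buffer, $chunk)) {
        $buffer  = $partial . $buffer;
        $partial = ($buffer =~ s/([^\n]*)\z//s) ? $1 : '';
        $buffer  =~ s/this/that/g;          # placeholder transformation
        print {$writer} $buffer;
    }
    $partial =~ s/this/that/g;
    print {$writer} $partial;               # trailing partial line, if any
    close $writer;
    waitpid($pid, 0);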

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Performance Question
by Elian (Parson) on May 08, 2002 at 13:54 UTC
    What's going on may depend on the OS and filesystem, but creating an 81G file by printing out line by line is going to be a killer. You're going to be constantly extending it piece by piece, probably ending up with a badly fragmented file. And even with the disk likely being a RAID volume of some sort (As it's got to be at least a 170G volume there) you're going to find the reads and writes competing for the disk heads a lot.

    This is as much, or more, a system administration issue as a perl one. Find your sysadmin, get him/her to fill you in on the characteristics of the system, and work with that. You may find that pre-extending the file is a good thing. There may be system parameters that can be set that govern the size of the new chunk your OS allocates when it extends a file. There may be cache settings that can be tweaked on the disk controller or individual files. This is obviously an adequately beefy machine, so you just need to do some performance tuning, and probably at the OS level.

    Hopefully you're not accessing the file via NFS or some other networked file system. Those will tend to kill your performance dead.

Re: Performance Question
by talexb (Chancellor) on May 08, 2002 at 13:49 UTC
    That's a tough question to answer without knowing a few more variables.

    • Is it OK if you run the machine flat out, or do you have to share processing horsepower with other users (human or nobodies)?
    • Is this a one-off, or are you going to have to do this weekly/monthly?

    Jumping ahead to a solution, I would probably slice the monster file into pieces (there are lots of ways to do that) and then process a couple of pieces in parallel. The way I would test that would be to take a 1G slice of the file, pretend that's the big file, and try various piece counts.

    Failing that, write a program in C (something I've done many times) to suck the file in, 64K chunks at a time (or whatever size chunks your system can manage), then process the lines individually. The processed lines go into a 64K buffer, and when it gets full, you write it to the output file. Piece of cake. :) And you should get great performance doing it in C, better than Perl.

    --t. alex

    "Nyahhh (munch, munch) What's up, Doc?" --Bugs Bunny

      Would you really get a sizable performance increase by using C instead of perl to manipulate/print text? (honestly wondering)
        Depends on how good a C programmer you are. If you're reasonably good, yes. Probably a factor of two to four if the transforms are simple. More, possibly, depending on the IO subsystem. (It doesn't matter if your C program could run 50 times faster than the perl one if you've already maxed out your IO channel going twice as fast. You'll just twiddle your thumbs more)

        On the other hand it may take 5-10 times as long to write and debug the program, and maintaining/debugging it will be a major pain relative to perl.

        A valid question. My guess is yes, but that assumes tuning the custom C program for the system it runs on. It also depends on whether this is a one-time job or a weekly/monthly thing, as my initial post said. For a one-time thing, definitely go Perl. For a weekly job, it's worth the investment to write a really well-tuned, optimized C program.

        --t. alex

        "Nyahhh (munch, munch) What's up, Doc?" --Bugs Bunny

Re: Performance Question
by roboslug (Sexton) on May 08, 2002 at 16:38 UTC
    Ok, confirmed sysread. Using sysread and all the goo to make it use lines takes approx half the time of reading from <>.

    Been up all night working on stuff and now things are swirly (getting too old for this), so I know this code is ugly and all...but hey...

    #! /usr/bin/perl
    # Used a 78MB file for test...
    use Benchmark;

    $infile   = "lala.txt";
    $outfile1 = "alal_stdio.txt";
    $outfile2 = "alal_sys.txt";

    $t0 = new Benchmark;
    iotest1();
    $t1 = new Benchmark;
    print "the stdio code took:",(timestr timediff($t1,$t0)),"\n";

    $t0 = new Benchmark;
    iotest2();
    $t1 = new Benchmark;
    $td = print "the sysread code took:",(timestr timediff($t1,$t0)),"\n";

    sub iotest2 {
        open(IN,"< $infile");
        open(OUT,"> $outfile2");
        $buff = undef;
        while (sysread(IN,$in,8192)) {
            # This whole section could be redone...just soooo sleepy.
            # Only able to think in a linear fashion atm.
            @in = split /\n/,$buff.$in;
            $buff = pop @in;
            if ($in =~ /\n$/io) {
                $buff .= "\n";
            }
            $out = join "\n",@in;
            # Do your thing....
            chomp($out);
            print OUT $out."\n";
        }
        print OUT $buff;
    }

    sub iotest1 {
        open(IN,"< $infile");
        open(OUT,"> $outfile1");
        while ($in = <IN>) {
            # do whatever....
            print OUT $in;
        }
    }
    # END Script

    OUTPUT generally like:

    the sysread code took:11 wallclock secs (10.52 usr + 0.57 sys = 11.09 CPU)
    the stdio code took: 5 wallclock secs ( 4.45 usr + 0.89 sys =  5.34 CPU)
    the stdio code took:11 wallclock secs (10.42 usr + 0.62 sys = 11.04 CPU)
    the sysread code took: 5 wallclock secs ( 4.62 usr + 0.73 sys =  5.35 CPU)
Re: Performance Question
by samtregar (Abbot) on May 08, 2002 at 15:25 UTC
    Have you profiled your application (using Devel::DProf or the like)? Until you do you're just guessing at the bottleneck. Also, consider arranging the hardware to your advantage. If you can get a situation where you can read the file from one disk and write to a different disk you'll almost certainly see a big improvement in speed.
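    For reference, the usual Devel::DProf invocation looks something like this (the script name is a placeholder):

    perl -d:DProf yourscript.pl    # writes profiling data to tmon.out
    dprofpp tmon.out               # report where the time actually went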

    -sam

Re: Performance Question
by ariels (Curate) on May 08, 2002 at 13:38 UTC

    Your problem isn't in the size of the buffers Perl reads. It rarely is. But I don't understand your system setup. How can you have 70% CPU idle and load average 1.25? The only way I can see that is if you have 2 processors, and a very weird version of top.

    Your numbers seem to suggest you're processing 142KByte/sec, which really isn't a lot. How fast can you copy that file on the same machine?

      Ariels wrote:
      How can you have 70% CPU idle and load average 1.25?

      Load average is the average number of processes waiting for a resource before they can continue running. CPU isn't the only resource: your load can spike if your processes are hammering IO as well.

      My guess is that, assuming your per-line transformations are relatively small, this problem is disk-bound. If that's the case, buffering would help. But there's only one sure way to find out: make some changes and see if it makes a difference.

      One possible scenario is: you have autoflush turned on on your input and/or output filehandles. Every time you read a line, the disk heads travel to one end of the disk. Every time you write a line, the disk heads travel all the way back to the other end of the disk. This isn't unthinkable if your files are large, or if your disk is full or fragmented. With more buffering, you'd decrease the number of disk accesses, and decrease the overhead associated with cross-disk head movement.
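      (If you want to rule that scenario out, a minimal check, assuming a lexical output handle and a placeholder file name; autoflush is off by default, so this just makes sure nothing has switched it on:)

      use IO::Handle;
      open my $out, '>', 'out.txt' or die "Can't open out.txt: $!";
      $out->autoflush(0);    # keep buffering on for the output handle
      $| = 0;                # and for the currently selected handle (STDOUT)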

      Just a guess, but possibly worth checking out.

      Alan

      A relatively high load and low cpu usage is typical of disk thrashing in a software RAID setup.
Re: Performance Question
by roboslug (Sexton) on May 08, 2002 at 14:36 UTC
    I also have a hard time understanding how you can be throttled on perl runtime with only 1.25 load and 70% cpu idle.

    I am not in total agreement that it won't help to increase cache size...it is after all what caches are about...better I/O performance. Just not likely to save the day.

    At a minimum, I would have the output go to many files instead of one. It's bad enough reading in 81GB; writing another single 81GB file just propagates the silliness, and splitting the output will cut down the runtime on the output side.

    I need to think about it some, but maybe using the sysread and sysseek family would be faster since you bypass stdio. You would read in a huge chunk and then split that by LF.

    Another helpful option is to renice the program. Not friendly, but who needs friends.

    If you aren't using perl 5.6, maybe setting O_LARGEFILE would help...again...not sure. Ultimately, C/C++ will give you better performance, but I would rather play Everquest. :-)
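    (If the large-file flag does turn out to matter, a hedged sketch of opening with O_LARGEFILE explicitly; Fcntl only provides the constant on platforms that define it, and the file name is a placeholder:)

    use Fcntl qw(O_RDONLY O_LARGEFILE);
    sysopen(my $in, 'in.txt', O_RDONLY | O_LARGEFILE)
        or die "Can't open in.txt: $!";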

Re: Performance Question
by moregan (Novice) on May 08, 2002 at 15:24 UTC

    Granted I don't know why you have both high idle and low i/o wait, but my eye is attracted to this part of your post:

    "...and prints the line back out to another file on the same disk."

    Is there any way you can have the input and the output files on separate volumes (or strings)? You might have faith in the machine's RAID setup (assuming there is one), but it's not guaranteed to save your bacon. Give it a try, even if the input (or output) has to be in a different part of a LAN.

      I'm not sure I'd consider 76% "high idle". As a matter of fact, if the one program in question is creating most of the 24% processor load and it's being run under a regular user account, it might be throttled for CPU utilization. Many Unix systems will only allow a regular user up to 10% or 15% of processor cycles.

      OTOH, you're definitely right to suggest using separate volumes. If the input file, output file, and swap partition are all on the same RAID logical disk or the same physical disk, it's probably a thrashing issue. I've seen a mail server increase performance by as much as 400% by simply moving the /var/log directory from a RAID volume to a separate disk. It hadn't been doing so well with logging, queueing, and spooling all on one array.

      If it's possible to run something like this as root and the machine loading down isn't an issue, then it's a good idea to run it as root. It's definitely a good idea to look at the disk subsystem, too. Perl solutions may help, too, but I'd look at system issues first in this case.

      Christopher E. Stith
      Do not try to debug the program, for that is impossible. Instead, try only to realize the truth. There is no bug. Then you will find that it is not the program that bends, but only the list of features.
Re: Performance Question
by dws (Chancellor) on May 08, 2002 at 19:01 UTC
    I have some proof-of-concept code in Matching in Huge files that you might be able to adapt if your substitutions span lines and you want to do them large-chunk-at-a-time.

    Reading an 81GB file in 8K chunks requires about 10.7 million reads. You can reduce that number by reading the file in larger chunks, via sysread().

    Another thing you might look at is whether part of the performance hit you're seeing has to do with the disk. If you're writing to the same physical drive that you're reading from, the OS has to move the disk head a lot. This takes time that can add up. Doing writes in larger chunks (via syswrite()) should help, though writing to a separate disk is preferable. Writing to a heavily fragmented drive will also add time.
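    (A minimal sketch of that bigger-chunk pattern, with placeholder file names and the per-line processing left out, just to show the sysread()/syswrite() pairing:)

    use strict;
    use warnings;

    my $chunk = 4 * 2**20;                       # 4MB per request
    open my $in,  '<', 'in.txt'  or die "Can't open in.txt: $!";
    open my $out, '>', 'out.txt' or die "Can't open out.txt: $!";

    my $buf;
    while (1) {
        my $got = sysread($in, $buf, $chunk);
        die "sysread failed: $!" unless defined $got;
        last unless $got;                        # EOF
        # ... transform $buf here, minding partial lines at the chunk edges ...
        my $off = 0;
        while ($off < $got) {                    # syswrite may write less than asked
            my $wrote = syswrite($out, $buf, $got - $off, $off);
            die "syswrite failed: $!" unless defined $wrote;
            $off += $wrote;
        }
    }
    close $out or die "close failed: $!";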

Re: Performance Question
by jsegal (Friar) on May 08, 2002 at 15:26 UTC
    I will second the notion that this is really more a systems issue than a perl issue.
    Others have mentioned pre-sizing your destination file (which could help guarantee contiguous spacing on disk for the file), splitting the file into pieces and processing in parallel.
    But what caught my eye that no one else has responded to (yet) is that you are reading and writing the file from the same disk. This means that for every physical read and write (which will depend on O/S-level buffering) your disk needs to seek. If there is any way you can read from one disk and write to another you should see some speedup there, too.
    Best of luck,

    --JAS
Re: Performance Question
by roboslug (Sexton) on May 08, 2002 at 16:57 UTC
    Actually, mixing sysread with tachyon's buffer/partial line code (I did say mine sucked. :-)) is the fastest so far.

    Got it down to 2 wallclocks. Tachyon's implementation as written (chunk size and all) came in at 4 wall clocks. Didn't mess with chunk size however.

    However, even with removing his this/that, the output I am getting is slightly different, but must sleep. The clowns are coming for me.

Re: Performance Question
by hossman (Prior) on May 09, 2002 at 07:29 UTC
    While I have no direct wisdom to share, I will tell you which 2 questions immediately came to mind when I read your post. Maybe they will shed some light:

    1. You said: "monitoring the output, it looks like it will take 166 hours to run". What are you basing this number on? How are you doing this monitoring? Are you sure that your method of monitoring the output file isn't flawed?
      One scenario I can easily imagine is that you are just checking the file's size X seconds after starting the program, dividing into 81GB, and multiplying by X. Unless you turned on autoflush, maybe you just happened to check the file's size just before it was about to do a batch write -- completely skewing your estimate.

    2. Where is your code? A perl program like you describe sounds extremely simple, but whenever people post questions about programs that do things "simply" without posting any code, I tend to wonder what else is going on. What are "any needed changes"? Are you sure there isn't something else you are doing that's taking a lot of time?
Re: Performance Question
by rbc (Curate) on May 08, 2002 at 20:38 UTC
    I just recently wrote a script that was doing the same thing with a much smaller file, yet it didn't run as fast as I thought it should.

    Turns out I had a bug in the script.

    Here's the buggy script:
    ...
    my $tabCount = 0;
    my $line = "";
    while(<>) {
        ...
        $tabCount += $#tabs+1;
        $line .= $_;
        if ( $tabCount == ENOUGH_TABS ) {
            print "$line\n";
            $tabCount = 0;
        }
    }
    The bug is that I was not doing
    $line = "";
    when I was doing
    $tabCount = 0;
    thus printing a larger $line
    every time and slowing things way down.
    I dunno, but you might wanna make sure you don't
    have a bonehead bug like mine :)
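    (In other words, the corrected end of that loop body:)

    if ( $tabCount == ENOUGH_TABS ) {
        print "$line\n";
        $tabCount = 0;
        $line = "";      # the missing reset
    }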
    Good luck!
Re: Performance Question
by Anonymous Monk on May 09, 2002 at 03:21 UTC
    Fair warning, check how Perl was compiled.

    There is a real possibility that Perl doesn't have large file support compiled in, in which case your script will break 2 GB of the way through the file.

Re: Tie::File
by Revelation (Deacon) on May 09, 2002 at 00:43 UTC
    I would recommend using Tie::File for this.
    Frankly, rewriting the whole file to change just a few lines is a problematic situation with flat files. Although I would prefer some sort of database backend, Tie::File gives you database-like access without needing any fields, etc.

    You can read about Tie::File (written by Dominus) here, or go to CPAN and download the module. Here's an example of how to edit a file from his article.
    tie @lines, 'Tie::File', 'file' or die ...;
    for (@lines) {
        # Do your thing, only for lines that match this regex,
        # or are records that need to be changed.
    }
    untie @lines;
    It should be quicker.
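    (If memory is a concern at 81GB, Tie::File's documented memory option caps its record cache; a hedged variant of the tie line above:)

    use Tie::File;
    tie my @lines, 'Tie::File', 'file', memory => 20_000_000
        or die "Can't tie file: $!";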
    Gyan Kapur
    gyan.kapur@rhhllp.com
      It might (dunno..) be quicker if you're just changing a few dozen lines, but if you're trying to change *a lot* of lines then it's going to be much, much slower.
Re: Performance Question
by tadman (Prior) on May 09, 2002 at 19:49 UTC
    The C function is called setvbuf, which will redefine how buffered data is read. For C programs, this can be a lifesaver. Give it a 1MB buffer and the OS can fetch a lot more data in a given read.

    See the IO::Handle setvbuf function, which should be documented on that page, complete with an example. The on-line docs seem to differ (i.e. are busted) compared to the real ones that ship with 5.6.1.
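    For what it's worth, the call takes a scalar to use as the buffer, a buffering-type constant, and a size, and it only works when perl's I/O is built on a stdio that provides setvbuf (it may simply be unavailable on other builds). A hedged sketch, with a placeholder file name:

    use IO::Handle qw(_IOFBF);
    open my $in, '<', 'in.txt' or die "Can't open in.txt: $!";
    my $buf_store = ' ' x (1024 * 1024);    # pre-sized scalar used as the 1MB buffer
    $in->setvbuf($buf_store, _IOFBF, 1024 * 1024)
        or die "setvbuf failed: $!";
    # call it before any I/O on the handle, and don't touch $buf_store afterwards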
