PerlMonks
Perl always reads in 4K chunks and writes in 1K chunks... Loads of IO!

by NeilF (Sexton)
on Jan 01, 2006 at 00:14 UTC ( #520201=perlquestion )

NeilF has asked for the wisdom of the Perl Monks concerning the following question:

I've been doing some analysis of some of my Perl scripts... Take this simple example reading/writing a 1 MB file (with about 3000 lines):-
#!/usr/bin/perl
use Fcntl qw(:DEFAULT :flock);
$|=1;
print "Content-type:text/html;charset=ISO-8859-1\n\n";
open DF,"test.txt";
@test=<DF>;
close DF;
my $rec;
foreach(@test){$rec.=$_;}
sysopen (DF,"test.txt",O_WRONLY | O_CREAT);
syswrite DF,$rec,length($rec);
close DF;
exit;
When I watch this running (XP Pro SP2) using File Monitor (by SysInternals) I can see the read generates a new IO process 4K at a time. Worse still, when writing, it generates an IO process for each 1K... In total this simple operation generates over 1500 IO processes (in File Monitor).

Using standard open and print results in the same number of IO processes (in File Monitor). I've also tried the same thing running Perl under Cygwin with File Monitor watching, and the same results appear: 4K chunks are read and 1K chunks are written.


Is there any way around this? Am I monitoring it correctly? You can see from my example that I've tried using the more exotic calls to try to stop this "buffering"...

Here's a link to an example output from File Monitor showing the 4K chunks being read in: www.hvweb.co.uk/fawcettn/filemon4.gif (Remember to maximise the image size)

UPDATE ---- It seems that when I test using Cygwin, binmode makes NO improvement at all... So what's right? I want to improve the software on my ISP's Unix machine, but I get contradictory results from ActivePerl and Cygwin on my XP system... What can I do :(

Replies are listed 'Best First'.
Re: Perl always reads in 4K chunks and writes in 1K chunks... Loads of IO!
by snowhare (Friar) on Jan 01, 2006 at 02:10 UTC
    What do you get with:
    #!/usr/bin/perl
    local $/;
    open DF,"test.txt";
    binmode DF;
    my $rec = <DF>;
    close DF;
    open (DF,'>test1.txt') || die ("Failed to open test1.txt: $!\n");
    binmode DF;
    print DF $rec;
    close DF;
    exit;
    ?

    When I tested on Linux, I found this version ran 4 times faster than yours reading/writing a 10 megabyte file.

      Definite progress! With my example I'd get around 1611 IO processes reported by File Mon. With yours I only get 600!

      Now, keep in mind I'm trying to read records into an array, which I would process and then write back. Your example does not do this, so I changed it back to be more like mine:-
      open DF,"test.txt";
      binmode DF;
      my @test = <DF>;
      close DF;
      my $rec;
      foreach(@test){$rec.=$_;}
      open (DF,'>test1.txt') || die ("Failed to open test1.txt: $!\n");
      binmode DF;
      print DF $rec;
      close DF;
      Now that worked fine, and I still only get 600 processes reported!

      Note: I had to remove the "local $/;". I'd never seen that before, and it seemed to mean the lines were not read in as an array?

      With further testing, ALL the gains your version shows are in the write. If I replace your binary read with my original read, I still get 600 IO processes. Here's the current example code:-
      open DF,"test.txt";
      my @test=<DF>;
      close DF;
      # Print first 3 elements to prove read as an array
      print "Line 1=$test[0]<BR>Line 2=$test[1]<BR>Line 3=$test[2]<BR>";
      # Can we get around this? Waste of processing & memory!
      my $rec;
      foreach(@test){$rec.=$_;}
      open (DF,'>test.txt') || die ("Failed to open test2.txt: $!\n");
      binmode DF;
      print DF $rec;
      close DF;
      So this is a major improvement on the write side, but the read (at least on my XP system) still causes most of the IO processes, as it's reading in 4K chunks!
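      For the read side, one option (a sketch, not from the thread; the filename and seed data are illustrative) is to force a single unbuffered read of the whole file with sysread, then split it back into lines yourself:

```perl
# Sketch: read the whole file in one sysread call, bypassing the 4K buffered reads.
# The file is seeded here only to keep the example self-contained.
open my $seed, '>', 'test_sysread.txt' or die $!;
print $seed "line $_\n" for 1 .. 3;
close $seed;

open my $fh, '<', 'test_sysread.txt' or die $!;
binmode $fh;
my $size = -s $fh;                               # file size in bytes
sysread( $fh, my $buf, $size ) == $size or die "short read";
close $fh;

my @test = split /(?<=\n)/, $buf;                # back into lines, newlines kept
print scalar(@test), " lines\n";                 # prints "3 lines"
```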

      Also, having to build the array into another variable ($rec) before writing eats up memory and processing. Is there a better way to print that array as a single combined variable? (If I print DF @test instead of $rec, the IO figure increases from 600 to 841!)
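      As an aside (a sketch, not from the thread): join builds the combined string in one call, which replaces the explicit foreach loop, though the data is still briefly held twice:

```perl
# Sketch: replace the foreach accumulation loop with a single join.
my @test = ( "line 1\n", "line 2\n", "line 3\n" );
my $rec  = join '', @test;          # one call instead of a per-element loop
print length($rec), "\n";           # prints "21" for this toy data
```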
      I initially thought this was STUNNING! I tested it with Perl on my XP system and saw a dramatic reduction in IO processes shown in FileMon...

      HOWEVER, I thought I'd test it via Cygwin, and I get NO difference/improvement when using binmode...

      The software will run (at the end of the day) on my ISP's Unix system, so I need to be sure it will help on there...

      Now everything is back up in the air again with different results from two different perl platforms on my machine :(

      I'm very worried now that what my system is showing me is not a fair reflection of what will happen with Perl on Unix, etc... ie: I could drastically reduce the IO processes shown by FileMon on my system, but not make the slightest difference when the software is running on my ISP...
Re: Perl always reads in 4K chunks and writes in 1K chunks... Loads of IO!
by BrowserUk (Pope) on Jan 01, 2006 at 15:44 UTC

    You can slurp the file in one read and split it yourself:

    #! perl -slw
    use strict;

    my $file = 'test.txt';
    open DF, '<:raw', $file or die "$file : $!";
    my @test = split "\n", do{ local $/ = \ -s( $file ); <DF> };
    close DF;

    However, if you are reading this file frequently (like every time a web page is hit, as suggested by your example), then you are probably worrying about the wrong thing. After the first time the file is read, it will be cached in the file system cache, so the second and subsequent times you read it, the 4K reads will be coming from cache. You can demonstrate this to yourself if you have a disk activity LED on your machine. Run the above script and you should see the disk hit for a sustained period the first time. On the second and subsequent runs you may see a brief access but no sustained hit.

    Equally, whilst you may see many 1K calls to the system write api, these will frequently be cached in ram and written to disk asynchronously as the demands on the cache dictate. For example, the system may decide to write chunks out when the disk head is in approximately the correct position following disk activity by other processes. If you attempt to optimise the writing by your process, you could interfere with the dynamics of the overall system which could actually result in slower throughput. The very best way to ensure optimal IO for your process and throughput by the entire system is to increase the proportion of your ram that is devoted to the system cache.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk, thanks... Two questions/comments regarding your post.

      Wouldn't your code mean the lines are stripped of the line feeds they originally had? ie: when you came to write the array out, it would no longer have the line feeds and you'd have to add them back to every line?

      The area I'm looking at is where I'm posting a new message in a forum, which reads the forum in, manipulates the lines and then writes them back out. So this code is not used for general browsing, just when updating.


      I'll have a play with your example and see what the outcome is... You reckon it reads it in one(ish) hit and not in horrible 4K blocks?

        The problem is, it is quite likely that your ISP is measuring your IO in terms of bytes read and written rather than the number of reads and writes, so reducing the latter is unlikely to satisfy them.

        Also, when you have read the entire file, there is no need to re-write the entire thing in order to add a new line. If you open the file for reading and writing, then when you have read it, the file pointer will be perfectly placed to append any new line to the end. That will reduce your writes to 1 per new addition. If there is no new addition (the user is just refreshing), you'll have no writes.
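        A rough sketch of that read-then-append pattern (the filename and contents are illustrative, not from the thread):

```perl
use Fcntl qw(:flock);

# Seed a small file so the sketch is self-contained.
open my $seed, '>', 'forum.txt' or die $!;
print $seed "post 1\npost 2\n";
close $seed;

# Open read/write: after slurping, the file pointer sits at EOF,
# so one print appends the new record without rewriting the file.
open my $fh, '+<', 'forum.txt' or die "forum.txt: $!";
flock $fh, LOCK_EX or die "flock: $!";
my @lines = <$fh>;                  # pointer now at end of file
print $fh "post 3\n";               # single appending write
close $fh;                          # also releases the lock

print scalar(@lines), " read, 1 appended\n";   # prints "2 read, 1 appended"
```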

        Also, you presumably do not redisplay the entire forum each time, but rather only the last 20 or so lines?

        If this is so, then you should not bother to re-read the entire file each time, but rather use File::ReadBackwards to get just those lines you intend to display. If you do this, then you can use seek FH, 0, 2 to reposition the pointer to the EOF and then append new lines without having to re-write the entire file each time.

        Using this method, you can fix the total overhead per invocation to (say) 20 reads and 0 or 1 writes. You'll need to deploy locking, but from your code above you seem to be already familiar with that.
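        File::ReadBackwards does this robustly; a rough core-only sketch of the same idea (the 1K tail window and filename are arbitrary assumptions) looks like:

```perl
# Seed a file so the sketch is self-contained.
open my $seed, '>', 'forum.txt' or die $!;
print $seed "old post $_\n" for 1 .. 100;
close $seed;

# Sketch: read only the tail of the file, then seek to EOF and append.
open my $fh, '+<', 'forum.txt' or die "forum.txt: $!";
seek $fh, -1024, 2 or seek $fh, 0, 0;   # ~1K before EOF (whence 2 = SEEK_END)
my @tail = <$fh>;                       # only the last few lines
shift @tail if @tail > 1;               # first element may be a partial line
seek $fh, 0, 2;                         # reposition to EOF
print $fh "new post\n";                 # append without rewriting the file
close $fh;
```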



        I just realised I completely ignored one of your questions.

        Wouldn't your code mean the lines are stripped of the line feeds they originally had?

        Yes, as I coded it the newlines would be removed. This would effectively do a free chomp @test;. I don't see this as a problem as it would cost very little to replace them when writing the lines out again.

        However, if you want them left in place, then you could use the following split instead.

        #! perl -slw
        use strict;

        my $file = 'test.txt';
        open DF, '<:raw', $file or die "$file : $!";
        my @test = split /(?<=\n)/, do{ local $/ = \ -s( $file ); <DF> };
        close DF;

        All that said, if you are only appending to the end of the file, why read the file at all? Have you heard of opening a file for append?
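        For completeness, an append-mode open (a minimal sketch; the filename is illustrative) never reads the file at all:

```perl
use Fcntl qw(:flock);

# Sketch: '>>' opens at EOF (creating the file if needed);
# each print appends without any read.
open my $fh, '>>', 'forum.txt' or die "forum.txt: $!";
flock $fh, LOCK_EX or die "flock: $!";  # still lock against concurrent posters
print $fh "another post\n";
close $fh;
```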


Re: Perl always reads in 4K chunks and writes in 1K chunks... Loads of IO!
by Aristotle (Chancellor) on Jan 01, 2006 at 15:07 UTC

    Your screenshot contains lines mentioning a path that ends in \cgi\forum\test2.txt. Are you trying to write a forum script that stores posts in a flat text file? That would be a bad idea, and no amount of micro-optimising Perl's I/O can make it fast.

    Makeshifts last the longest.

      Aristotle, yes it is for a forum...

      I wrote it 4-5 years ago and it's been chugging along very nicely for all those years. It's not heavily used, ie: typically only 0-12 people using it at any one time...

      However, the io processing has been noticed by my ISP hence my attempts to reduce it.

      I've made a number of mods already that have generally improved/reduced IO processes, but one of the biggest hogs is when posting a message.

      It seems, at least with writing:-
      a) Just adding "binmode" prior to the write will more than halve the IO. And this can be applied to all such writes to improve things across the board!
      b) Consolidating the array into a variable will further reduce the IO. However, I don't like the idea of having a loop that goes through X thousand records, with the data held twice (once in the array and once in a variable) at the same time!

      So my two questions are currently:-
      a) Is there a simple means of improving the reading, as has seemingly been found for the write? Reading in 4K chunks, blah!
      b) Is there a better means of printing the array as a consolidated variable, so the data is not held twice in memory?
        However, the io processing has been noticed by my ISP hence my attempts to reduce it.
        If your ISP is poking into it that much, it's either time to renegotiate with your ISP to make that a permitted activity, or get any one of the four million ISPs that won't care.

        Most ISPs charge for network bandwidth and disk storage, and would never care how many times you call read() or write().

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: Perl always reads in 4K chunks and writes in 1K chunks... Loads of IO!
by serf (Chaplain) on Jan 01, 2006 at 15:47 UTC
    If you are concerned about the performance while reading the file into an array I would recommend doing some sample code and running it against a large file (say 100MB or more) to test the speed difference between doing:
    @array = <FILE>;
    and doing:
    while(<FILE>) { push(@array, $_); }
    and run it multiple times to make sure you're not just getting the effect of the file being cached in memory.

    I have found that while the version with the while loop *looks* longer it actually has always run faster in the tests I've done.
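    One way to run that comparison is the core Benchmark module - a sketch (the file size and iteration count are placeholders; use a 100MB+ file and more iterations for meaningful numbers):

```perl
use Benchmark qw(cmpthese);

# Seed a modest test file so the sketch is self-contained.
open my $fh, '>', 'bench.txt' or die $!;
print $fh "some line of text $_\n" for 1 .. 10_000;
close $fh;

# Compare slurping into an array against a while/push loop.
cmpthese( 50, {
    slurp => sub {
        open my $in, '<', 'bench.txt' or die $!;
        my @array = <$in>;
        close $in;
    },
    push_loop => sub {
        open my $in, '<', 'bench.txt' or die $!;
        my @array;
        while (<$in>) { push @array, $_ }
        close $in;
    },
} );
```

    Running it several times, as suggested above, washes out the first-read cache effect.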

    PS: a die message with your open statement like

    open (DF, "test.txt") || die "Can't read 'test.txt': $!\n"
    is your friend, as are:
    use strict; use warnings;
    at the top of your script - I'd recommend using them. They will save you time finding what's causing errors, and in the long run should also help you write better code by teaching you good habits.

    :o)

    update: Thanks ChOas - I've fixed it. I always use the () brackets and || myself - and vaguely recalled (like you point out) that there *was* a difference between || and 'or'.

    After having Dominus do a presentation for us the other week and finding I am in the habit of using () where I don't absolutely need to, I thought I'd not add them here where NeilF wasn't already using them... I've put them back on now :o)

    running:

    perl -MO=Deparse -e 'open (DF, "test.txt") || die "Cant read test.txt\n";'
    tells me I *could* write it:
    die "Can't read test.txt\n" unless open DF, 'test.txt';
    but I won't :o)
      This:

      open DF, "test.txt" || die "Can't read 'test.txt': $!\n"

      Does not do what you think it does.

      The || ties itself to "test.txt", which is always true, and not to the return of the open.

      This:
      open(DF, "test.txt") || die "Can't read 'test.txt': $!\n"

      or:

      open DF, "test.txt" or die "Can't read 'test.txt': $!\n"

      (or binds less tightly than ||)

      Would accomplish what you want.
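      A quick demonstration of the trap (the filename is deliberately nonexistent and purely illustrative):

```perl
# Because || binds tighter than the comma, this parses as
# open(DF, ("no_such_file.txt" || die ...)) - the die can never fire.
my $ok = open DF, "no_such_file.txt" || die "unreachable";
print $ok ? "opened\n" : "open failed, yet die never ran\n";
```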


      GreetZ!,
        ChOas

      print "profeth still\n" if /bird|devil/;
        Why are you measuring under Windows to see what will happen on Unix?
        If you've only got one machine to play with, why not boot off a LiveCD (like Knoppix) and measure your code (or a key subset) under Linux?
        It might not be the same OS your ISP is using, but it's closer to Unix than Windows is.
        It might make absolutely no difference, but at least you'd be a bit closer to comparing apples to apples, rather than apples (Unix) to oranges (Windows)...

Node Type: perlquestion [id://520201]
Approved by Happy-the-monk
Front-paged by sk