PerlMonks
Increasing the write buffer

by accassar (Initiate)
on Jun 21, 2008 at 04:43 UTC (#693249=perlquestion)
accassar has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a situation where I have to write to 30,000 files simultaneously, based on the contents of X input files. So I open each file:
open $fileno, '>>', $filename or die "can't append to $filename: $!";
I write to the file:
printf {$fileno} "%s\n", $myline;
The problem is that with so many simultaneous writes going to so many different files, my disks are thrashing. What I want to do is increase the write buffer per file handle, and flush only when that buffer fills or when it holds X lines.

Re: Increasing the write buffer
by BrowserUk (Pope) on Jun 21, 2008 at 05:14 UTC
    Hi, I have a situation where I have to write to 30,000 simultaneous files

    And you've actually succeeded in opening 30,000 files concurrently? Most systems have a limit on the number of open file handles, and it is usually far lower than that.
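    One way to find out is to probe the limit empirically. A minimal sketch (the 40,000 cap and the temp-dir layout are just illustrative assumptions):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Temp qw(tempdir);

    # Open files until the OS refuses, counting how many we actually got.
    my $dir = tempdir( CLEANUP => 1 );
    my @handles;                     # keep every handle open on purpose
    my $count = 0;

    while ( $count < 40_000 ) {      # arbitrary upper bound for the probe
        open my $fh, '>', "$dir/f$count"
            or last;                 # typically fails with EMFILE here
        push @handles, $fh;
        $count++;
    }

    print "opened $count files before hitting the limit\n";
    close $_ for @handles;
    ```

    On a stock Linux login shell this usually stops a little under the `ulimit -n` value, since stdin, stdout, and stderr already occupy descriptors.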


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Increasing the write buffer
by ikegami (Pope) on Jun 21, 2008 at 05:34 UTC
    You could do your own buffering.
    my %bufs;
    my $buf_size = 8192;   # flush in multiples of this many bytes

    sub buf_print {
        my $fh = shift @_;
        $bufs{$fh} .= join($,, @_) . $\;
        my $to_write = int( length($bufs{$fh}) / $buf_size ) * $buf_size;
        syswrite($fh, substr($bufs{$fh}, 0, $to_write, '')) if $to_write;
    }

    sub buf_flush {
        my $fh = shift @_;
        my $to_write = length($bufs{$fh});
        syswrite($fh, substr($bufs{$fh}, 0, $to_write, '')) if $to_write;
    }

    The above isn't nicely packaged, but you get the idea.

    If you're writing the same thing to every handle, you can even use a single buffer instead of a buffer per handle. Efficient!

Re: Increasing the write buffer
by accassar (Initiate) on Jun 21, 2008 at 08:22 UTC
    Opening 30K+ files is not a problem. I was hoping for some hidden mechanism to better control the output buffering; however, your idea of controlling it manually is not a bad one. I've heard that writing "\n" to an output stream forces a flush. Can anybody confirm this?
        And just as a matter of curiosity, what OS so generously gives you 30K + X simultaneously open file handles?

        Linux probably does. I've never actually tried it, but cat /proc/sys/fs/file-max gives me a number like 300,000. The usual lower limits are probably set per process at login; ulimit -n shows 1024 for me.

        I imagine that, after a certain point, having that many open file handles is counterproductive, but I could be wrong.

        -Paul
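        For reference, the knobs involved on a typical Linux box look like this (paths and defaults vary by system; raising the soft limit assumes the hard limit permits it):

        ```shell
        cat /proc/sys/fs/file-max     # kernel-wide cap on open file descriptions (Linux)
        ulimit -Hn                    # this process's hard limit on open descriptors
        ulimit -n                     # the soft limit, often 1024 by default
        ulimit -n "$(ulimit -Hn)"     # raise the soft limit up to the hard limit
        ulimit -n                     # confirm the new soft limit
        ```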

      There are three buffering states:

      • unbuffered
      • block buffered: Writes when 4KB* is accumulated.
      • line buffered: Writes when \n is encountered or 4KB* is accumulated.

      When a handle is buffered, it is line buffered if and only if it is connected to a terminal; otherwise it is block buffered.

      So, if you're writing to a file, \n won't flush. But if you're writing to STDOUT and it hasn't been redirected, \n will flush if buffering wasn't turned off.

      You can turn on buffering using

      use IO::Handle qw( );
      FH->autoflush(0);

      You can turn off buffering using

      use IO::Handle qw( );
      FH->autoflush(1);

      You can flush manually using

      use IO::Handle qw( );
      FH->flush();

      * — Well, I *think* the buffer size is 4KB.
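      A quick way to see this for yourself: write a line ending in "\n" to a plain file and compare the on-disk size before and after an explicit flush. (The exact buffer size is an internal detail; the point is only that the newline alone does not push the data out to a non-terminal handle.)

      ```perl
      use strict;
      use warnings;
      use IO::Handle;
      use File::Temp qw(tempfile);

      my ( $out, $path ) = tempfile( UNLINK => 1 );

      print {$out} "hello\n";        # sits in perl's buffer; no write syscall yet
      my $size_before = -s $path;    # expect 0: nothing has reached the disk

      $out->flush;                   # push the buffer out to the OS
      my $size_after = -s $path;     # expect 6: "hello\n"

      print "before flush: $size_before bytes; after flush: $size_after bytes\n";
      close $out;
      ```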

      Writing a "\n" to the disk does not force a flush. From perlvar:

      STDOUT will typically be line buffered if output is to the terminal and block buffered otherwise.
      Note that $| (autoflush) affects only the currently selected output channel, which is STDOUT by default; select($fh) is how you point it at another handle.

      In general everything you write to the disk is greedily buffered by the OS unless and until the OS runs out of buffer-cache, at which point everything slows to a crawl.


      s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
Re: Increasing the write buffer
by apl (Monsignor) on Jun 21, 2008 at 13:22 UTC
    Unless you need to write to each of the 30,000 files after each line of input, I'd keep an array of up to 100 pending records per output file (30,000 arrays in all), and push to the relevant array as input arrives.

    Once you have 100 records associated with an output file, I'd

    • open the file to append
    • write the 100 records
    • flush and close the file
    • recycle
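    The steps above can be sketched as a hash of pending lines per output file, flushed in batches; the batch size of 100 and the helper names are just placeholders:

    ```perl
    use strict;
    use warnings;

    my $BATCH = 100;    # flush a file once this many lines are pending
    my %pending;        # output filename => arrayref of buffered lines

    sub buffered_print {
        my ( $file, $line ) = @_;
        push @{ $pending{$file} }, $line;
        flush_file($file) if @{ $pending{$file} } >= $BATCH;
    }

    sub flush_file {
        my ($file) = @_;
        return unless $pending{$file} && @{ $pending{$file} };
        open my $fh, '>>', $file
            or die "can't append to $file: $!";
        print {$fh} "$_\n" for @{ $pending{$file} };
        close $fh or die "can't close $file: $!";
        @{ $pending{$file} } = ();    # recycle the buffer
    }

    sub flush_all { flush_file($_) for keys %pending }
    ```

    Opening, writing, and closing in one burst means only one of the 30,000 files is open at any moment, trading file handles for memory.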
Re: Increasing the write buffer
by roboticus (Canon) on Jun 21, 2008 at 13:56 UTC
    accassar:

    This actually sounds more like a job for syslogd. You have a task that generates the data, and uses the logging facility to broadcast it. Then you could put a few different computers on your network, each monitoring the log and filtering out what they want, writing the data to the results files.

    Of course, it's quite likely that you won't get syslogd to write the data in exactly the format you want. In that case, you could simulate it yourself easily enough with sockets. Figure out what information your clients will need to decide which file(s) to write the data to, then prefix each record with that routing data. Your clients can then read a configuration file to decide which data blocks go to which files.
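    The prefix scheme might look like this minimal sketch; the tab-separated wire format and the function names are assumptions for illustration, not anything syslogd actually does:

    ```perl
    use strict;
    use warnings;

    # Producer side: tag each payload with the file it should land in.
    sub pack_record {
        my ( $target_file, $payload ) = @_;
        return "$target_file\t$payload\n";
    }

    # Consumer side: split the routing prefix back off the payload.
    sub route_record {
        my ($line) = @_;
        chomp $line;
        my ( $target_file, $payload ) = split /\t/, $line, 2;
        return ( $target_file, $payload );
    }
    ```

    Each client would read records off its socket, call route_record, and append only the payloads whose target matches its configured set of files.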

    I don't know POE, I've never used it ... but from what I've heard here, it sounds like an interesting way to get started...

    ...roboticus
Re: Increasing the write buffer
by pc88mxer (Vicar) on Jun 21, 2008 at 15:10 UTC
Re: Increasing the write buffer
by starbolin (Hermit) on Jun 21, 2008 at 17:33 UTC

    I'm having a hard time convincing myself that increasing the size of the writes would have any effect at all, since the limiting factor is the number of dirty pages the VM manager is willing to allow. Once the VM manager starts flushing dirty pages to free memory, disk writes become synchronous.



    s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
      Looking back in the archives, I found that the IO buffer can indeed be changed; however, it requires recompiling perl.

      After modifying perlio.c to double the buffer size, what used to take 3 hours now takes only 2. I'll play with it some more to see what I can get out of it.

      Thanks for all your comments.

Node Type: perlquestion [id://693249]
Approved by Sinistral