http://www.perlmonks.org?node_id=1075884

Peterpion has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

In relation to a question I posted a couple of days ago, there's a fairly small question I have which I thought was better asked as a new one. I execute an external program to generate input data for my program, and I've been wondering about scalability and whether I am reading the data into the program in the best way (well, in fact I know I am not, but I just wonder what possibilities are out there).

I use nfdump to generate a dump of fragments of the network 'flows', which I read into an array with split when the program first executes. A flow is a term for a network connection from start to end, with source and destination IP, byte count and a few other bits of info associated with it. Currently I am just using backticks to execute this command, and there's no memory problem with it at the moment, but what if I had much more data? My program grinds the data down to mere traces in comparison (i.e. a few megabytes).

What's the best way to read in a very large amount of data from an external program? Is there anything which could allow a text output of, say, 100GB to not choke the system (I'm thinking of piping the data in, system, etc.)? I think not, and AFAIK the only way to do this would be to modify the nfdump command to pause execution when blocked by my program (or use files, but I'd prefer not to). I wonder if it's possible to block execution of the external program without modifying it.

In case I have not been clear, what I mean is a Perl program I write which calls an external program that generates (say) 100GB of data, which is then read line by line into my program. I believe it will choke as it fills the OS buffers with that 100GB before I start reading it line by line (and processing it). Is there a way to make the external program pause?

In real life I would process smaller chunks of data at a time, but there could be a case when it's desirable to read in a huge amount of data. It could be read from disk by the program generating input for my program, and it could easily be written to disk before slurping it into my program, but is there a way to block execution of an external program? Since the system can pause a process I would imagine there is at least one way, but using signals to pause a process which is generating input seems potentially fraught with deadlock complexity, etc. It's perhaps a slightly theoretical question, but one which I find quite interesting, so the musings of the wise ones would be highly appreciated :-)

Re: Blocking execution of called programs to reduce buffer size
by moritz (Cardinal) on Feb 23, 2014 at 12:06 UTC
    What's the best way to read in a very large amount of data from an external program?

    Pipes! See for example perlopentut for some piping examples.
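
    For instance, a line-by-line read through a pipe might look like the sketch below; the nfdump arguments shown are only placeholders for illustration, not anything from the original post:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical nfdump invocation; replace the arguments with the real ones.
        open my $flows, '-|', 'nfdump', '-r', 'capture.nf', '-o', 'csv'
            or die "Cannot start nfdump: $!";

        while (my $line = <$flows>) {
            chomp $line;
            # Process one flow record at a time. If this loop falls behind,
            # the OS suspends nfdump's writes once the pipe buffer is full.
        }

        close $flows or warn "nfdump exited with status $?";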

    I think not, and AFAIK the only way to do this would be to modify the nfdump command to pause execution when blocked by my program (or use files, but I'd prefer not to). I wonder if it's possible to block execution of the external program without modifying it.

    Writing to a pipe whose buffer is full does in fact block, so unless the program that writes the data takes special care to do non-blocking writes, pipes already do what you want.

    Isn't that wonderful?

Re: Blocking execution of called programs to reduce buffer size
by Laurent_R (Canon) on Feb 23, 2014 at 12:31 UTC
    Yes, using pipes is the right way. I do that quite frequently, for example sorting a huge file with the shell sort command and redirecting the sorted output to a Perl program through a pipe. At the beginning, sort writes nothing out (it has to process at least part of the whole file before it can start printing anything); during that phase, the Perl program just sits idle, waiting for data to arrive. Later on, the sort command might spit out data lines faster than the Perl program is able to process them, but that is OK because the pipe pauses sort and lets the Perl program consume the data at its own speed.
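
    A rough sketch of that pattern, with a made-up file name and sort key purely for illustration, could look like this:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical example: sort a huge file externally, read the result via a pipe.
        open my $sorted, '-|', 'sort', '-k1,1', 'huge_input.txt'
            or die "Cannot start sort: $!";

        while (my $line = <$sorted>) {
            chomp $line;
            # Consume each sorted line at the Perl program's own pace;
            # sort blocks on its writes whenever the pipe buffer is full.
        }

        close $sorted or warn "sort exited with status $?";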
      Ahh, I see (and mentally it all falls into place for me)! I don't know why this didn't occur to me before, since I've written socket code in Perl and paid attention to blocking (and used non-blocking sockets), but never really paid attention to filehandle blocking. I do remember reading a long time ago that pipes behave this way, but I have occasionally hit errors when using pipes in the Unix shell which, as far as I recall, were related to running out of memory. But I guess it all depends on both the exact way a pipe is being used and the particular commands being piped. Piping a program or command which has to write in non-blocking mode into something which does not slurp the data in without hesitation will, I guess, cause an error if the pipe overflows. Use a command or program which does a blocking write and all will work nicely.

      So all I need to do is check that my source program is opening its output in blocking mode, and if it's not, simply change the opening mode to blocking, and the 'pausing' should automagically fall into place. I feel a bit of a twerp for having to ask this, but hey ho, thanks for being gentle!
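
      As a sketch of that check (assuming the writing program were itself a Perl script; STDOUT is used here only as an example handle), one way to detect and clear non-blocking mode is:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use Fcntl qw(F_GETFL F_SETFL O_NONBLOCK);

          # Check whether STDOUT is in non-blocking mode and, if so, make it
          # blocking so writes stall (rather than fail) when the downstream
          # pipe buffer is full.
          my $flags = fcntl(STDOUT, F_GETFL, 0)
              or die "fcntl F_GETFL failed: $!";

          if ($flags & O_NONBLOCK) {
              fcntl(STDOUT, F_SETFL, $flags & ~O_NONBLOCK)
                  or die "fcntl F_SETFL failed: $!";
          }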

      Cheers, Pete