Suggestions for optimizing this code...

by NeilF (Sexton)
on Jan 16, 2006 at 22:14 UTC

NeilF has asked for the wisdom of the Perl Monks concerning the following question:

I want to read in thousands of records into an array and then process them. I'm not particularly interested in the processing of the array itself; instead I'm interested in optimising the reading of the "\n"-separated data into an array (via sysread), and then the writing of it all out again.

As the data could be a couple of megabytes in size, I want as few occurrences of the data being processed/used as possible...

I have to have it in an array, and I have to use sysread and syswrite to read/write the records in one go because my ISP counts IO processes...

Thanks in advance for any help...

my $rec;
sysopen(DF, "test.txt", O_RDONLY | O_CREAT);
sysread(DF, $rec, -s DF);
close DF;

# Split up records into array.
# Will lose \n on recs - add later.
@test = split("\n", $rec);

# Do some work on @test
# Work... Work... Work...

# Build up into a single record, putting \n back in
my $rec;
foreach (@test) { $rec .= $_ . "\n"; }

sysopen(DF, "test.txt", O_WRONLY | O_CREAT);
syswrite DF, $rec;
close DF;

note: This was covered a little in a previous thread, but someone suggested starting a new thread just for this optimisation... So here it is :)

Re: Suggestions for optimizing this code...
by Roy Johnson (Monsignor) on Jan 16, 2006 at 22:19 UTC
    There's no optimization to be done as far as slurping and spewing the file. A few minor points about the rest of it:
    If you don't want to lose the newlines when you split, make it
    @test = split /(?=\n)/, $rec;
    Look at join for turning the array back into a single string.

    From the code you post, there's no reason to declare $rec a second time.
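
    For instance, a minimal round-trip sketch of those two points (the sample data is made up):

    my $rec  = "one\ntwo\nthree\n";
    my @test = split /(?=\n)/, $rec;   # ("one", "\ntwo", "\nthree", "\n")
    # ... work on @test ...
    $rec = join '', @test;             # nothing was removed, so join on ''

    Note that the lookahead keeps each newline at the front of the following element rather than at the end of the current one; joining on the empty string still restores the record byte for byte.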


    Caution: Contents may have been coded under pressure.
      In tests (on Perl on XP), splitting on "(?=\n)" is far, far slower than a regular split on just "\n"...
      Is there a way to avoid using $rec and instead split the record up within the sysread itself? e.g. something akin to:-

      sysread(DF, split(/(?=\n)/, $_[0]), -s DF);

      I'm sure that's gibberish, but you get the idea...

        No. sysread is there for when you want large chunks of data and need to read them unbuffered. No one does that when they need to process data line-wise – well, unless they are forced to operate under cartoonishly arbitrary and nonsensical constraints like “minimise the number of syscalls.”

        Seriously, what your hoster is asking of you makes no sense at all.

        Makeshifts last the longest.

Re: Suggestions for optimizing this code...
by Tanktalus (Canon) on Jan 16, 2006 at 22:33 UTC

    Your ISP counts IO processes??? How about RAM usage? And what the heck is an "IO process"? And how can your ISP tell?

    Generally speaking, reading a record, working with it, saving it, and moving to the next record is usually more CPU/RAM friendly (you can work on stuff while the next chunk is still coming off disk, and while the last chunk is still going to disk). Trying to count the number of IO calls you make is entirely short-sighted of your ISP, if that is indeed what they're doing.
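
    For illustration, a minimal sketch of that record-at-a-time pattern (the temp-file name is made up; renaming over the original at the end keeps the update safe):

    open my $in,  '<', 'test.txt'     or die $!;
    open my $out, '>', 'test.txt.new' or die $!;
    while ( my $line = <$in> ) {
        # ... work on $line ...
        print {$out} $line;
    }
    close $in;
    close $out or die $!;
    rename 'test.txt.new', 'test.txt' or die $!;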

    As has been mentioned, you can use /(?=\n)/ for your split. In the reverse, you should use join to, well, join them back together. join is the opposite of split:

    @test = split /\n/, $rec;
    # ... work ...
    $rec = join "\n", @test;
    In this case, you are splitting (and thus removing) \n's. In the next case, based on Roy Johnson's advice, you are removing nothing, so you join with nothing:
    @test = split /(?=\n)/, $rec;
    # ... work ...
    $rec = join '', @test;
    Hope that helps.

      Is there not a way to avoid using $rec and instead split the record up within the sysread itself? e.g. something akin to:-

      sysread(DF, split(/(?=\n)/, $_[0]), -s DF);

      I'm sure that's gibberish, but you get the idea...
Re: Suggestions for optimizing this code...
by BrowserUk (Patriarch) on Jan 17, 2006 at 04:00 UTC

    This assumes that you don't need random access to the lines in the file, but are only using the array for processing the lines, whilst achieving your need to read and write the file as single operations. It also requires 5.8.x.

    The following loads, processes and writes a 14MB/1 million line file and consumes a total of 30MB (essentially 2× the file size) in 3 seconds.

    For comparison, your original code using the array and performing the same (m[ ]) operation on each line consumes 146MB and takes 14 seconds.

    my $rec;
    sysopen(DF, $ARGV[0], O_RDONLY | O_CREAT) or die "$ARGV[0] : $!";
    sysread(DF, $rec, -s DF);
    close DF;

    open IN,  '<', \$rec    or die $!;
    open OUT, '>', \my $out or die $!;

    seek OUT, length( $rec ) - 1, 0;
    print OUT ' ';
    seek OUT, 0, 0;

    while( <IN> ) {
        ## Do stuff to this line in $_;
        m[ ];
        print OUT;
    }

    sysopen(DF, "test.txt", O_WRONLY | O_CREAT);
    syswrite DF, $out;
    close DF;

    What the code does is open the scalar into which you slurped the file as an in-memory filehandle. It also opens and pre-sizes an output in-memory filehandle. You then read from the in-'file' and write to the out-'file' one line at a time in the normal way, and once you've finished, you write the out-file, which is really just a second huge scalar ($out), to the real output file in a single spew.

    Avoiding creating the array saves both a substantial amount of memory and time.
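
    For anyone unfamiliar with the 5.8 feature in play: open can take a scalar reference instead of a filename, so the "files" above live entirely in memory. A tiny self-contained illustration:

    my $data = "alpha\nbeta\n";
    open my $in,  '<', \$data    or die $!;   # read lines out of a scalar
    open my $out, '>', \my $copy or die $!;   # write lines into another scalar
    print {$out} $_ while <$in>;
    close $in;
    close $out;
    print $copy;                              # prints "alpha\nbeta\n"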


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Suggestions for optimizing this code...
by Aristotle (Chancellor) on Jan 17, 2006 at 05:55 UTC

    First off: I’d simply switch to a less bozotic provider as soon as possible.

    That said, for the principle of it,

    my $rec;
    sysopen my $fh, "test.txt", O_RDWR | O_CREAT or die "$!";
    sysread $fh, $rec, -s $fh;

    my $offs = 0;
    while( $offs < length $rec ) {
        my $next_eol_offs = index $rec, "\n", $offs;
        $next_eol_offs = length( $rec ) - 1 if $next_eol_offs == -1;
        my $str = substr $rec, $offs, $next_eol_offs - $offs + 1;
        # work on $str; note that it includes the newline
        substr( $rec, $offs, $next_eol_offs - $offs + 1 ) = $str;
        $offs += length $str;
    }

    seek $fh, 0, 0;
    syswrite $fh, $rec;
    close $fh or die "$!";

    Note that there are error checks in here that your own code did not include.

    This will execute as few “I/O processes” as possible and consume as little memory as possible (well, it could consume a tiny bit less if you use substr as an lvalue instead of making a copy, but that is fraught with bugs), but at the cost of stupidly high CPU consumption and convoluted code. Doing it in a more natural way would consume minimal CPU and memory resources and do no more I/O than this way does, only it would stretch the I/O over more “I/O processes.”
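
    For the record, the lvalue variant alluded to above would look something like this (a sketch only; the s/foo/bar/ is a stand-in for real work, and only length-preserving edits are safe, which is part of why it is fraught):

    # foreach aliases its loop variable to the lvalue that substr returns,
    # so the edit lands directly in $rec without copying the line out.
    for my $line ( substr $rec, $offs, $next_eol_offs - $offs + 1 ) {
        $line =~ s/foo/bar/;
    }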

    I can’t imagine why any hoster would think forcing their customers to burn a ton of CPU is a good idea, unless either their tech dep’t is clueless (their use of the term “I/O process” makes me inclined to assume this) or their storage subsystem is seriously under-budgeted, so they’re forcing their customers to rewrite their code in harder to maintain fashion to evade the I/O bottleneck by trading CPU time for I/O. (But even that is a far-fetched explanation, and I’m not sure if doing the same amount of I/O in fewer syscalls would be any help. I vote “clueless.”)

    Roughly.

    Whatever the case, I’d run away from them instead of making my code a damn sight less readable.

    Update: fixed code per BrowserUk’s reply below.

    Makeshifts last the longest.

      Using your code above as is, with the exceptions of adding a shebang line and using $ARGV[0] for the filename, it crunches with

      P:\test>523624-a 1000000.dat
      syswrite() on closed filehandle $fh at P:\test\523624-a.pl line 26.
      Bad file descriptor at P:\test\523624-a.pl line 26.

      Which I do not understand at all. Any thoughts?

      Update: Ok. I couldn't see it for looking, but you have a comma instead of a semicolon on the syswrite line.
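
      That also explains the symptoms: with the trailing comma, close becomes syswrite's third (LENGTH) argument, so the handle is closed before syswrite ever runs, and the failed syswrite then trips the or die. A reconstruction (inferred from the error text, not the actual posted code):

      syswrite $fh, $rec,      # comma where a semicolon belonged...
      close $fh or die "$!";   # ...parses as syswrite($fh, $rec, close $fh) or die "$!"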


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Suggestions for optimizing this code...
by BrowserUk (Patriarch) on Jan 16, 2006 at 23:50 UTC

    Do you need random access to the lines or are you just going to iterate over the array once from beginning to end?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I do need that array... Maybe what I'll do is read buffered, but write using syswrite. All I'd have to do then is replace:-

      open DF,">test.txt"; print DF @test; close DF;


      with:-

      sysopen(DF,"test.txt", O_WRONLY | O_CREAT); syswrite DF,join('', @test); close DF;


      At least then the IO Processes for writing are down to 1!
Re: Suggestions for optimizing this code...
by Sioln (Sexton) on Jan 17, 2006 at 06:56 UTC
    open DF, 'test.txt';     # open for reading
    my @test = <DF>;         # $test[0] - first string, $test[1] - second..
    close DF;                # clear

    # Do some work on @test
    # Work... Work... Work...

    open(DF, '>test.txt');   # open for writing
    print DF @test;          # flush array into file
    close DF;                # clear
      Unfortunately (as mentioned in my original post) I have to use sysread & syswrite. These basically generate only one IO process. In the case of, say, a one-meg file, reading and writing using buffered IO will use 2000 IO processes each way...
