
Re: Is foreach split Optimized?

by sundialsvc4 (Abbot)
on Jul 09, 2017 at 11:12 UTC ( #1194616 )

in reply to Is foreach split Optimized? (Update: No.)

To minimize actual handling of memory ... which can be punishingly slow if there is any contention for it (e.g. on a production rather than a development machine, and when the strings in question are large) ... I normally use the index function with its third (starting-position) argument.   Simply look for the next occurrence of a "\n" (newline), starting at position zero.   Extract that substring into a new variable with substr, without altering the original string, then update the position variable so that the next iteration of the loop searches from that point rather than from the start of the string.   Rinse and repeat.   The original string is never modified; a while loop simply extracts a copy of the next piece, one piece at a time.

Source-code example is left as an exercise to the reader.   (It isn’t difficult.)   Windows and Linux files use different line-ending sequences ("\r\n" versus "\n"), but the approach is the same.
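As one rough attempt at that exercise (a sketch only; the sample string, variable names, and array are mine, not the author's):

```perl
use strict;
use warnings;

# Walk a string with index()'s third (starting-position) argument,
# copying out one line at a time; the original string is never modified.
my $string = "first\nsecond\nthird\n";
my @lines;
my $pos = 0;

while ( ( my $nl = index( $string, "\n", $pos ) ) != -1 ) {
    push @lines, substr( $string, $pos, $nl - $pos );  # extract a copy
    $pos = $nl + 1;                                    # search onward from here
}
# For "\r\n" files, also strip a trailing "\r" from each extracted line.
```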

If the strings originate from a file, I also normally do not “slurp” the entire file, nor do I memory-map it.   (But, see below.)   I just read it, say, 100K bytes at a time.   The only complication of this technique is that the final string extracted from a buffer is probably incomplete (UTF: see below), a case that is easy to identify because index fails to find a newline.   A new buffer string is created to replace the old one (somehow usually faster than modifying the old one, if it is large), containing just the leftover characters; another chunk is read from the file and appended to it; and the rinse-and-repeat continues.   No matter how big the file, the memory footprint of the application remains small and predictable, so performance probably will too.   (“If the file is, say, ten times as large, it probably will take about ten times as long.”)
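A hedged sketch of that buffering loop.   The chunk size follows the 100K figure above; the File::Temp scaffolding exists only to make the example self-contained, and the file contents are mine:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small sample file so the sketch is self-contained.
my ( $fh_w, $path ) = tempfile();
print {$fh_w} "alpha\nbeta\ngamma";     # note: no trailing newline
close $fh_w;

my $CHUNK = 100 * 1024;                 # 100K bytes per read, as described
open my $fh, '<', $path or die "open: $!";

my ( $buffer, @lines ) = ('');
while ( read( $fh, my $piece, $CHUNK ) ) {
    $buffer .= $piece;
    my $pos = 0;
    while ( ( my $nl = index( $buffer, "\n", $pos ) ) != -1 ) {
        push @lines, substr( $buffer, $pos, $nl - $pos );
        $pos = $nl + 1;
    }
    $buffer = substr( $buffer, $pos );  # fresh buffer holding only the tail
}
close $fh;
push @lines, $buffer if length $buffer; # final line lacked its newline
```

Memory use is bounded by the chunk size plus one (possibly long) carried-over partial line, regardless of file size.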

It is worth mentioning here that this deliberately memory-conservative approach is, for me, a “lesson learned” from scripts in various languages that ran just fine on the developer’s of-course-beefy hardware but caused time-robbing thrashing on busy production boxes.   This is less of a concern today than it used to be, but it is still a concern.   (If a computer system does begin to thrash, completion times degrade exponentially, not linearly, and the effects are catastrophic not only for the offending program but for the entire system.)

If you are splitting a file that might contain UTF-8 or UTF-16 text, turn off UTF processing on the file handle, since any chunk read from the file might by chance end with an incomplete multi-byte sequence.   (The remainder of the sequence will be read next time, but not until then.)   You do not want that partial sequence to throw an exception; decode each line only after it has been completely extracted.
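One way to arrange this ... my sketch, not the author's code: open the file with Perl's :raw layer so no decoding happens at read time, split on newline bytes (safe for UTF-8, whose multi-byte sequences never contain a 0x0A byte; UTF-16 would need a two-byte delimiter search), and decode each line only once it is complete.   The deliberately tiny 4-byte reads force a chunk boundary to land mid-character, demonstrating that the partial sequence simply waits in the buffer:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);
use File::Temp qw(tempfile);

# Self-contained setup: write UTF-8 bytes to a temporary file.
my ( $fh_w, $path ) = tempfile();
binmode $fh_w;
print {$fh_w} "caf\xC3\xA9\nna\xC3\xAFve\n";   # "café\nnaïve\n" as raw bytes
close $fh_w;

open my $fh, '<:raw', $path or die "open: $!"; # no encoding layer: bytes only
my ( $buffer, @lines ) = ('');
while ( read( $fh, my $piece, 4 ) ) {          # tiny chunks to force a split
    $buffer .= $piece;                         # a partial sequence just waits
    my $pos = 0;
    while ( ( my $nl = index( $buffer, "\n", $pos ) ) != -1 ) {
        # Decode only complete lines; FB_CROAK still flags real corruption.
        push @lines,
            decode( 'UTF-8', substr( $buffer, $pos, $nl - $pos ), FB_CROAK );
        $pos = $nl + 1;
    }
    $buffer = substr( $buffer, $pos );
}
close $fh;
```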

Memory-mapping can of course be used, but it tends to produce a paging-system impact that is noticeably different from that of ordinary virtual-memory operations, and the I/O requests it generates can be treated differently by the O/S scheduler.   In my experience, ordinary file reads work just as well, and the per-chunk read size can remain modest, thanks to the anticipatory read-ahead scheduling that typically accompanies sequential file access.

It should go without saying that you would adopt this technique only when you do not otherwise need to accumulate a significant portion of the file data in memory for other reasons.

Replies are listed 'Best First'.
Re^2: Is foreach split Optimized?
by jdporter (Canon) on Jul 12, 2017 at 09:30 UTC
    Source-code example is left as an exercise to the reader

    Of course it is. The heat death of the universe will occur before we see any working code from you.

    I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
Re^2: Is foreach split Optimized?
by karlgoethebier (Monsignor) on Jul 09, 2017 at 14:53 UTC
    "Source-code example is left as an exercise to the reader."

    OK, but it's untested:

    #!/usr/bin/env perl
    package MrNatural;
    use feature qw(say);
    use Class::Tiny { mantra => qq(Om Anwha Tanas Siam) };

    package Enlightment;
    use base qw(MrNatural);

    use strict;
    use warnings;

    my $meditation = Enlightment->new();

    for ( 1 .. 10000 ) {
        say $meditation->mantra;
        sleep 5;
    }

    __END__

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

Re^2: Is foreach split Optimized?
by Anonymous Monk on Jul 09, 2017 at 13:59 UTC
    >Source-code example is left as an exercise to the reader.
    Stop posting shit. Stop talking shit. You can't do anything. You know Jack Shit.
