Re: Is foreach split Optimized? by sundialsvc4 (Abbot)
on Jul 09, 2017 at 11:12 UTC
To minimize actual handling of memory, which can be punishingly slow if there is any sort of contention for memory (e.g. on a production versus a development machine, and where the strings in question might be large), I normally use the index function with its third (starting position) argument. Simply look for the next occurrence of "\n" (newline), initially starting at position zero. Copy that substring out of the string into a new variable, without altering the string, and update the position variable so that the next loop iteration begins its search just past that newline rather than at the start of the string. Rinse and repeat. The original string remains unchanged, and we simply extract a copy of the next piece of it, one piece at a time, using a while loop.
Source-code example is left as an exercise to the reader. (It isn’t difficult.) Windows and Linux files use different line-ending sequences ("\r\n" versus "\n"), but the approach is the same.
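For readers who would rather see it spelled out, here is a minimal sketch of the loop described above. The helper name each_line and its callback interface are my own assumptions for illustration, not anything from the original thread; the only built-ins it relies on are index and substr.

```perl
use strict;
use warnings;

# Walk through a string one line at a time using index() with a start position.
# The original string is never modified; each line is copied out with substr().
# each_line is a hypothetical helper; it takes a reference to avoid copying the string.
sub each_line {
    my ($text_ref, $callback) = @_;
    my $pos = 0;
    my $len = length $$text_ref;
    while ($pos < $len) {
        my $nl = index($$text_ref, "\n", $pos);
        last if $nl < 0;              # no more complete lines
        $callback->(substr($$text_ref, $pos, $nl - $pos));
        $pos = $nl + 1;               # resume just past the newline
    }
    return $pos;                      # offset of any trailing, unterminated text
}

my $text = "alpha\nbeta\n\ngamma";
my @lines;
my $rest = each_line(\$text, sub { push @lines, $_[0] });
# @lines now holds "alpha", "beta" and an empty string;
# substr($text, $rest) is the unterminated tail "gamma".
```

For "\r\n" files, one option is to strip a trailing "\r" from each extracted line (for example with s/\r\z//) before using it.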
If the strings originate from a file, I also normally do not “slurp” the entire file, nor do I memory-map it. (But see below.) I just read it, say, 100K bytes at a time. The only implication of this technique is that the final string extracted from each chunk is probably incomplete (and, with UTF data, may end mid-character: see below), but this case is easy to identify because index fails to find the newline. A new buffer string containing the remaining characters is created to replace the old one (this is usually faster than modifying the old one in place, if it is large). Then another chunk is read from the file and appended to it, and the rinse-and-repeat continues. No matter how big the file might be, the memory footprint of the application will remain small and predictable. Hence, performance probably will too. (“If the file is, say, ten times as large, it probably will take about ten times as long.”)
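The chunked read could be sketched roughly as follows. The function name process_file_lines, its callback argument, and the default chunk size are assumptions made for this example; the four-argument form of read (with an offset) is what appends each new chunk after the leftover tail.

```perl
use strict;
use warnings;

# Read a file in fixed-size chunks and hand each complete line to a callback.
# process_file_lines and its parameters are hypothetical names for illustration.
sub process_file_lines {
    my ($path, $callback, $chunk_size) = @_;
    $chunk_size ||= 100 * 1024;       # roughly "100K bytes at a time"

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $buf = '';
    while (1) {
        # Append the next chunk after whatever was left over last time.
        my $got = read($fh, $buf, $chunk_size, length $buf);
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;            # end of file

        my $pos = 0;
        while ((my $nl = index($buf, "\n", $pos)) >= 0) {
            $callback->(substr($buf, $pos, $nl - $pos));
            $pos = $nl + 1;
        }
        # Replace the buffer with a new string holding only the incomplete tail.
        $buf = substr($buf, $pos);
    }
    close $fh;
    $callback->($buf) if length $buf; # final line without a trailing newline
    return;
}
```

A call might look like process_file_lines('big.log', sub { print length($_[0]), "\n" }), where 'big.log' is a placeholder path. Only one chunk plus one partial line is held in memory at any time, provided no single line is enormous.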
It is worth mentioning here that this deliberately memory-conservative approach is for me a “lesson learned” from dealing with scripts in various languages that ran just fine on the developer’s of-course-beefy hardware, but which caused time-robbing thrashing to occur on busy production boxes. This is less of a concern today than it used to be, but still a concern. (If a computer system does begin to thrash, completion times degrade far worse than linearly. The effects are catastrophic not only for the program that caused it but for the entire system.)
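If you are splitting a file that might contain UTF-8 or UTF-16 characters, you should turn off UTF processing on the filehandle (that is, read raw bytes), since any read of a chunk of data from the file might by chance end with an incomplete sequence. (The remainder of the sequence will be read next time, but not until then.) You do not want this to throw an exception. Decode each line only after it has been extracted whole. For UTF-8 a byte-level search for "\n" is safe, because that byte never occurs inside a multi-byte sequence; for UTF-16 the newline is a two-byte unit, so the search has to look for that unit on an even byte offset rather than for a single "\n" byte.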
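One way to arrange this for UTF-8, sketched under the assumption that the data has been read as raw bytes (for example through a filehandle opened with the :raw layer), is to split on the newline byte first and decode afterwards. The helper name decode_lines is hypothetical; decode and FB_CROAK come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Split raw bytes on "\n" first, then decode each complete line.
# The byte 0x0A never occurs inside a multi-byte UTF-8 sequence, so a
# byte-level search for "\n" cannot cut a UTF-8 character in half.
# decode_lines is a hypothetical helper taking raw, undecoded bytes.
sub decode_lines {
    my ($bytes) = @_;
    my @lines;
    my $pos = 0;
    while ((my $nl = index($bytes, "\n", $pos)) >= 0) {
        my $raw = substr($bytes, $pos, $nl - $pos);
        # FB_CROAK makes genuinely malformed input die instead of passing silently.
        push @lines, decode('UTF-8', $raw, Encode::FB_CROAK);
        $pos = $nl + 1;
    }
    # Return the decoded lines plus the still-undecoded tail, which may end
    # in the middle of a character until the next chunk arrives.
    return (\@lines, substr($bytes, $pos));
}
```

The undecoded tail is what gets carried over and prepended to the next chunk, so an incomplete sequence at a chunk boundary is never handed to the decoder.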
Memory-mapping can of course be used, but it tends to produce a paging-system impact that is noticeably different from that of ordinary virtual-memory operations, and the I/O requests for this purpose can be treated differently by the O/S scheduler. In my experience, ordinary file reads work just as well, and the number of bytes read in each chunk can be modest because of the anticipatory sequential read-ahead scheduling that typically occurs.
It should go without saying that you would adopt this technique only when you do not otherwise need to accumulate a significant portion of the file data in memory for other reasons.