pedrete has asked for the wisdom of the Perl Monks concerning the following question:
HI monks..
I have searched for posts but not found anything...
Is there a faster way of splitting a string into an array based on carriage returns?
Currrently i am using
my @array = split (/\n/,$HugeString);
Thanks....
Re: Fastest split possible
by davido (Cardinal) on Nov 06, 2019 at 20:07 UTC
|
Where did the string come from in the first place?
If the carriage returns are of a predictable format (always \n, or always \r\n) you can use index to find the next one. This will add code complexity, and pull more work into your code, and less out of Perl's underlying C implementation. But it will avoid invoking the regex engine. Which one wins would be up to how you code it, and also really up to that tradeoff of doing more work in Perl versus doing more work in perl.
On the other hand, if you're reading this big string in as a file, your record separator $/ and <$fh> should handle it.
That leads to one additional approach:
open my $memfh, '<', \$string;
chomp(my @array = <$memfh>);
Again not sure that will be faster, but an in-memory filehandle to a scalar is a nice little trick. It definitely puts more of the work into perl and less into Perl.
Update: I've since done my own benchmarks. The ones shown elsewhere in this thread are indicative of the results I was getting. The gist is that it's pretty hard to beat split. Keep in mind though, that if you're splitting a huge string into an array it's possible you're blowing up memory, and this would happen whether you use split, or some other solution. Back to my solution of opening an in-memory handle; it seems to be slower than split considerably. It my be neat using the technique to get an easy iterator for a string, but it's not going to be faster than split.
| [reply] [d/l] [select] |
|
c:\@Work\Perl\monks>perl -wMstrict -le
"my $consonants = qr{ [^aeiouAEIOU]+ }xms;
;;
my $s = qq{four score\nand seven\nyears ago\nour fathers\n};
print qq{>$s<};
;;
open my $fh, '<', \$s or die qq{opening in-memory: $!};
;;
process() while <$fh>;
;;
sub process {
chomp;
s{ ($consonants) }{\U$1}xmsg;
print qq{>$_<};
}
"
>four score
and seven
years ago
our fathers
<
>FouR SCoRe<
>aND SeVeN<
>YeaRS aGo<
>ouR FaTHeRS<
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Fastest split possible
by afoken (Chancellor) on Nov 06, 2019 at 20:52 UTC
|
Is splitting at newlines really your bottleneck? Or are you microoptimizing irrelevant parts of your code?
Try Devel::NYTProf, generate a report (run nytprofhtml) and look for code that is highlighted in red, because it is slow or runs very often. That's where you want to optimize for speed. (See also https://www.slideshare.net/Tim.Bunce/nyt-prof-201406key) Everything else should be optimized for maintainance, i.e. keep your code clear, readable, and documented.
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
| [reply] [d/l] |
Re: Fastest split possible
by marioroy (Prior) on Nov 07, 2019 at 00:29 UTC
|
Update: Added regex example.
Hello all :) I tried trimming the left side of the string. Plus incorporated the memfh example by davido. I'm curious too.
Split improves ~ 2x faster for this demonstration with Perl 5.22.x and later releases. That is the case on macOS.
use strict;
use warnings;
use Time::HiRes 'time';
my $huge_string = "aaa bbb\nccc ddd\neee fff\nggg hhh\niii jjj\nkkk ll
+l\nmmm nnn\n";
# concatenate string exponentially to 917,504 lines
$huge_string .= $huge_string for 1..17;
# memfh
{
my $string = $huge_string;
my $start = time;
open my $memfh, '<', \$string;
my @lines = <$memfh>;
close $memfh;
printf "duration memfh: %0.3f seconds\n", time - $start;
printf "%d lines\n\n", scalar(@lines);
}
# regex
{
my $string = $huge_string;
my $start = time;
my @lines;
while ( $string =~ /([^\n]+\n)/mg ) {
my $line = $1; # save $1 to not lose the value
push @lines, $line;
}
printf "duration regex: %0.3f seconds\n", time - $start;
printf "%d lines\n\n", scalar(@lines);
}
# split
{
my $string = $huge_string;
my $start = time;
my @lines = split(/\n/, $string);
printf "duration split: %0.3f seconds\n", time - $start;
printf "%d lines\n\n", scalar(@lines);
}
# trim
{
my $string = $huge_string;
my $start = time;
my @lines;
while ( my $line = substr($string, 0, index($string, "\n") + 1, ''
+) ) {
push @lines, $line;
}
printf "duration trim : %0.3f seconds\n", time - $start;
printf "%d lines\n\n", scalar(@lines);
}
Output - Perl 5.28.2
duration memfh: 0.384 seconds
917504 lines
duration regex: 0.387 seconds
917504 lines
duration split: 0.067 seconds
917504 lines
duration trim : 0.201 seconds
917504 lines
Another machine - Perl 5.26.1
duration memfh: 0.477 seconds
917504 lines
duration regex: 0.445 seconds
917504 lines
duration split: 0.065 seconds
917504 lines
duration trim : 0.259 seconds
917504 lines
Same machine - Perl 5.18.2
duration memfh: 0.530 seconds
917504 lines
duration regex: 0.490 seconds
917504 lines
duration split: 0.130 seconds
917504 lines
duration trim : 0.261 seconds
917504 lines
Regards, Mario | [reply] [d/l] [select] |
|
Update 1: Added walk2 by pushing to stack.
Update 2: The string length now obtained inside C.
Update 3: Changed from for loop to while loop.
Ah, I tried Inline::C. Walk1 runs faster than split in older Perl <= 5.20.x.
...
use Inline 'C' => <<'END_C';
#include <stdlib.h>
#include <string.h>
SV * c_walk1(SV *sv)
{
STRLEN len;
const char *s = SvPV_const(sv, len);
const char *m = s, *strend = s + len;
AV *ret = newAV();
while (s < strend) {
if (*s++ == '\n') {
av_push(ret, newSVpvn(m, s - m));
m = s;
}
}
return newRV_noinc((SV *) ret);
}
void c_walk2(SV *sv)
{
STRLEN len;
const char *s = SvPV_const(sv, len);
const char *m = s, *strend = s + len;
Inline_Stack_Vars;
Inline_Stack_Reset;
while (s < strend) {
if (*s++ == '\n') {
Inline_Stack_Push(sv_2mortal(newSVpvn(m, s - m)));
m = s;
}
}
Inline_Stack_Done;
}
END_C
# walk1 - receive array ref
{
my $string = $huge_string;
my $start = time;
my $lines = c_walk1($string);
printf "duration walk1: %0.3f seconds\n", time - $start;
printf "%d lines\n\n", scalar(@$lines);
}
# walk2 - receive list
{
my $string = $huge_string;
my $start = time;
my @lines = c_walk2($string);
printf "duration walk2: %0.3f seconds\n", time - $start;
printf "%d lines\n\n", scalar(@lines);
}
Output from Perl 5.20.3 and Perl 5.26.1
$ /opt/perl-5.20.3/bin/perl demo.pl
duration memfh: 0.526 seconds
917504 lines
duration regex: 0.448 seconds
917504 lines
duration split: 0.110 seconds
917504 lines
duration trim : 0.268 seconds
917504 lines
duration walk1: 0.073 seconds
917504 lines
duration walk2: 0.110 seconds
917504 lines
$ /opt/perl-5.26.1/bin/perl demo.pl
duration memfh: 0.479 seconds
917504 lines
duration regex: 0.442 seconds
917504 lines
duration split: 0.066 seconds
917504 lines
duration trim : 0.259 seconds
917504 lines
duration walk1: 0.070 seconds
917504 lines
duration walk2: 0.084 seconds
917504 lines
Regards, Mario | [reply] [d/l] [select] |
Re: Fastest split possible
by Discipulus (Canon) on Nov 06, 2019 at 19:49 UTC
|
Hello pedrete,
to see it you can use Benchmark to test different approach, like capturing using regexes with g modifier.
But probably, if you are experienceing slowness with split which is fast, you simply have to avoid to accumulate results into @array
Infact if $HugeString is huge, would be probably better to process each element directly like in:
process($_) for split /\n/,$HugeString;
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
| [reply] [d/l] [select] |
Re: Fastest split possible
by haukex (Archbishop) on Nov 06, 2019 at 20:18 UTC
|
| [reply] |
Re: Fastest split possible (updated x2)
by AnomalousMonk (Archbishop) on Nov 06, 2019 at 20:09 UTC
|
For creating an array of substrings, split is probably fastest.
However, you usually want all those substrings so you can do something else with them. If you can process the string in place, it may be faster to do something like:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $consonants = qr{ [^aeiouAEIOU]+ }xms;
;;
my $s = qq{four score\nand seven\nyears ago\nour fathers\n};
print qq{>$s<};
;;
my $offset = 0;
my $pos;
while (0 <= ($pos = index $s, qq{\n}, $offset)) {
substr($s, $offset, $pos) =~ s{ ($consonants) }{\U$1}xmsg;
$offset = ++$pos;
}
print qq{>$s<};
"
>four score
and seven
years ago
our fathers
<
>FouR SCoRe
aND SeVeN
YeaRS aGo
ouR FaTHeRS
<
Update 1: The original code in my reply doesn't do what I imagined (update: the substr length operand is incorrect). Here's code that works as I intended and has a debug print to illustrate what's happening:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $consonants = qr{ [^aeiouAEIOU]+ }xms;
;;
my $s = qq{four score\nand seven\nyears ago\nour fathers\n};
print qq{>$s<};
;;
my $offset = 0;
my $pos;
while (0 <= ($pos = index $s, qq{\n}, $offset)) {
printf qq{==%s== \n}, substr($s, $offset, $pos-$offset);
substr($s, $offset, $pos-$offset) =~ s{ ($consonants) }{\U$1}xmsg;
$offset = ++$pos;
}
print qq{>$s<};
"
>four score
and seven
years ago
our fathers
<
==four score==
==and seven==
==years ago==
==our fathers==
>FouR SCoRe
aND SeVeN
YeaRS aGo
ouR FaTHeRS
<
Update 2: This trick (update: i.e., substr returns an lvalue) also works if you want to pass a substring alias to a processing function, but the function has to operate directly on the aliased subroutine argument element in order to preserve the aliasing "chain". Also with an illustrative print point:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $consonants = qr{ [^aeiouAEIOU]+ }xms;
;;
my $s = qq{four score\nand seven\nyears ago\nour fathers\n};
print qq{>$s<};
;;
my $offset = 0;
my $pos;
while (0 <= ($pos = index $s, qq{\n}, $offset)) {
process(substr $s, $offset, $pos-$offset);
$offset = ++$pos;
}
print qq{>$s<};
;;
sub process {
printf qq{==%s== \n}, $_[0];
$_[0] =~ s{ ($consonants) }{\U$1}xmsg;
}
"
>four score
and seven
years ago
our fathers
<
==four score==
==and seven==
==years ago==
==our fathers==
>FouR SCoRe
aND SeVeN
YeaRS aGo
ouR FaTHeRS
<
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Fastest split possible
by harangzsolt33 (Chaplain) on Nov 07, 2019 at 03:41 UTC
|
I would probably try this, although I am not sure if it's the fastest. :
#
# This function splits a string into
# an array along new-line \n characters.
# Usage: ARRAY = SplitLine(STRING)
# (Tested with TinyPerl 5.8 running on Windows 7.)
#
sub SplitLine
{
defined $_[0] or return ();
my ($i, $j, $n, @A) = (0, 0, 0);
while (($i = index($_[0], "\n", $i)) >= 0)
{
$A[$n++] = substr($_[0], $j, $i - $j);
$j = ++$i;
}
$A[$n] = substr($_[0], $j, length($_[0]));
return @A;
}
Perl is beautiful. | [reply] [d/l] |
A reply falls below the community's threshold of quality. You may see it by logging in. |
|
|