Splitting a string to chunks

by spurperl (Priest) on Nov 29, 2006 at 13:33 UTC
spurperl has asked for the wisdom of the Perl Monks concerning the following question:

Something that comes up fairly often is the need to split a string into equal-sized chunks. For instance, given the string "abcdefgh12345678", splitting it into 4-char chunks would produce ("abcd", "efgh", "1234", "5678"). Looking around the monastery, I found at least a couple of posts on the subject.

I tried to time some different techniques against each other:

```
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $str = "abcdefgh12345678" x 20;
my $strlen = length $str;

cmpthese(50000, {

    'grep_split' => sub
    {
        # split returns the captured chunks plus empty fields; grep drops the empties
        my @arr = grep {$_} split /(.{8})/, $str;
    },

    'split_pos' => sub
    {
        # split only at positions that are a multiple of 8
        my @arr = split /(?(?{pos() % 8})(?!))/, $str;
    },

    'substr_map' => sub
    {
        my $len = length $str;
        my @arr = map {substr($str, $_ * 8, 8)} (0 .. $strlen / 8 - 1);
    },

    'substr_loop' => sub
    {
        my @arr;
        my $len = length $str;
        for (my $i = 0; $i < $len; $i += 8)
        {
            push(@arr, substr($str, $i, 8));
        }
    },

    'unpack' => sub
    {
        # the (A8)* group repeats until the string is exhausted
        my @arr = unpack('(A8)*', $str);
    }
});
```

And the results are quite surprising:

```
               Rate
split_pos    3203/s
grep_split   6425/s
substr_map   8889/s
unpack      11348/s
substr_loop 15097/s
```

Contrary to what I expected (I assumed that built-in functions would be faster than loops), the looping solution is the swiftest. It beats unpack by a margin ranging from 15 to 50 percent, depending on the length of the string and of the chunks.

Any way to make it faster ?

Re: Splitting a string to chunks
by Limbic~Region (Chancellor) on Nov 29, 2006 at 13:50 UTC
spurperl,
I was surprised to see that unpack wasn't the fastest, so I changed it just a bit.
```
my @arr = unpack('A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8A8', $str);
```
Not only is that compatible with older perls, it is now the fastest. I might play a bit more to see if I can get an even faster version, but to be fair, that really should have been:
```
# No longer wins but is still faster than '(A8)*'
my @arr = unpack((join '', ('A8' x ($strlen / 8))), $str);
```

Update: I wanted to see what would happen if the benchmark focused more on the functions themselves by removing some of the intermediate calculations. Note also that I changed x 20 to x 200.

Cheers - L~R

Re: Splitting a string to chunks
by duff (Parson) on Nov 29, 2006 at 13:52 UTC

On my system, I get a different result:

```
               Rate   split_pos  grep_split  substr_map substr_loop      unpack
split_pos    4596/s          --        -53%        -71%        -79%        -82%
grep_split   9843/s        114%          --        -37%        -54%        -61%
substr_map  15674/s        241%         59%          --        -27%        -38%
substr_loop 21459/s        367%        118%         37%          --        -15%
unpack      25381/s        452%        158%         62%         18%          --
```
Your performance characteristics depend on all sorts of things relating to your CPU, its cache, bus speed, memory, etc.

But as far as ways to make it faster, you might want to use an idiomatic for loop instead of the C-style loop.
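
As a rough sketch of what a foreach-style loop could look like here (the helper name `chunks_foreach` and its `$size` parameter are made up for illustration, and whether it actually beats the C-style loop would have to be benchmarked on your own machine):

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: chunk a string with a
# foreach-style loop over chunk indexes instead of a C-style for loop.
sub chunks_foreach {
    my ($str, $size) = @_;

    # Number of chunks, rounded up so a short final chunk is included.
    my $count = int((length($str) + $size - 1) / $size);

    my @arr;
    for my $k (0 .. $count - 1) {
        push @arr, substr($str, $k * $size, $size);
    }
    return @arr;
}

my @arr = chunks_foreach("abcdefgh12345678", 4);
print "@arr\n";    # prints "abcd efgh 1234 5678"
```

The range `0 .. $count - 1` is empty for an empty string, so no special case is needed there.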

Re: Splitting a string to chunks
by Fengor (Pilgrim) on Nov 29, 2006 at 13:48 UTC
```
'regex' => sub
{
    my @arr = $str =~ /(........)/g;
}
```
Didn't time it, though.

--
"WHAT CAN THE HARVEST HOPE FOR IF NOT THE CARE OF THE REAPER MAN"
-- Terry Pratchett, "Reaper Man"

Hi,

I added another version that also splits strings whose last chunk is shorter than eight characters. I also added /o to improve performance (it can be used if you have several lines to split).

```
'regex' => sub
{
    my @arr = $string =~ /(........)/g;
},

'regexo' => sub
{
    my @arr = $string =~ /(.{1,8})/og;
},
```
The results:
```
                 Rate split_pos  split grep_split substr_map substr_loop unpack  regex regexo
split_pos      7295/s        --   -57%       -60%       -68%        -77%   -78%  -100%  -100%
split         16900/s      132%     --        -7%       -26%        -47%   -50%  -100%  -100%
grep_split    18241/s      150%     8%         --       -20%        -43%   -46%  -100%  -100%
substr_map    22883/s      214%    35%        25%         --        -29%   -32%   -99%  -100%
substr_loop   32139/s      341%    90%        76%        40%          --    -4%   -99%   -99%
unpack        33495/s      359%    98%        84%        46%          4%     --   -99%   -99%
regex       4342185/s    59421% 25593%     23705%     18876%      13411% 12864%     --    -6%
regexo      4596612/s    62909% 27098%     25099%     19988%      14202% 13623%     6%     --
```
Umm, you copied my typo. I accidentally used $string instead of $str when I first posted, which explains the high rates for the regex solutions. Here are the timings with the typo corrected:
```
               Rate split_pos grep_split substr_map regexo regex substr_loop unpack
split_pos    5587/s        --       -65%       -69%   -76%  -77%        -79%   -81%
grep_split  15974/s      186%         --       -12%   -32%  -34%        -40%   -45%
substr_map  18051/s      223%        13%         --   -23%  -26%        -32%   -38%
regexo      23474/s      320%        47%        30%     --   -3%        -12%   -20%
regex       24272/s      334%        52%        34%     3%    --         -9%   -17%
substr_loop 26596/s      376%        66%        47%    13%   10%          --    -9%
unpack      29240/s      423%        83%        62%    25%   20%         10%     --
```


themage,
Your benchmark disagrees with mine (with x 20 and x 200). Additionally, I think you should re-read perlre with regard to what /o does.

I am sure diotalevi will improve upon my explanation, but in a nutshell, /o is an old optimization predating qr//. If you needed to interpolate a variable inside a regex such as /$regex/ but knew that $regex would never change, the flag would tell perl to compile the regex only once. In fact, if you broke your promise and changed $regex, it would still not be recompiled, leading to buggy code. Then qr// came along and improved things greatly (see /o is dead, long live qr//!).

Since your regex does not interpolate a variable, the /o has no effect.

See also this regarding how current perls optimize regex compiling. Unfortunately, I couldn't find this in any perldelta from 5.6.1 to 5.9.4, which makes me suspicious, so I posted Questions concerning /o regex modifier.
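
For illustration only, here is a minimal sketch of the qr// approach applied to the chunking regex. The variable names `$size` and `$chunk_re` are assumptions made for this example and do not come from anyone's benchmark:

```perl
use strict;
use warnings;

my $size     = 8;                    # assumed chunk size, as in the benchmarks
my $chunk_re = qr/(.{1,$size})/;     # the pattern is compiled here, once

# Reuse the precompiled pattern for every line; no /o needed.
for my $line ("abcdefgh12345678", "short", "abcdefghij") {
    my @arr = $line =~ /$chunk_re/g;
    print join('|', @arr), "\n";
}
```

Unlike /o, a qr// object can simply be rebuilt if `$size` ever needs to change.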

Cheers - L~R

Without having run your benchmark, the huge disparity between your solutions and the others makes me very suspicious that your code is not producing the same results as the others. Have you checked?
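
One way to check is to compare every candidate's output against a reference before timing anything. This is only a sketch: the `%candidates` hash and the choice of substr_loop as the reference are assumptions for the example, and only three of the thread's candidates are included.

```perl
use strict;
use warnings;

my $str = "abcdefgh12345678" x 20;

# A few of the candidates from the thread, each returning its list of chunks.
my %candidates = (
    substr_loop => sub {
        my @arr;
        for (my $i = 0; $i < length($str); $i += 8) {
            push @arr, substr($str, $i, 8);
        }
        return @arr;
    },
    unpack => sub { return unpack('(A8)*', $str) },
    regex  => sub { return $str =~ /(.{1,8})/g },
);

# Join with "\0" so that differing chunk boundaries cannot compare equal.
my $expected = join "\0", $candidates{substr_loop}->();

for my $name (sort keys %candidates) {
    my $got = join "\0", $candidates{$name}->();
    print "$name: ", ($got eq $expected ? "ok" : "MISMATCH"), "\n";
}
```

A candidate that reads from an undeclared or empty variable would show up here as a MISMATCH; running under `use strict` would also flag it at compile time.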

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Splitting a string to chunks
by Not_a_Number (Prior) on Nov 29, 2006 at 15:15 UTC
Any way to make it faster ?

On my machine, this is slightly faster still:

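```
'substr_loop2' => sub
{
    my @arr;
    my $s = $str;
    # 4-arg substr removes and returns the first 8 chars of the copy;
    # test length so that a remaining "0" is not treated as false
    push @arr, substr $s, 0, 8, '' while length $s;
},
```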

More seriously, though, not all the subs in your OP are equivalent: 'substr_map' will truncate any string at a multiple of eight characters, while the others will include the extra characters in the final element of the array (Fengor's my @arr = $str =~ /(........)/g has the same problem).
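
A small sketch that makes the difference visible, using a 19-character input chosen for this example (the variable names are illustrative):

```perl
use strict;
use warnings;

my $str = "abcdefgh12345678xyz";    # 19 characters, not a multiple of 8

# substr_map style: only full chunks are indexed.
my @by_map = map { substr($str, $_ * 8, 8) } 0 .. int(length($str) / 8) - 1;

# Fixed-width regex: a trailing partial chunk cannot match.
my @by_regex = $str =~ /(........)/g;

# unpack keeps the short remainder as a final element.
my @by_unpack = unpack('(A8)*', $str);

printf "map: %d, regex: %d, unpack: %d chunks\n",
    scalar @by_map, scalar @by_regex, scalar @by_unpack;
print "last unpack chunk: $by_unpack[-1]\n";
```

Note also that the `A` template strips trailing spaces and nulls from each chunk, so on data with trailing whitespace unpack('(A8)*', ...) would differ from the substr versions in yet another way; `a8` leaves the data untouched.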

Thanks for pointing that out. What about:
```
'regexpad' => sub
{
    my $padding = 8 - length($str%8) if length($str%8); # has to be 8 - modulo, not modulo, thx johngg

    # dividing the string in parts of 8 chars
    my @arr = $str =~ /(........)/g;

}
```
although it's a bit slower than the other two regex solutions:
```
               Rate split_pos grep_split substr_map regexpad regexo regex substr_loop unpack
split_pos    5841/s        --       -64%       -70%     -75%   -76%  -76%        -79%   -80%
grep_split  16129/s      176%         --       -17%     -30%   -33%  -34%        -42%   -45%
substr_map  19531/s      234%        21%         --     -15%   -19%  -20%        -30%   -33%
regexpad    22936/s      293%        42%        17%       --    -5%   -6%        -17%   -22%
regexo      24038/s      312%        49%        23%       5%     --   -1%        -13%   -18%
regex       24272/s      316%        50%        24%       6%     1%    --        -13%   -17%
substr_loop 27778/s      376%        72%        42%      21%    16%   14%          --    -5%
unpack      29240/s      401%        81%        50%      27%    22%   20%          5%     --
```


I think your "padding" algorithm might be a bit wonky. Given a string of length, say, 19 characters, you would arrive at a $padding value of 3, thus padding your $str with three "x"s to end up with a length of 22, not 24 as I think you wanted. This should work (not tested):

```
my $padding = 8 - ($str % 8);
```

The remove-padding part would be something like (again, not tested):

```
substr $arr[-1], -$padding, $padding, q{} if $padding;
```

Cheers,

JohnGG

Update: I must have been half-asleep; where's the length call? The line should be

```
my $padding = 8 - (length($str) % 8);
```

You can't do modulo on a string :)

```
$ perl -le '$str = q{abc}; $pad = $str % 8; print $pad;'
0
$ perl -le '$str = q{abcdefghijkl}; $pad = $str % 8; print $pad;'
0
$
```
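
Putting the pieces together, a pad/split/strip round trip might look roughly like this. It is a sketch, not Fengor's actual sub: the helper name `chunks_padded`, the `$size` parameter and the "x" pad character are assumptions, and an extra `% $size` is added so that a string whose length is already a multiple of the chunk size gets no padding at all.

```perl
use strict;
use warnings;

# Hypothetical helper: pad to a multiple of $size, split into fixed-width
# chunks, then strip the padding from the last chunk again.
sub chunks_padded {
    my ($str, $size) = @_;

    # The outer "% $size" yields 0 (not $size) when the length already fits.
    my $padding = ($size - length($str) % $size) % $size;

    my $padded = $str . ("x" x $padding);
    my @arr    = $padded =~ /(.{$size})/gs;

    # 4-arg substr removes the pad characters from the final chunk in place.
    substr($arr[-1], -$padding, $padding, q{}) if $padding;
    return @arr;
}

my @arr = chunks_padded("abcdefgh12345678xyz", 8);
print join('|', @arr), "\n";    # prints "abcdefgh|12345678|xyz"
```

For the 19-character input the padding works out to 5, giving a padded length of 24.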

Node Type: perlquestion [id://586695]
Approved by Limbic~Region