
Performance problems on splitting long strings

by Laurent_R (Abbot)
on Jan 30, 2014 at 19:07 UTC ( #1072715=perlquestion )
Laurent_R has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

I have to process two large CSV files (about 6 GB each). Getting at the individual fields is no problem. In each record, one of the fields is a long string of alphanumeric characters, typically 150 to 300 of them (the number of characters is always a multiple of 5). I need to split that string into groups of five characters in order to then reorganize it. As far as I can tell, the split function is not appropriate for that, so I used a regular expression, something like this:

my @sub_fields = $field16 =~ /\w{5}/g; # ...
But the process is very slow, and profiling the program shows that the line above takes far too much time. I intend to do some benchmarking to try to find something faster. Maybe a faster regex can be found (for example, /.{5}/g might be better). I will also try to use the substr function in a loop to see whether that is faster, but I would be very happy if some nice monk could come up with some other idea likely to bring higher performance.

Another idea I had was to use the unpack function, but I do not use it often and I am not sure how to use it to produce an array from variable-length lines. Presumably, the template should be something like "A5A5A5...". Is there any way of saying something like "A5" repeated as many times as possible (until the end of the string)? Or do I have to use a different template for each possible string length?

I was also thinking about opening a filehandle on a reference to the string and using the read function in a loop to populate an array of five-character chunks, but I doubt that opening a filehandle for each record of my input will really improve performance.
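In code, that filehandle idea would look something like this (just a sketch; the sample string is made up for illustration):

```perl
use strict;
use warnings;

# Open a filehandle on a reference to the string and read
# five characters at a time into an array of chunks.
my $field = 'ABCDEFGHIJKLMNO';    # length is a multiple of 5
my @chunks;
open my $fh, '<', \$field or die "open: $!";
while ( read $fh, my $chunk, 5 ) {
    push @chunks, $chunk;
}
close $fh;
print "@chunks\n";    # ABCDE FGHIJ KLMNO
```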

Does anyone out there have another idea for improving performance, so that I could include it in my benchmark?

Thank you for your help.

Update: I of course meant to say A5 for the unpack template, not A4 as I originally typed by mistake. Thanks to those who pointed out this typo. I corrected it above to be more consistent with the text.

Replies are listed 'Best First'.
Re: Performance problems on splitting long strings
by Kenosis (Priest) on Jan 30, 2014 at 20:02 UTC

    Just fyi:

    use strict;
    use warnings;
    use Tie::CharArray;
    use Benchmark qw/cmpthese/;

    my $string = join '', 'A' .. 'Y';

    sub _unpack {
        my @arr = unpack '(A5)*', $string;
    }

    sub _regex {
        my @arr = $string =~ /.{5}/g;
    }

    sub _split {
        my @arr = split /.{5}\K/, $string;
    }

    sub _substr {
        my @arr;
        for ( my $i = 0 ; $i < length $string ; $i += 5 ) {
            push @arr, substr $string, $i, 5;
        }
    }

    sub _open {
        my @arr;
        open my $sh, '<', \$string;
        while ( read $sh, my $chars, 5 ) {
            push @arr, $chars;
        }
    }

    cmpthese(
        -5,
        {
            _unpack => sub { _unpack() },
            _regex  => sub { _regex() },
            _split  => sub { _split() },
            _substr => sub { _substr() },
            _open   => sub { _open() }
        }
    );


                Rate   _open  _regex _substr  _split _unpack
    _open   265986/s      --    -53%    -55%    -57%    -70%
    _regex  563780/s    112%      --     -5%     -8%    -36%
    _substr 593788/s    123%      5%      --     -3%    -33%
    _split  612001/s    130%      9%      3%      --    -31%
    _unpack 881949/s    232%     56%     49%     44%      --

      Borrowing heavily from Kenosis' code (thanks), regex seems to be faster than unpack (at least using substitution):

                   Rate _substr _unpack  _regex  _split
      _substr 2187335/s      --    -11%    -16%    -20%
      _unpack 2457294/s     12%      --     -6%    -10%
      _regex  2612321/s     19%      6%      --     -4%
      _split  2726283/s     25%     11%      4%      --

      Perl code:

      use strict;
      use warnings;
      use Benchmark qw/cmpthese/;

      my $string = join '', 'A' .. 'Y';

      sub _unpack {
          my @arr = unpack '(A5)*', $string;
      }

      sub _regex {
          my @arr;
          while ( length $string ) {
              $string =~ s/^(.{5})//;
              push @arr, $1;
          }
      }

      sub _split {
          my @arr = split /.{5}\K/, $string;
      }

      sub _substr {
          my @arr;
          for ( my $i = 0 ; $i < length $string ; $i += 5 ) {
              push @arr, substr $string, $i, 5;
          }
      }

      cmpthese(
          -5,
          {
              _unpack => sub { _unpack() },
              _split  => sub { _split() },
              _substr => sub { _substr() },
              _regex  => sub { _regex() }
          }
      );

        Your benchmark is totally broken.

        When your _regex() function runs the first time, it completely destroys $string; every time after that, the regex operates on an empty string and thus runs very quickly. The same goes for every other test that runs after the first run of _regex().
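        A non-destructive variant (a sketch, not from the thread) chews through a copy instead, so every call sees the full input:

```perl
use strict;
use warnings;

my $string = join '', 'A' .. 'Y';    # 25 characters, as in the benchmark

# Work on a copy so $string stays intact and repeated benchmark
# iterations all do the same amount of work.
sub _regex_copy {
    my $copy = $string;
    my @arr;
    while ( $copy =~ s/^(.{5})// ) {
        push @arr, $1;
    }
    return @arr;
}

my @chunks = _regex_copy();
print "@chunks\n";              # ABCDE FGHIJ KLMNO PQRST UVWXY
print length($string), "\n";    # 25 -- $string is untouched
```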

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Performance problems on splitting long strings
by Not_a_Number (Parson) on Jan 30, 2014 at 19:26 UTC
    Is there any way of saying something like: "A4" repeated as many times as possible (until the end of the string?


    my @array = unpack '(A4)*', $string;
      Thank you very much, I had overlooked this option of unpack. Your answer is really useful for me.

        Just FYI: You refer in several places to wanting groups of five characters, but  'A4' or  '(A4)*' will give you a group or groups of four characters, trailing-space padded if necessary.
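        A quick illustration of the difference (the example string is made up):

```perl
use strict;
use warnings;

# '(A4)*' vs '(A5)*' on a 10-character string
my $s = 'ABCDEFGHIJ';
my @four = unpack '(A4)*', $s;
my @five = unpack '(A5)*', $s;
print "@four\n";    # ABCD EFGH IJ  -- groups of four, plus a leftover
print "@five\n";    # ABCDE FGHIJ   -- groups of five
```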

Re: Performance problems on splitting long strings
by davido (Archbishop) on Jan 31, 2014 at 00:45 UTC

    Why does the long string have to be split into individual containers? Can you work with the large string, just in segments at offset multiples of five?

    unpack is already lightning fast, but even less work is done if you can avoid making copies. Of course this is Perl, and sometimes the more idiomatic approaches (unpack, for example) can be faster than algorithmic approaches (walking through the existing string without making copies), but if unpack is still too slow, and if what you are using those little substrings for could be done in place, it might be worth attempting a solution that works on the original string.


      An example of davido's in-place string modification combined with boftx's dispatch table handler:

      >perl -wMstrict -le
      "use constant LEN => 5;
      ;;
      my $s = '1234598765555553456733333';
      print qq{'$s'};
      ;;
      my %dispatch = (
          '55555' => sub { return 'x' x length $_[0]; },
          );
      ;;
      for (my $offset = 0; $offset < length $s; $offset += LEN) {
          for (substr $s, $offset, LEN) {
              $_ = exists $dispatch{$_} ? $dispatch{$_}->($_) : $_ + 2;
          }
      }
      print qq{'$s'};
      "
      '1234598765555553456733333'
      '1234798767xxxxx3456933335'

      Update: Changed example code to also exemplify topicalization of sub-string segment via for-structure (given no longer being quite kosher).

      Hi Dave, thank you for your input. The truth of the matter is that I am the victim of earlier poor design. The long string I am talking about is a list of billing services for a client. All billing service codes are five characters long. What I need to do with this list (with a bit of simplification) is to filter in those services that are of use for our purpose, and to sort them. The ultimate goal is to compare the two large files after having pre-processed them.

        Sounds like there is going to be a dispatch table in there somewhere to handle what to do with the different billing codes that are of interest. :)

        It helps to remember that the primary goal is to drain the swamp even when you are hip-deep in alligators.
Re: Performance problems on splitting long strings
by Laurent_R (Abbot) on Jan 31, 2014 at 18:33 UTC

    The following is the post I prepared yesterday evening (finished after 2 a.m. this morning, in fact), but I guess I was too tired: I previewed it but stupidly forgot to hit the create button.

    OK, now I ran a benchmark with the various possibilities. The results might be useful to others.

    I tried 8 different solutions: a "C_style" solution (converting the string into an array of individual characters), two regexes (one with /\w{5}/ and one with /.{5}/), one opening a filehandle on a reference to the string, two variations on a loop with the substr function, the split solution offered by Kenosis (although I probably would not be able to use it with the old version of Perl that we have on our servers, I could test it at home on my more recent version) and unpack.

    The following is the code (borrowed in part from Kenosis):
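    (The code itself did not survive the copy into this post; the following is a plausible reconstruction from the description above. The sub names follow the results table, but the bodies are assumptions, partly modeled on Kenosis's version.)

```perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;

my $string = join '', 'A' .. 'Y';    # 25 characters

# "C-style": explode into individual characters, rebuild groups of five
sub c_style {
    my @chars = split //, $string;
    my @arr;
    push @arr, join '', splice @chars, 0, 5 while @chars;
    return @arr;
}

sub regex1 { return $string =~ /\w{5}/g }
sub regex2 { return $string =~ /.{5}/g }

# Filehandle opened on a reference to the string, read 5 chars at a time
sub fh_read {
    my @arr;
    open my $fh, '<', \$string or die "open: $!";
    while ( read $fh, my $chunk, 5 ) { push @arr, $chunk }
    return @arr;
}

sub split5 { return split /.{5}\K/, $string }    # needs Perl 5.10+

sub substr1 {
    my @arr;
    for ( my $i = 0 ; $i < length $string ; $i += 5 ) {
        push @arr, substr $string, $i, 5;
    }
    return @arr;
}

sub substr2 {
    return map { substr $string, 5 * $_, 5 } 0 .. length($string) / 5 - 1;
}

sub unpack5 { return unpack '(A5)*', $string }

cmpthese( -1, {
    C_Style => \&c_style,
    regex1  => \&regex1,
    regex2  => \&regex2,
    FH      => \&fh_read,
    split   => \&split5,
    substr1 => \&substr1,
    substr2 => \&substr2,
    unpack  => \&unpack5,
} );
```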

    And these are the results:
                Rate C_Style  regex1  regex2      FH   split substr1 substr2  unpack
    C_Style  5598/s      --    -72%    -72%    -78%    -79%    -80%    -80%    -83%
    regex1  19690/s    252%      --     -1%    -24%    -28%    -28%    -31%    -39%
    regex2  19968/s    257%      1%      --    -23%    -26%    -27%    -30%    -38%
    FH      25984/s    364%     32%     30%      --     -4%     -5%     -9%    -20%
    split   27161/s    385%     38%     36%      5%      --     -1%     -5%    -16%
    substr1 27355/s    389%     39%     37%      5%      1%      --     -4%    -16%
    substr2 28523/s    409%     45%     43%     10%      5%      4%      --    -12%
    unpack  32428/s    479%     65%     62%     25%     19%     19%     14%      --
    So unpack clearly wins the race, but I was surprised to see that substr is not that far behind.

    Update this evening (Jan 31, 2014 at 18:45): I incorporated the unpack solution in my program at work today, and the speed gain I obtained on my real data is significantly better than what could be derived from the figures in the benchmark above. Profiling shows that the modified code line runs, surprisingly, almost twice as fast as the original one.

      You can simplify the calls to the anonymous subroutines.

      cmpthese( -1, {
          regex1  => $regex1,
          regex2  => $regex2,
          unpack  => $unpack,
          split   => $split,
          substr1 => $substr1,
          substr2 => $substr2,
          FH      => $filehandle,
          C_Style => $c_style_string,
      } );

      IMHO, the superior performance of unpack() is perfectly predictable. This is what it's for.


        You can simplify the calls to the anonymous subroutines. ...

        Thank you for your comment, Jim.

        And, yes, I wanted to write something like that, and that is the reason why I built references to anonymous subs in the first place, rather than simple named functions. But for some reason I got the syntax wrong when calling the subs in cmpthese. I am not sure I remember exactly, but I think I first did something like:

        regex1 => $regex1->(),
        which gave compile errors. The first quick way I found to make it work was to wrap the function call in a sub block like this:
        regex1 => sub {$regex1->()},
        I realize that this is not the most elegant construct, but once it worked I was happy enough to get my results, and I was too tired, at around 2 a.m., to spend more time investigating how to simplify the calls.

        And yes, I was sort of expecting unpack() to be faster, but it is still better to try it to be sure.

Re: Performance problems on splitting long strings
by hdb (Prior) on Jan 30, 2014 at 19:22 UTC

    What have you tried? A good point to start is to look at the documentation of Benchmark.

      Thank you. I know how to use the Benchmark module, no problem with that. What I am looking for is other ideas on how to split my data more efficiently, so that I can benchmark those ideas.

        Why then did you not bother to write a few lines like:

        use strict;
        use warnings;
        use Benchmark 'cmpthese';

        my $string = join '', map { ('a'..'z')[rand 26] } 1..30;
        my @sub_fields;

        cmpthese( -1, {
            regex1 => sub { @sub_fields = $string =~ /\w{5}/g },
            regex2 => sub { @sub_fields = $string =~ /.{5}/g },
            unpack => sub { @sub_fields = unpack '(A4)*', $string },
            substr => sub {
                @sub_fields =
                    map { substr $string, 5*$_, 5 } 0..length( $string )/5-1
            },
        });

        that already shows that the regex idea is vastly inferior:

                     Rate  substr  unpack  regex1  regex2
        substr   696486/s      --    -57%    -94%    -94%
        unpack  1603093/s    130%      --    -85%    -86%
        regex1 10731041/s   1441%    569%      --     -4%
        regex2 11165392/s   1503%    596%      4%      --
Re: Performance problems on splitting long strings
by boftx (Deacon) on Jan 30, 2014 at 19:49 UTC

    You would have to benchmark this to be sure, but you might try splitting on '' to get a char array, then shift or splice off multiples of 5 chars at a time. (Though I suspect unpack will be faster.)
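    That suggestion, sketched out (the sample string is made up for illustration):

```perl
use strict;
use warnings;

# Split into a char array, then splice off five characters at a time.
my $string = 'ABCDEFGHIJKLMNO';
my @chars  = split //, $string;
my @groups;
push @groups, join '', splice @chars, 0, 5 while @chars;
print "@groups\n";    # ABCDE FGHIJ KLMNO
```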

    It helps to remember that the primary goal is to drain the swamp even when you are hip-deep in alligators.
Re: Performance problems on splitting long strings
by Laurent_R (Abbot) on Jan 31, 2014 at 07:45 UTC
    I made the benchmark yesterday evening (well, finished it around 2 a.m. this morning) and thought I had posted it. It seems I must have previewed it but omitted to create the post. As I am no longer at home, posting again will have to wait until this evening. To make things short, unpack was the fastest option among those I tested.

Node Type: perlquestion [id://1072715]
Front-paged by Corion