qiau has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellows, I have a rather short question about "the fastest way to..". I have a string that kind of looks liks this:
$string = "123".\0. "1".\0.\0.\0. "23".\0.\0;
And so on, 4 chars for each value nullpadded. I have a procedure that fetches a value from the string based on a index but since I call that functions >60000 times I need the fastest way to get the value without the extra null-chars. Up to now I've tried two alternatives:
(undef, $value) = unpack("a$index A4", $string);
$value = substr($string, $index, 4); $value =~ s/\0//g;
But neither regexps nor unpack feels like the fastest way to do it. Is there any other way?

Replies are listed 'Best First'.
Re: Null-stripping performance
by BrowserUk (Pope) on Feb 26, 2007 at 13:13 UTC
      Oh, forgot to say that the strings are pretty large so I don't think arrays would be the best solution.

      Each string is about 1 Mbyte that I read in from a file upon request so a standard use is to open about four files (sometimes more, sometimes less) and read values from different indexes in those strings.

      I think I tried with a array in the beginning but realized that splitting into 1 Mbyte arrays wasn't fast at all :).
Re: Null-stripping performance
by Thelonius (Priest) on Feb 26, 2007 at 14:01 UTC
    I'd recommend:
    $value = unpack "x$index Z4", $string;
    $value = substr($string, $index, 4); $value =~ y/\0//d;
      On my PC, The "substr(), y" method takes about 0.6 microseconds, the "substr(), s" method takes about 1 microsecond, the "unpack x$index" method takes about 2 microseconds.

      The "unpack a$index" method takes about 41 microseconds (when averaged over a 262000 byte string), so it's the one to avoid. It has to allocate the large substring it skips over before it throws it away.

        Thanks a lot. I really appreciate your help! I can mention, as a parenthesis, that this code is going to run in sparc solaris-environment. When I tried the script at my PC perl took care of the null's by itself and just the substr worked fine. I guess I have to live with the time it takes to do it "substr(),y"-way. Thanks again
Re: Null-stripping performance
by grinder (Bishop) on Feb 26, 2007 at 17:42 UTC
    But neither regexps nor unpack feels like the fastest way to do it. Is there any other way?

    Without benchmarking, I don't know. But one thing you might want to try is to use the index function to determine how many characters to grab in your substr. This may save on the cost of the regexp engine, which might just be doing a Boyer-Moore search anyway.

    Here are a couple of snippets to try. If you can guarentee that there is always a null character in your four bytes, the problem is pretty trivial: just look for the next one starting at the current index point in the string:

    $value = substr($string, $index, index($string, "\x00", $index) - $index)

    Which literally means "take the substring of the string starting at the index point for n characters, where n characters is the difference between the index and the next NUL.

    If there are values like 9876 (that is, no NULs), then this simple-minded approach won't work, since you may run arbitrarily far down the string before encountering the next NUL, or there may be no more at all, in which case index will return -1, and things will really go askew.

    In this case, you'll have to save the result in a temporary length variable, and clamp it to 4 if it's not in range:

    my $len = index($string, "\x00", $index) - $index; $len = 4 if $len < 0 or $len > 4; $value = substr($string, $index, $len)

    That's probably about the best you can do (in terms of other alternatives). If this approach is worse, or only marginally better, the only remaining card to play would be to code it in C and use Inline::C.

    A bit later: regarding unpack, in my experience, when I'm really strapped for speed it rarely makes the grade with respect to other alternatives. I think the meme that says "unpack is the fastest" needs to be taken with a grain of salt, and my hunch is that it takes a certain amount of time to decode the format string argument. A single substr with offsets determined by index is usually faster (but at the cost of more make-work code). YMM undoubtedly V.

    • another intruder with the mooring in the heart of the Perl

      This looks interesting, I have to take a look at the data tomorrow when I get to work but I do like the ideas here.

      A few benchmarks too, thismight acually be something!

      Thanks a lot for the help!
Re: Null-stripping performance
by dragonchild (Archbishop) on Feb 26, 2007 at 15:50 UTC
    No matter what you do, this kind of function is going to be about 3 lines. It needs a good 30-60 lines of comments explaining what the problem was, why this solution is better, and demonstrating other solutions that didn't work as well. Furthermore, it should probably contain the URL to this thread so that future readers of your code can find the people who made the suggestions (Thelonius++) and make their own conclusions.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      First of all, I can't see what your replay has anything to do with my question.

      You acually don't have any idea of where/why/what/how I'm going to use this code. This feels more like you have a grudge against me personaly.

      I've been programming for about 20 years now and kind of know the value of documentation so I would be glad if you could tell me what I wrote to make you think that I didn't appreciate the help. Or that I had no clue of why I should comment my code.

      I'm not ironic, I would really want to know since I don't want to look stupid while posting questions here (as I do reading your reply). I know my english isn't super but most of the time english grammar affects programming as much as the color of the chair I'm sitting on, none..

      Best regards, Qiau
        99% of all questions on Perlmonks are from novice and intermediate programmers. As such, the concept of commenting difficult sections of code is alien to them. As this question deals with a counter-intuitive optimization, mentioning that commenting it well is a reply well-served for most people who ask questions here (and on #perl and on #cgiapp and ... etc).

        I wholeheartedly apologize if my reply insulted you - it was not meant as such. Note - I didn't even notice that you weren't a native English speaker.

        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Null-stripping performance
by BrowserUk (Pope) on Feb 26, 2007 at 18:17 UTC

    Another thought that would probably be more efficient, if all your null padded fields are numeric.

    Perl quite happily interprets strings that contain digits and leading or trailing spaces as numbers without requiring the spaces to be removed first. If you simply replace all the nulls with spaces in a single operation:

    $str =~ tr[\0][ ];

    From that point on you can just use substr:

    sub fetchOne{ substr $str, $_[ 0 ] * 4, 4 }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.