Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Efficient walk/iterate along a string

by gje21c (Acolyte)
on Nov 22, 2010 at 22:54 UTC ( #873068=perlquestion: print w/ replies, xml ) Need Help??
gje21c has asked for the wisdom of the Perl Monks concerning the following question:

Folks,

A seemingly very simple question which I have searched pretty hard already, honest. I'm writing proteomics code in Perl and spend a lot of time scanning down long protein strings, like

GVGFTVILISFYVGFFYNVIIAWALHYFFSSFTMDLPWIHCNNTWNSPNCSDAHASNSSDGLGLNDTFGTTPAAEYFER.

These can be 50 to 5000 chars long. I have heaps of code like

for $i (0 .. $max) { $c = substr($protein, $i, 1); # Process $c etc etc }

Profiling the code with Devel::FastProf I find that lots of time goes in the "for" loop and the "substr" call, not unexpectedly. Now, I'm pretty efficient with general coding and string handling, and have optimised piles of regexp matches after trying variants under FastProf, etc etc but to my surprise I can't find an efficient special "walk down a string" technique. Other languages have "iterators" which understand special cases like a plain walk along a string and give you an optimised interface, ie they remember where you are in the string and can give you the next char with basically one op.

I can split the string into an array with @array = split( //, $protein) and use array indexing, but this seems to only gain 20% in the best test case, and about nothing in my real code, ie the cost of splitting nullifies the gain. The protein strings are re-used about 20 times, so the split cost is amortised a little, but the gain is small. To give you a tiny flavour of the data sizes: 40,000 proteins, average lnegth 400, each protein scanned 20 times.

Ok enough guff, any suggestions appreciated.

Thanks, Greg E

Comment on Efficient walk/iterate along a string
Download Code
Re: Efficient walk/iterate along a string
by Corion (Pope) on Nov 22, 2010 at 23:02 UTC

    substr isn't exactly cheap - depending on what you really want to do with a single character, a regular expression moving across the string might (or might not) be faster. See perlop and perlre for /gc.

    while ($protein =~ /\G(.)/gc) { print $1; ... };
Re: Efficient walk/iterate along a string
by BrowserUk (Pope) on Nov 22, 2010 at 23:20 UTC

    The fastest way I've found of accessing the characters of a string is to use chop.

    Even if you need to copy the string to avoid its destruction; and reverse it to get the characters in the right order, it still comes out substantially quicker than any other method I've tried.

    If you can avoid both the copy and the reverse it gets much faster still, but that's not easy to benchmark due to the destructive process. But if you are reading the records one at a time from a file, you're probably going to overwrite the record at each iteration, so that often doesn't matter in a real application.

    #! perl -slw use strict; use Benchmark qw[ cmpthese ]; our $string = 'x'x5000; cmpthese -1, { substr => q[ for ( 0..length( $string)-1 ) { my $c = substr $string, $_, 1; } ], splitArray => q[ my @c=split'',$string; for( 0 .. $#c ){ my $c = $c[$_]; } ], splitFor => q[ for( split'', $string ){ my $c = $_; } ], unpack => q[ for( unpack 'C*', $string ) { my $c = chr; } ], reverseChop => q[ my $s = reverse $string; my $c; $c = $_ while chop $s; ], chop => q[ my $s = $string; my $c; $c = $_ while chop $s; ], ramfile => q[ open my $ram, '<', \$string; my $c; $c = $_ while $_ = getc( $ram ); ], }; __END__ C:\test>873068 Rate splitArray splitFor ramfile substr unpack reverseCh +op chop splitArray 169/s -- -48% -74% -81% -83% -8 +9% -89% splitFor 323/s 91% -- -51% -64% -67% -7 +9% -79% ramfile 654/s 288% 103% -- -27% -34% -5 +7% -58% substr 891/s 428% 176% 36% -- -9% -4 +2% -43% unpack 984/s 484% 205% 51% 10% -- -3 +6% -37% reverseChop 1534/s 809% 376% 135% 72% 56% +-- -1% chop 1555/s 822% 382% 138% 74% 58% +1% --

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Just a quick warning: if the string contains 0s, the chop solution stops as soon as one is found. Also chop does not set $_ to the removed character.

      A version that fixes both problems, and is still faster than unpack is

      my $s = reverse $string; my $c; $c = chop $s while length $s;

        True, but the extra opcode (length) imposes a 30% hit.

        Of course that diminishes if you're doing anything useful within the loop, but it is still significant for those situations where using such an obscure mechanism is worth considering.

        Luckily, genomic data doesn't usually contain zeros or nulls.

        I think it would be really nice if in the same way that in 5.12 you can use each on arrays, it would be nice to use it on scalars:

        my $string = 'fred'; my( $i, $c ); say "$i:$c" while ($i,$c) = each $string; 0:f 1:r 2:e 3:d say while $_ = each $string; f r e d

        I think that could be made very efficient by aliasing a LvTARG to the characters in situ; and would be very useful.

        IMO far more useful than the single character saving of each $arrayRef; over each @$arrayRef, which unfortunately probably means that it could not now be implemented :(


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Efficient walk/iterate along a string
by AnomalousMonk (Abbot) on Nov 22, 2010 at 23:21 UTC

    I haven't Benchmarked against anything, but unpack might be faster for splitting to an array:

    >perl -wMstrict -le "my $str = 'abcde'; my @chrs = unpack '(a)*', $str; print qq{'$_'} for @chrs; " 'a' 'b' 'c' 'd' 'e'
Re: Efficient walk/iterate along a string
by aquarium (Curate) on Nov 22, 2010 at 23:30 UTC
    just throwing it out there..what about re-encoding the sequences as a 5 bit chunk binary, and then you can use binary operations. any faster?
    the hardest line to type correctly is: stty erase ^H
Re: Efficient walk/iterate along a string
by jwkrahn (Monsignor) on Nov 23, 2010 at 01:00 UTC

    I don't know if this is faster but you could try substr with references, for example:

    for my $i ( 0 .. $max ) { my $c = \substr( $protein, $i, 1 ); # Process $$c etc etc }
Re: Efficient walk/iterate along a string
by GrandFather (Cardinal) on Nov 23, 2010 at 01:02 UTC

    Perl is a tool that handles complicated string manipulations well, but it's not designed to handle individual characters near so well. For that you need something closer to the silicon like C. However, on a case by case basis it may be that you don't need to handle individual characters and instead can use index or a regex to do much of the heavy lifting. The alternative may be to invest the time in writing an XS module that provides an iterator to efficiently traverse a string returning a character at a time and providing the current index on demand.

    True laziness is hard work
      Please do this - I've occasionally had to solve similar problems, and thought "this is where perl needs pointers", but never had the tuits to actually implement it.

        I suspect that if there was a real need for this and an XS solution does actually provide substantial speed gains over substr or the reverse and chop trick then it'd have already been done.

        Most times a smarter algorithm gives much more bang for the buck than tinkering with an implementation to make the old algorithm a little faster.

        True laziness is hard work
Re: Efficient walk/iterate along a string
by roboticus (Canon) on Nov 23, 2010 at 01:08 UTC

    gje21c:

    I think you may have simplified just a bit too much. Since you focused in on character-by-character iteration, you've pretty much constrained possible solutions into that mold. Depending on what you're doing, there may be methods that are significantly faster.

    Are you looking for particular clusters to alter? If so, you may find index to be faster to locate the clusters of interest.

    You mention that each protein is used 20 times. But that's for one run of your program, right? Are you going to use the program multiple times? If you take number of runs into account, and number of uses overall of the proteins, it may be possible that you can build a database of protein substrings and then just look them up in the database rather than having to dig for them sequentially.

    If you can give us a little more information about the operations you're performing, perhaps we can be a little more helpful.

    ...roboticus

      i was also going to suggest that maybe refactoring may be possible if more context is known. index or straight regex can be used in a loop to traverse a string, don't know if more efficient or not. using index function with each letter (alphabetically) and incrementing the start index, could also be worth a try, but only if you're ok to traverse the string this way. there's also databases with bitmap indexes, which would allow very fast access to (pre-fetched cache) of particular sequences. but all depends on the semantics of your larger goal. a question that comes to mind, why is each protein sequence processed 20 times?
      the hardest line to type correctly is: stty erase ^H
Re: Efficient walk/iterate along a string
by TomDLux (Vicar) on Nov 23, 2010 at 01:44 UTC
    • Use benchmark to compare different ways of processing.
    • Use Devel::NYTProf to determine what portions of your program are slow
    • Can you process on clumps of proteins instead of one-by-one?
    • Are there alternate algorithmns for the various tasks you need to achieve?

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Re: Efficient walk/iterate along a string
by gje21c (Acolyte) on Nov 23, 2010 at 05:56 UTC
    Hi

    Thanks for all those suggestions, I have implemented chop so far and got a solid 25% saving, even with overhead of reverse. Very worthwhile. Program run time is 5 minutes as one-off, 45 minutes in a full test suite, so 25% is good saving

    The responders who mentioned I didn't give much context are of course correct, but I chose brevity. I have all sorts of requirements re scanning the protein strings, and I'm happy with my regexp scans, index's, other such things that use the structure of a ~20 character peptide sequence within the ~400 character protein, those are pretty ok and are not topping the list in the FastProf profile, just the areas where I do a dumb-as-hell one character at a time scan. These inlcude things like scanning all the proteins and looking for an essentially arbitrary logic in the strings, ie. break the protein (as per Trypsin protease) at a K or R, and count the S,T.Y residues in each segment. I don't think there's any way to speed that up with pre-processing and clever mapping and saving results for re-use, you just have to come up with the world's best way to scan down strings one char at a time. That's my take aayway.

    I'll keep looking at your replies though, thanks again.

      If I understand the Trypsin example (I have no knowledge of this problem domain) the following code leverages Perl's string handling to produce the results I think you are looking for in this narrow case.

      use strict; use warnings; my $str = 'GVGFTVILISFKYVGFFYNVIIAWALHYFFSSFTMDLPRWIHCNNTWNSPNCSDAHASN +SSDGLGLNDTFGTTPAAEYFER'; my @segments = map {[$_, (scalar tr/STY//)]} split /(?<=[KR])/, $str; printf "%45s: %d\n", @$_ for @segments;

      Prints:

      GVGFTVILISFK: 2 YVGFFYNVIIAWALHYFFSSFTMDLPR: 6 WIHCNNTWNSPNCSDAHASNSSDGLGLNDTFGTTPAAEYFER: 10

      I've no idea how the performance may compare with whatever you are doing now, but if this does anything like the task you need then I'm sure it will be an order of magnitude faster than a character at a time implementation.

      True laziness is hard work
        Monks

        I tried the split/tr/map approach suggested by Grandfather and got ~ 60% reduction in time, great ! (and yes you got my app requirement exactly right)

        I think should withdraw my words that there's nothing to be gained from the special functions when you're just marching down strings and doing a seemingly arbitrary logic function (and also my words about inadequate context). Splitting at simple marker points and counting a simple set of chars within the splits, combined with Perl's superb list handling, is not arbitrary spaghetti logic. Lesson learned.

        One query: the ?<= construct was new to me, I've played with it and read up on it (perlre/Extended Patterns/Look-Around Assertions) and I'm 80% there but if anyone could explain exactly how this RE is ensuring I get exactly what I need I'd be grateful.

        Will continue to check some other posts, thanks all

        GJE

        Grandfather,

        Hope I've got the techno details and the "protocol" right here.

        I replied to your repy on my main thread as below. Appreciate if you could elucidate (love that word!) the regexp as per my words below. Various explorations with variants on ?<= give me the protein string split on the KR characters, with or without null strings, strings of just K or R, strings where consective K/R were condensed to just one, or none. I'm sure I can work it out with about 5 hours of exploration, but if you could just put into words what the ?<= is doibg I'd ber grateful.

        Thanks, GE.

        Cheers, GE

        -------------

        Monks
        I tried the split/tr/map approach suggested by Grandfather and got ~ 60% reduction in time, great ! (and yes you got my app requirement exactly right)

        I think should withdraw my words that there's nothing to be gained from the special functions when you're just marching down strings and doing a seemingly arbitrary logic function (and also my words about inadequate context). Splitting at simple marker points and counting a simple set of chars within the splits, combined with Perl's superb list handling, is not arbitrary spaghetti logic. Lesson learned.

        One query: the ?<= construct was new to me, I've played with it and read up on it (perlre/Extended Patterns/Look-Around Assertions) and I'm 80% there but if anyone could explain exactly how this RE is ensuring I get exactly what I need I'd be grateful.

        Will continue to check some other posts, thanks all

        GJE

        20101206 Janitored by Corion: Added formatting, code tags, as per Writeup Formatting Tips

      IMHO you can do most of the work in one single regex with nested groupings, unfortunately I don't have the time until next week to fiddle out how exactly.

      But most probably you'll get a faster response, since this is the monastery ... :)

      Cheers Rolf

      PS: but no guaranty that it's faster...

Re: Efficient walk/iterate along a string
by SimonClinch (Chaplain) on Nov 23, 2010 at 10:46 UTC
    It's probably faster to use the C-style for loop,
    for ( my $i=0; $i <= $max; $i++ ) { }
    On the other hand, you might even want to do this whole loop from within C itself. See ch 21.3. Extending Perl (Using C from Perl) of Programming Perl by Larry Wall, Tom Christiansen, and Jon Orwant (ISBN 059600027).

    erm correction, I mean do the substring extraction from C not the processing you do with it. Although the perl loop will indeed consume less resources if C-style instead of forcing a huge list to be constructed.

    One world, one people

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://873068]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (12)
As of 2014-12-29 17:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (194 votes), past polls