Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re: Efficient walk/iterate along a string

by gje21c (Acolyte)
on Nov 23, 2010 at 05:56 UTC ( #873113=note: print w/ replies, xml ) Need Help??


in reply to Efficient walk/iterate along a string

Hi

Thanks for all those suggestions, I have implemented chop so far and got a solid 25% saving, even with overhead of reverse. Very worthwhile. Program run time is 5 minutes as one-off, 45 minutes in a full test suite, so 25% is good saving

The responders who mentioned I didn't give much context are of course correct, but I chose brevity. I have all sorts of requirements re scanning the protein strings, and I'm happy with my regexp scans, index's, other such things that use the structure of a ~20 character peptide sequence within the ~400 character protein, those are pretty ok and are not topping the list in the FastProf profile, just the areas where I do a dumb-as-hell one character at a time scan. These inlcude things like scanning all the proteins and looking for an essentially arbitrary logic in the strings, ie. break the protein (as per Trypsin protease) at a K or R, and count the S,T.Y residues in each segment. I don't think there's any way to speed that up with pre-processing and clever mapping and saving results for re-use, you just have to come up with the world's best way to scan down strings one char at a time. That's my take aayway.

I'll keep looking at your replies though, thanks again.


Comment on Re: Efficient walk/iterate along a string
Re^2: Efficient walk/iterate along a string
by GrandFather (Cardinal) on Nov 23, 2010 at 06:52 UTC

    If I understand the Trypsin example (I have no knowledge of this problem domain) the following code leverages Perl's string handling to produce the results I think you are looking for in this narrow case.

    use strict; use warnings; my $str = 'GVGFTVILISFKYVGFFYNVIIAWALHYFFSSFTMDLPRWIHCNNTWNSPNCSDAHASN +SSDGLGLNDTFGTTPAAEYFER'; my @segments = map {[$_, (scalar tr/STY//)]} split /(?<=[KR])/, $str; printf "%45s: %d\n", @$_ for @segments;

    Prints:

    GVGFTVILISFK: 2 YVGFFYNVIIAWALHYFFSSFTMDLPR: 6 WIHCNNTWNSPNCSDAHASNSSDGLGLNDTFGTTPAAEYFER: 10

    I've no idea how the performance may compare with whatever you are doing now, but if this does anything like the task you need then I'm sure it will be an order of magnitude faster than a character at a time implementation.

    True laziness is hard work
      Monks

      I tried the split/tr/map approach suggested by Grandfather and got ~ 60% reduction in time, great ! (and yes you got my app requirement exactly right)

      I think should withdraw my words that there's nothing to be gained from the special functions when you're just marching down strings and doing a seemingly arbitrary logic function (and also my words about inadequate context). Splitting at simple marker points and counting a simple set of chars within the splits, combined with Perl's superb list handling, is not arbitrary spaghetti logic. Lesson learned.

      One query: the ?<= construct was new to me, I've played with it and read up on it (perlre/Extended Patterns/Look-Around Assertions) and I'm 80% there but if anyone could explain exactly how this RE is ensuring I get exactly what I need I'd be grateful.

      Will continue to check some other posts, thanks all

      GJE

        I'm disappointed at only a 60% improvement! Maybe the segments are shorter than I would have guessed. This technique would give the best improvement for long strings containing few segments.

        I used a look behind in the split regular expression to retain the matched character at the end of the previous segment. Use a [KR] match to drop the 'K' or 'R', or use a look ahead match (?=[KR]) to retain the matched character at the start of the segment.

        Look behind and look ahead are anchors, they don't "consume" the matched characters, they just mark a position where some match condition is true. In this case we are testing for "was the last character R or K?".

        True laziness is hard work
      Grandfather,

      Hope I've got the techno details and the "protocol" right here.

      I replied to your repy on my main thread as below. Appreciate if you could elucidate (love that word!) the regexp as per my words below. Various explorations with variants on ?<= give me the protein string split on the KR characters, with or without null strings, strings of just K or R, strings where consective K/R were condensed to just one, or none. I'm sure I can work it out with about 5 hours of exploration, but if you could just put into words what the ?<= is doibg I'd ber grateful.

      Thanks, GE.

      Cheers, GE

      -------------

      Monks
      I tried the split/tr/map approach suggested by Grandfather and got ~ 60% reduction in time, great ! (and yes you got my app requirement exactly right)

      I think should withdraw my words that there's nothing to be gained from the special functions when you're just marching down strings and doing a seemingly arbitrary logic function (and also my words about inadequate context). Splitting at simple marker points and counting a simple set of chars within the splits, combined with Perl's superb list handling, is not arbitrary spaghetti logic. Lesson learned.

      One query: the ?<= construct was new to me, I've played with it and read up on it (perlre/Extended Patterns/Look-Around Assertions) and I'm 80% there but if anyone could explain exactly how this RE is ensuring I get exactly what I need I'd be grateful.

      Will continue to check some other posts, thanks all

      GJE

      20101206 Janitored by Corion: Added formatting, code tags, as per Writeup Formatting Tips

Re^2: Efficient walk/iterate along a string
by LanX (Canon) on Nov 25, 2010 at 02:27 UTC
    IMHO you can do most of the work in one single regex with nested groupings, unfortunately I don't have the time until next week to fiddle out how exactly.

    But most probably you'll get a faster response, since this is the monastery ... :)

    Cheers Rolf

    PS: but no guaranty that it's faster...

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://873113]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (13)
As of 2014-07-25 17:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (174 votes), past polls