http://www.perlmonks.org?node_id=687744

heidi has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a sequence like this:
APADPKGSTIDRPDAARTLTVHKCEQTDTRGVKEGTRNEDPQAECKPVSDVEFTITKLNVD
I want to cleave this sequence at the regions of "K" or "R", but they should not be present before "P". When I split it on either K or R, the K and R letters disappear. So how do I split and also retain the letters? Finally, the list of fragments should be stored in an array. Please help. Thanks :)

Re: cleaving a sequence with specific alphabets
by mwah (Hermit) on May 21, 2008 at 11:09 UTC

    It's not entirely clear to me what you are trying to do.

    ...
    my @s = split /
               (?<=[KR])   # split on K,R
               (?!P)       # but not if in front of a P
            /x, 'APADPKGSTIDRPDAARTLTVHKCEQTDTRGVKEGTRNEDPQAECKPVSDVEFTITKLNVD';
    print join "\n", @s;
    ...

    The above would be my first guess here.

    Regards

    mwa

      Probably not as good as the split solution but perhaps a little easier to read if you recognise the idiom:

      @arr = $string =~ m/(.*?[KR])(?!P)/gs;
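
As a rough illustration of what this idiom returns for the sequence in the question (a sketch only; `@all` and the fallback alternative are assumptions added for illustration): the match succeeds only where a K or R ends the fragment, so a tail without a cleavage site, here LNVD, is not captured. One possible way to keep it, assuming that is wanted, is a second alternative that takes whatever is left:

```perl
use strict;
use warnings;

my $string = 'APADPKGSTIDRPDAARTLTVHKCEQTDTRGVKEGTRNEDPQAECKPVSDVEFTITKLNVD';

# The idiom as posted: each match ends at a K or R that is not followed by P.
my @arr = $string =~ m/(.*?[KR])(?!P)/gs;

# Variant with a fallback: when no further cleavage site exists,
# the second alternative takes the remainder of the string.
my @all = $string =~ m/(.*?[KR](?!P)|.+)/gs;

print join("\n", @all), "\n";
```

With the fallback, joining the fragments back together reproduces the input, which is a quick sanity check that nothing was dropped.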

        You won't need capturing parentheses when evaluating in list context. AFAIK the .*? incurs a speed penalty, so the expression might be optimized as:

        ...
        my @arr = $string =~ /
                     (?: [^KR] | [KR](?=P) )*   # collect non K|R, and any K|R that sits before a P
                     [KR]                       # the following must be K|R
                     (?!P)                      # ignore if fragment would start by P
                  /gsx;
        ...

        Regards

        mwa

Re: cleaving a sequence with specific alphabets
by grizzley (Chaplain) on May 21, 2008 at 11:07 UTC

    I don't understand. Can you say which of the following cases are good candidates for dividing the string?

    K R
    K P
    K any_other_char
    P K
    P R
    P any_char
    R K
    R P
    R any_char

    I think a look-ahead assertion will be the solution to your problem, but please answer the above question first.

Re: cleaving a sequence with specific alphabets
by BrowserUk (Patriarch) on May 21, 2008 at 11:21 UTC

    Like this (before a K or R not followed by P)

    print for split '(?=[KR][^P])', 'APADPKGSTIDRPDAARTLTVHKCEQTDTRGVKEGTRNEDPQAECKPVSDVEFTITKLNVD';

    APADP
    KGSTIDRPDAA
    RTLTVH
    KCEQTDT
    RGV
    KEGT
    RNEDPQAECKPVSDVEFTIT
    KLNVD

    Or maybe like this (After a K or R not followed by P):

    print for split '(?<=[KR](?!P))', 'APADPKGSTIDRPDAARTLTVHKCEQTDTRGVKEGTRNEDPQAECKPVSDVEFTITKLNVD';

    APADPK
    GSTIDRPDAAR
    TLTVHK
    CEQTDTR
    GVK
    EGTR
    NEDPQAECKPVSDVEFTITK
    LNVD
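
Since the question asks for the fragments to end up in an array, the second variant could be wrapped roughly like this. This is a minimal sketch; the sub name `cleave` and the variable names are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: cut after every K or R that is not followed by P.
# The look-behind and look-ahead are zero-width, so split consumes no
# characters and the K/R stays at the end of its fragment.
sub cleave {
    my ($seq) = @_;
    return split /(?<=[KR])(?!P)/, $seq;
}

my @fragments = cleave('APADPKGSTIDRPDAARTLTVHKCEQTDTRGVKEGTRNEDPQAECKPVSDVEFTITKLNVD');
print "$_\n" for @fragments;
```

Splitting on a zero-width position is what keeps the letters: a plain `split /[KR]/` treats the K or R itself as the separator and discards it.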

Re: cleaving a sequence with specific alphabets
by prasadbabu (Prior) on May 21, 2008 at 11:16 UTC

    If I understood your question correctly, here is my solution. You can use a negative look-behind regex and the split function to accomplish it. TIMTOWTDI.

    $string = 'APADPKGSTIDRPDAARTLTVHKCEQTDTRGVKEGTRNEDPQAECKPVSDVEFTITKLNVD';
    @arr = split /(?<!P)(K|R)/, $string;
    print @arr;

    output:
    -------
    APADPKGSTIDRPDAARTLTVHKCEQTDTRGVKEGTRNEDPQAECKPVSDVEFTITKLNVD

    Take a look at perlre and the split function.

    Prasad
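
Because `print @arr` prints the elements back to back, the output above looks identical to the input. A sketch that joins the elements with a visible separator shows what this split actually returns (the `|` separator is only for illustration). Two things are worth noting: the parentheses around `K|R` make split return each matched letter as its own element, and the look-behind `(?<!P)` tests the character before the K or R, so a K or R that is followed by P is still cut.

```perl
use strict;
use warnings;

my $string = 'APADPKGSTIDRPDAARTLTVHKCEQTDTRGVKEGTRNEDPQAECKPVSDVEFTITKLNVD';

# Split at every K or R that is NOT preceded by P; the capture group
# makes split return the matched K/R as separate list elements.
my @arr = split /(?<!P)(K|R)/, $string;

# Join with a visible separator so the element boundaries can be seen.
print join('|', @arr), "\n";
```

If the intended rule is "K or R not followed by P", the look-ahead variants shown earlier in the thread are the closer fit.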