http://www.perlmonks.org?node_id=478043

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I cannot get the following code to capture what is in a lookahead (when the lookahead is within capturing parens. .

What I'm trying to achieve is to capture substrings of length 'x' beginning with the first character in the fasta sequence (and the lookahead portion) progressing from the first character, the second, and so on, until the substrings run up against the end of the characters.

Thanks in advance for any help.

Chris

#!/usr/bin/perl use strict; use warnings; my $str = do {local $/; <DATA>}; $str =~ s/\s+//g; my $len = length $str; while (--$len > 780) { while ($str =~ /(.(?=.{$len}))/g) { print pos, "$1\n"; } pos = 0; } __DATA__ CAATATGCGAGGGACCTACATGTTGAGCATGACAATGAATTCTATTGAAAAAAAGAGTTG GAAGTATATACGAATATAAATAATGTGAAACAAAAGAAGAAAAGTGAATAAAAGGCACTT AAGACGCTATCCAATTGTGTATGAGAAGTGCAAACTCAATTTTTTTGCAAAAGACTTTTC TCAAACCTTTAATTGATGCGTTCCGTTCATACATCAACCGCGTCCCATACGTTTCGAAGA AAATTGCCTATATGTGTTATTTACATACTAGAGATATTTTAATATTGAAACAGTTGTATT TCTATTGTAATTACTATCTAAACGTCTCGTCCCACCGCTGTTGATAAAGCGGTGCCAATA TAATTTATAATCACGCCTGGGTAAATGACTTTTAATTTCTTAAATCATCGAAGTATGCGA AAACAAGAAGTCTTTATTCATAATAAAAAACAAATTCGGTTACTACGACTTTTATATGTC ATTCAATATTTGGTAATAATTTTACAGTATTAACCCTATCCTCATCTGATTCACTCTCTT CTAATTGCATATATTTTCAAATTCGCTTTCAGCGCTGTACAAAACCAGTAAGCAGATCTC GTACGAAGACTCAAATAAGTTGCATTGTTCGTATTCAAGGAAACCGGGGGGCAAAATTTC CAACCATATTTAAGTATGACAATATTTCCAAGTCAAGGATGCATGCTGTTCTTCTCTTCA TTAACTAGCTAACCAATTAGCTGAACGGCTTTGTATTTTACTTAACATATTGTCTATTGC ATAAAAAACCACTATTCAGC

Replies are listed 'Best First'.
Re: Capture Lookahead
by BrowserUk (Patriarch) on Jul 26, 2005 at 02:20 UTC

    Like this?

    #!/usr/bin/perl use strict; use warnings; my $str = do {local $/; <DATA>}; $str =~ s/\s+//g; my $len = length $str; while (--$len > 780) { printf "%3d : %s\n", $_, substr( $str, $_, $len ) for 1 .. ( length( $str ) - $len ); }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Yes and thanks! I made a slight adjustment. The true reason for generating these substrings is to see if there are any palindromes and I need to check all possible substrings of the fasta string (here 800 chars). This code is not checking for palindromes (but is trivial to add that). I just needed to get the munging part correct. (Just to run the program as is created an 86 MB file, which I won't be doing - just printing out the palindromes instead). But, it does take some time and with a larger fasta string, may take a while to just test every substring.

      Thanks everyone. I'll be working at this for a while now. The reason I am doing this is because I saw that Mathmatica does a palindrome check in pretty terse terms (someone had a link to Mathmatica here in the Monks a few days ago).

      Chris

      #!/usr/bin/perl use strict; use warnings; my $str = do {local $/; <DATA>}; $str =~ s/\s+//g; my $len = length $str; do { printf "%3d : %s\n", $_, substr( $str, $_, $len ) for 0 .. ( length( $str ) - $len ); } while (--$len > 3);

        If you want to find palindromes, why not do it with a regex directly?

        #!/usr/bin/perl use strict; use warnings; my $str = do {local $/; <DATA>}; $str =~ s/\s+//g; while ( $str =~ m/( (..+) .? (??{ reverse $2 }) )/xgc ) { print pos( $str ) . ": $1\n"; pos( $str ) = $-[0] + 1; # slide pos back to the left }

        prints:

        14: AGGGA 21: TACAT 25: GTTG 55: GAAAAAAAG ...etc...

        I'm not sure how it would compare with the two-stage approach for speed, though. It is much faster if you minimize the qualifier ..+?, but then you end up with the shortest palindrome at each position, rather than the longest.

        I make the assumption that you don't care about palindromes shorter than 4 characters. If you bump that upwards, things get faster.

        Update: I tested, and it looks like the substr approach is considerably faster, particularly if you do it in a single pass.

Re: Capture Lookahead
by Roy Johnson (Monsignor) on Jul 26, 2005 at 02:27 UTC
    Putting the capturing parens outside the lookahead captures only the characters actually consumed. Putting them inside the lookahead will capture what the lookahead would have consumed (had it not been zero-width). So you'd have to do something like
    while ($str =~ /(?=(..{$len}))./g) { print pos, "$1\n"; }
    That reads the whole desired capture as a lookahead, then consumes the one char.

    Caution: Contents may have been coded under pressure.
Re: Capture Lookahead
by GrandFather (Saint) on Jul 26, 2005 at 01:56 UTC

    I can't see what you are trying to achieve and I don't know what a "fasta sequence" is. Can you give a sample of the output that you would expect?

    In the mean time, you probably mext pos ($str) rather than simply pos to manipulate the last capture pos for $str. I suspect also that you don't actually want the lookahead, it should be just a match.

    Remember that a look ahead assertion is a 0 width match.


    Perl is Huffman encoded by design.
      First, I must say I think I've seen what I want to achieve here in some other post but I don't know how to find it now - was posted several months ago.

      An example would be:

      String of: ABCDEFGHIJK

      and, on one pass, I would like to capture the substrings (of some arbitrary length)

      ABCDEF
      BCDEFG
      CDEFGH
      DEFGHI
      EFGHIJ
      FGHIJK
      
      Further, on every pass, I want to decrease the length of the strings to be captured by 1 until the outer while loop bails out.

      Thanks for the note on my error concerning pos $str.

      Hope this makes the problem clearer.

      Chris

        In that case you don't want a regex at all. Use substr.


        Perl is Huffman encoded by design.
Re: Capture Lookahead
by anonymized user 468275 (Curate) on Jul 26, 2005 at 07:51 UTC
    Printing all possible substrings is what you asked for and what the replies have delivered. But I do wonder if that can be what you need, especially given that this will generate many duplicates. If you tell us more globally what the goal is, there might be a more efficient solution to the main problem we can find. For example, if you are next going to compare all these substrings with another list of strings, then it would be best to match them after all.

    One world, one people

      I was too hasty in my assessment of the problem :-(

      Yes, there are many unanswered questions - do I want to allow overlapping palindromes?, most likely don't want to allow them when they are 'shortened' versions of a larger one. And the algorithm, (generating all possible subtrings to examine), which is very time consuming, may not be the right way to go. There may even be a module on CPAN that already does this (there is a module that does palindromes for fasta strings) and it does answer some of the questions about overlapping and others as well.

      Thanks again to everyone who responded. And to Roy Johnston for the regex solution - very nice!

      Chris Yes, fishbot does it directly. I didn't know how to use the ??{ } construct - also forgot about the @- array for the position.

Re: Capture Lookahead
by inman (Curate) on Jul 26, 2005 at 09:21 UTC
    Another regex solution.
    #! /usr/bin/perl # use strict; use warnings; use re 'eval'; my $str = do {local $/; <DATA>}; $str =~ s/\s+//g; for (my $len = length $str; $len > 1; $len-- ) { $str =~ /(.{$len})(?{print "$1\n"})\b/; }