Capture Lookahead

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I cannot get the following code to capture what is in a lookahead (when the lookahead is within capturing parens. .

What I'm trying to achieve is to capture substrings of length 'x' beginning with the first character in the fasta sequence (and the lookahead portion) progressing from the first character, the second, and so on, until the substrings run up against the end of the characters.

Thanks in advance for any help.

Chris

#!/usr/bin/perl
use strict;
use warnings;

my $str = do {local $/; <DATA>};
$str =~ s/\s+//g;
my $len = length $str;

while (--$len > 780) {
    while ($str =~ /(.(?=.{$len}))/g) {
        print pos, "$1\n";
    }
    pos = 0;
}

__DATA__
CAATATGCGAGGGACCTACATGTTGAGCATGACAATGAATTCTATTGAAAAAAAGAGTTG
GAAGTATATACGAATATAAATAATGTGAAACAAAAGAAGAAAAGTGAATAAAAGGCACTT
AAGACGCTATCCAATTGTGTATGAGAAGTGCAAACTCAATTTTTTTGCAAAAGACTTTTC
TCAAACCTTTAATTGATGCGTTCCGTTCATACATCAACCGCGTCCCATACGTTTCGAAGA
AAATTGCCTATATGTGTTATTTACATACTAGAGATATTTTAATATTGAAACAGTTGTATT
TCTATTGTAATTACTATCTAAACGTCTCGTCCCACCGCTGTTGATAAAGCGGTGCCAATA
TAATTTATAATCACGCCTGGGTAAATGACTTTTAATTTCTTAAATCATCGAAGTATGCGA
AAACAAGAAGTCTTTATTCATAATAAAAAACAAATTCGGTTACTACGACTTTTATATGTC
ATTCAATATTTGGTAATAATTTTACAGTATTAACCCTATCCTCATCTGATTCACTCTCTT
CTAATTGCATATATTTTCAAATTCGCTTTCAGCGCTGTACAAAACCAGTAAGCAGATCTC
GTACGAAGACTCAAATAAGTTGCATTGTTCGTATTCAAGGAAACCGGGGGGCAAAATTTC
CAACCATATTTAAGTATGACAATATTTCCAAGTCAAGGATGCATGCTGTTCTTCTCTTCA
TTAACTAGCTAACCAATTAGCTGAACGGCTTTGTATTTTACTTAACATATTGTCTATTGC
ATAAAAAACCACTATTCAGC
[download]

Comment on Capture Lookahead Download Code

Replies are listed 'Best First'.
Re: Capture Lookahead by BrowserUk (Patriarch) on Jul 26, 2005 at 02:20 UTC
Like this? `#!/usr/bin/perl use strict; use warnings; my $str = do {local $/; <DATA>}; $str =~ s/\s+//g; my $len = length $str; while (--$len > 780) { printf "%3d : %s\n", $_, substr( $str, $_, $len ) for 1 .. ( length( $str ) - $len ); }` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.	[reply] [d/l]
Re^2: Capture Lookahead by Cristoforo (Curate) on Jul 26, 2005 at 02:57 UTC
Yes and thanks! I made a slight adjustment. The true reason for generating these substrings is to see if there are any palindromes and I need to check all possible substrings of the fasta string (here 800 chars). This code is not checking for palindromes (but is trivial to add that). I just needed to get the munging part correct. (Just to run the program as is created an 86 MB file, which I won't be doing - just printing out the palindromes instead). But, it does take some time and with a larger fasta string, may take a while to just test every substring. Thanks everyone. I'll be working at this for a while now. The reason I am doing this is because I saw that Mathmatica does a palindrome check in pretty terse terms (someone had a link to Mathmatica here in the Monks a few days ago). Chris `#!/usr/bin/perl use strict; use warnings; my $str = do {local $/; <DATA>}; $str =~ s/\s+//g; my $len = length $str; do { printf "%3d : %s\n", $_, substr( $str, $_, $len ) for 0 .. ( length( $str ) - $len ); } while (--$len > 3);` [download]	[reply] [d/l]
Re^3: Capture Lookahead by fishbot_v2 (Chaplain) on Jul 26, 2005 at 11:46 UTC
If you want to find palindromes, why not do it with a regex directly? `#!/usr/bin/perl use strict; use warnings; my $str = do {local $/; <DATA>}; $str =~ s/\s+//g; while ( $str =~ m/( (..+) .? (??{ reverse $2 }) )/xgc ) { print pos( $str ) . ": $1\n"; pos( $str ) = $-[0] + 1; # slide pos back to the left }` [download] prints: `14: AGGGA 21: TACAT 25: GTTG 55: GAAAAAAAG ...etc...` [download] I'm not sure how it would compare with the two-stage approach for speed, though. It is much faster if you minimize the qualifier `..+?`, but then you end up with the shortest palindrome at each position, rather than the longest. I make the assumption that you don't care about palindromes shorter than 4 characters. If you bump that upwards, things get faster. Update: I tested, and it looks like the `substr` approach is considerably faster, particularly if you do it in a single pass.	[reply] [d/l] [select]
Re: Capture Lookahead by Roy Johnson (Monsignor) on Jul 26, 2005 at 02:27 UTC
Putting the capturing parens outside the lookahead captures only the characters actually consumed. Putting them inside the lookahead will capture what the lookahead would have consumed (had it not been zero-width). So you'd have to do something like `while ($str =~ /(?=(..{$len}))./g) { print pos, "$1\n"; }` [download] That reads the whole desired capture as a lookahead, then consumes the one char. Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re: Capture Lookahead by GrandFather (Saint) on Jul 26, 2005 at 01:56 UTC
I can't see what you are trying to achieve and I don't know what a "fasta sequence" is. Can you give a sample of the output that you would expect? In the mean time, you probably mext `pos ($str)` rather than simply `pos` to manipulate the last capture pos for `$str`. I suspect also that you don't actually want the lookahead, it should be just a match. Remember that a look ahead assertion is a 0 width match. Perl is Huffman encoded by design.	[reply] [d/l] [select]
Re^2: Capture Lookahead by Anonymous Monk on Jul 26, 2005 at 02:11 UTC
First, I must say I think I've seen what I want to achieve here in some other post but I don't know how to find it now - was posted several months ago. An example would be: String of: ABCDEFGHIJK and, on one pass, I would like to capture the substrings (of some arbitrary length) ABCDEF BCDEFG CDEFGH DEFGHI EFGHIJ FGHIJK Further, on every pass, I want to decrease the length of the strings to be captured by 1 until the outer while loop bails out. Thanks for the note on my error concerning pos $str. Hope this makes the problem clearer. Chris	[reply]
Re^3: Capture Lookahead by GrandFather (Saint) on Jul 26, 2005 at 02:25 UTC
In that case you don't want a regex at all. Use substr. Read more... Spoiler - sample code (465 Bytes) Read more... output (376 Bytes) Perl is Huffman encoded by design.	[reply] [d/l] [select]
Re: Capture Lookahead by anonymized user 468275 (Curate) on Jul 26, 2005 at 07:51 UTC
Printing all possible substrings is what you asked for and what the replies have delivered. But I do wonder if that can be what you need, especially given that this will generate many duplicates. If you tell us more globally what the goal is, there might be a more efficient solution to the main problem we can find. For example, if you are next going to compare all these substrings with another list of strings, then it would be best to match them after all. One world, one people	[reply]
Re^2: Capture Lookahead by Cristoforo (Curate) on Jul 26, 2005 at 16:47 UTC
I was too hasty in my assessment of the problem :-( Yes, there are many unanswered questions - do I want to allow overlapping palindromes?, most likely don't want to allow them when they are 'shortened' versions of a larger one. And the algorithm, (generating all possible subtrings to examine), which is very time consuming, may not be the right way to go. There may even be a module on CPAN that already does this (there is a module that does palindromes for fasta strings) and it does answer some of the questions about overlapping and others as well. Thanks again to everyone who responded. And to Roy Johnston for the regex solution - very nice! Chris Yes, fishbot does it directly. I didn't know how to use the ??{ } construct - also forgot about the @- array for the position.	[reply]
Re: Capture Lookahead by inman (Curate) on Jul 26, 2005 at 09:21 UTC
Another regex solution. `#! /usr/bin/perl # use strict; use warnings; use re 'eval'; my $str = do {local $/; <DATA>}; $str =~ s/\s+//g; for (my $len = length $str; $len > 1; $len-- ) { $str =~ /(.{$len})(?{print "$1\n"})\b/; }` [download]	[reply] [d/l]

Back to Seekers of Perl Wisdom