Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hello
I cannot get the following code to capture what is in a lookahead (when the lookahead is within capturing parens. .
What I'm trying to achieve is to capture substrings of length 'x' beginning with the first character in the fasta sequence (and the lookahead portion) progressing from the first character, the second, and so on, until the substrings run up against the end of the characters. Thanks in advance for any help. Chris
#!/usr/bin/perl
use strict;
use warnings;
my $str = do {local $/; <DATA>};
$str =~ s/\s+//g;
my $len = length $str;
while (--$len > 780) {
while ($str =~ /(.(?=.{$len}))/g) {
print pos, "$1\n";
}
pos = 0;
}
__DATA__
CAATATGCGAGGGACCTACATGTTGAGCATGACAATGAATTCTATTGAAAAAAAGAGTTG
GAAGTATATACGAATATAAATAATGTGAAACAAAAGAAGAAAAGTGAATAAAAGGCACTT
AAGACGCTATCCAATTGTGTATGAGAAGTGCAAACTCAATTTTTTTGCAAAAGACTTTTC
TCAAACCTTTAATTGATGCGTTCCGTTCATACATCAACCGCGTCCCATACGTTTCGAAGA
AAATTGCCTATATGTGTTATTTACATACTAGAGATATTTTAATATTGAAACAGTTGTATT
TCTATTGTAATTACTATCTAAACGTCTCGTCCCACCGCTGTTGATAAAGCGGTGCCAATA
TAATTTATAATCACGCCTGGGTAAATGACTTTTAATTTCTTAAATCATCGAAGTATGCGA
AAACAAGAAGTCTTTATTCATAATAAAAAACAAATTCGGTTACTACGACTTTTATATGTC
ATTCAATATTTGGTAATAATTTTACAGTATTAACCCTATCCTCATCTGATTCACTCTCTT
CTAATTGCATATATTTTCAAATTCGCTTTCAGCGCTGTACAAAACCAGTAAGCAGATCTC
GTACGAAGACTCAAATAAGTTGCATTGTTCGTATTCAAGGAAACCGGGGGGCAAAATTTC
CAACCATATTTAAGTATGACAATATTTCCAAGTCAAGGATGCATGCTGTTCTTCTCTTCA
TTAACTAGCTAACCAATTAGCTGAACGGCTTTGTATTTTACTTAACATATTGTCTATTGC
ATAAAAAACCACTATTCAGC
Re: Capture Lookahead
by BrowserUk (Patriarch) on Jul 26, 2005 at 02:20 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
my $str = do {local $/; <DATA>};
$str =~ s/\s+//g;
my $len = length $str;
while (--$len > 780) {
printf "%3d : %s\n", $_, substr( $str, $_, $len )
for 1 .. ( length( $str ) - $len );
}
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
| [reply] [d/l] |
|
#!/usr/bin/perl
use strict;
use warnings;
my $str = do {local $/; <DATA>};
$str =~ s/\s+//g;
my $len = length $str;
do {
printf "%3d : %s\n", $_, substr( $str, $_, $len )
for 0 .. ( length( $str ) - $len );
} while (--$len > 3);
| [reply] [d/l] |
|
#!/usr/bin/perl
use strict;
use warnings;
my $str = do {local $/; <DATA>};
$str =~ s/\s+//g;
while ( $str =~ m/( (..+) .? (??{ reverse $2 }) )/xgc )
{
print pos( $str ) . ": $1\n";
pos( $str ) = $-[0] + 1; # slide pos back to the left
}
prints:
14: AGGGA
21: TACAT
25: GTTG
55: GAAAAAAAG
...etc...
I'm not sure how it would compare with the two-stage approach for speed, though. It is much faster if you minimize the qualifier ..+?, but then you end up with the shortest palindrome at each position, rather than the longest.
I make the assumption that you don't care about palindromes shorter than 4 characters. If you bump that upwards, things get faster.
Update: I tested, and it looks like the substr approach is considerably faster, particularly if you do it in a single pass. | [reply] [d/l] [select] |
Re: Capture Lookahead
by Roy Johnson (Monsignor) on Jul 26, 2005 at 02:27 UTC
|
Putting the capturing parens outside the lookahead captures only the characters actually consumed. Putting them inside the lookahead will capture what the lookahead would have consumed (had it not been zero-width). So you'd have to do something like
while ($str =~ /(?=(..{$len}))./g) {
print pos, "$1\n";
}
That reads the whole desired capture as a lookahead, then consumes the one char.
Caution: Contents may have been coded under pressure.
| [reply] [d/l] |
Re: Capture Lookahead
by GrandFather (Saint) on Jul 26, 2005 at 01:56 UTC
|
I can't see what you are trying to achieve and I don't know what a "fasta sequence" is. Can you give a sample of the output that you would expect?
In the mean time, you probably mext pos ($str) rather than simply pos to manipulate the last capture pos for $str. I suspect also that you don't actually want the lookahead, it should be just a match.
Remember that a look ahead assertion is a 0 width match.
Perl is Huffman encoded by design.
| [reply] [d/l] [select] |
|
ABCDEF
BCDEFG
CDEFGH
DEFGHI
EFGHIJ
FGHIJK
Further, on every pass, I want to decrease the length of the strings to be captured by 1 until the outer while loop bails out.
Thanks for the note on my error concerning pos $str.
Hope this makes the problem clearer.
Chris | [reply] |
|
| [reply] [d/l] [select] |
Re: Capture Lookahead
by anonymized user 468275 (Curate) on Jul 26, 2005 at 07:51 UTC
|
Printing all possible substrings is what you asked for and what the replies have delivered. But I do wonder if that can be what you need, especially given that this will generate many duplicates. If you tell us more globally what the goal is, there might be a more efficient solution to the main problem we can find. For example, if you are next going to compare all these substrings with another list of strings, then it would be best to match them after all.
| [reply] |
|
I was too hasty in my assessment of the problem :-(
Yes, there are many unanswered questions - do I want to allow overlapping palindromes?, most likely don't want to allow them when they are 'shortened' versions of a larger one. And the algorithm, (generating all possible subtrings to examine), which is very time consuming, may not be the right way to go. There may even be a module on CPAN that already does this (there is a module that does palindromes for fasta strings) and it does answer some of the questions about overlapping and others as well. Thanks again to everyone who responded. And to Roy Johnston for the regex solution - very nice!
Chris
Yes, fishbot does it directly. I didn't know how to use the ??{ } construct - also forgot about the @- array for the position.
| [reply] |
Re: Capture Lookahead
by inman (Curate) on Jul 26, 2005 at 09:21 UTC
|
#! /usr/bin/perl
#
use strict;
use warnings;
use re 'eval';
my $str = do {local $/; <DATA>};
$str =~ s/\s+//g;
for (my $len = length $str; $len > 1; $len-- ) {
$str =~ /(.{$len})(?{print "$1\n"})\b/;
}
| [reply] [d/l] |
|
|