Okay, but I'm not sure whether you can safely discount the possibility of long sequences of repeated characters, even in biodata. The following shows a scan of the complete Drosophila (fruit fly) genome looking for sequences of repeated DNA characters:
>perl -nlwe" print $1 while m[((.)\2+)]g" na_clones.dros.RELEASE2.5 | (
More? perl -nle"$b{ length($_) }{ chop $_}++ }
{ printf qq[length %d :[ A:%d C:%d G:%d T:%d ]\n],
$_, @{$b{$_}}{'A','C','G','T'} for sort{$b<=>$a} keys %b" )
length 50 :[ A:1 C:0 G:0 T:0 ]
length 47 :[ A:1 C:0 G:0 T:0 ]
length 45 :[ A:1 C:0 G:0 T:0 ]
length 44 :[ A:0 C:0 G:0 T:0 ]
length 43 :[ A:0 C:0 G:0 T:1 ]
length 42 :[ A:1 C:0 G:0 T:0 ]
length 41 :[ A:1 C:0 G:0 T:1 ]
length 40 :[ A:1 C:0 G:0 T:0 ]
length 39 :[ A:0 C:0 G:0 T:0 ]
length 38 :[ A:1 C:0 G:0 T:0 ]
length 37 :[ A:2 C:0 G:0 T:0 ]
length 36 :[ A:5 C:0 G:0 T:4 ]
length 35 :[ A:5 C:0 G:0 T:3 ]
length 34 :[ A:2 C:0 G:0 T:3 ]
length 33 :[ A:2 C:0 G:0 T:3 ]
length 32 :[ A:1 C:0 G:0 T:1 ]
length 31 :[ A:5 C:0 G:0 T:5 ]
length 30 :[ A:4 C:0 G:0 T:5 ]
length 29 :[ A:10 C:0 G:0 T:6 ]
length 28 :[ A:12 C:0 G:0 T:13 ]
length 27 :[ A:19 C:0 G:0 T:12 ]
length 26 :[ A:26 C:0 G:0 T:28 ]
length 25 :[ A:35 C:1 G:1 T:34 ]
length 24 :[ A:48 C:0 G:0 T:47 ]
length 23 :[ A:63 C:0 G:1 T:59 ]
length 22 :[ A:78 C:1 G:0 T:84 ]
length 21 :[ A:109 C:5 G:3 T:91 ]
length 20 :[ A:126 C:5 G:6 T:144 ]
length 19 :[ A:188 C:17 G:15 T:148 ]
length 18 :[ A:240 C:25 G:26 T:314 ]
length 17 :[ A:411 C:45 G:55 T:389 ]
length 16 :[ A:647 C:73 G:78 T:656 ]
length 15 :[ A:905 C:122 G:133 T:886 ]
length 14 :[ A:1267 C:191 G:216 T:1208 ]
length 13 :[ A:1813 C:255 G:260 T:1805 ]
length 12 :[ A:2702 C:353 G:353 T:2800 ]
length 11 :[ A:4209 C:380 G:404 T:4184 ]
length 10 :[ A:7181 C:446 G:428 T:7343 ]
length 9 :[ A:9964 C:386 G:437 T:10011 ]
length 8 :[ A:14581 C:699 G:601 T:14514 ]
length 7 :[ A:28336 C:2656 G:2644 T:28782 ]
length 6 :[ A:90274 C:11644 G:11543 T:89663 ]
length 5 :[ A:308601 C:52038 G:52017 T:309011 ]
length 4 :[ A:889188 C:199827 G:200919 T:885567 ]
length 3 :[ A:2424590 C:938297 G:935629 T:2422964 ]
length 2 :[ A:6232754 C:4721693 G:4718156 T:6219367 ]
As you can see, at lengths over 16 they are still a fairly frequent occurrence. And even at lengths > 32, there are still enough to be worrying.
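For anyone who finds the one-liner hard to read, here is a minimal sketch of the same tally written out as a script. The helper name `tally_runs` and the sample strings are made up for illustration; like the one-liner, it works line by line, so a run that crosses a line break would be counted as two shorter runs.

```perl
use strict;
use warnings;

# Tally runs of a single repeated character, keyed by run length and base.
# tally_runs() is a hypothetical helper name, not part of the original post.
sub tally_runs {
    my @lines = @_;
    my %count;    # $count{$length}{$base} = number of runs seen
    for my $line (@lines) {
        # ((.)\2+) matches a run of two or more identical characters:
        # $1 is the whole run, $2 is the repeated character.
        while ( $line =~ /((.)\2+)/g ) {
            $count{ length $1 }{$2}++;
        }
    }
    return \%count;
}

# Made-up sample input, not real genome data.
my $tally = tally_runs( 'AAAATTGCCCAAAA', 'GGTTTTAC' );
for my $len ( sort { $b <=> $a } keys %$tally ) {
    printf "length %d :[ A:%d C:%d G:%d T:%d ]\n",
        $len, map { $tally->{$len}{$_} // 0 } qw(A C G T);
}
```

The `//` (defined-or) operator needs Perl 5.10 or later; it just turns a missing base count into 0 for printing.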
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
The downside is that it is not only runs of a single repeated character ('AAAA'), but also short repeating sequences ('ACTACTACT') that can be missed or truncated. The up side is that for bioMan's problem a minimum match quanta of 128 is probably optimum and I'd guess that that is long enough to be unlikely to be a problem.
At this time I've not thought of a fast way of dealing with the issue and am somewhat inclined to ignore it unless someone can convince me that this is really useful code, but needs this bug fixed.
Perl is Huffman encoded by design.
... for bioMan's problem a minimum match quanta of 128 is probably optimum and I'd guess that that is long enough to be unlikely to be a problem.
Seems to be. Scanning for repeating sequences of 2, 3 & 4 characters, none was longer than 50 chars, so a minimum quanta of 64 would also probably be possible.
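A rough sketch of how such a scan for short repeating units could look, assuming the sequence is already in a single string. This is not the scan I actually ran; `longest_tandem_repeat` and the sample sequence are hypothetical. The lookahead lets matches start at every position, so a repeat that abuts an earlier, shorter one is not cut short.

```perl
use strict;
use warnings;

# Find the longest tandem repeat whose unit is $min..$max characters long.
# longest_tandem_repeat() is a hypothetical helper for illustration only.
sub longest_tandem_repeat {
    my ( $seq, $min, $max ) = @_;
    my $best = '';
    for my $unit ( $min .. $max ) {
        # (.{$unit}) captures one unit; \2+ demands at least one more copy.
        # Wrapping it in (?=...) makes the match zero-width, so /g retries
        # at every position and overlapping candidates are all examined.
        while ( $seq =~ /(?=((.{$unit})\2+))/g ) {
            $best = $1 if length $1 > length $best;
        }
    }
    return $best;
}

# Made-up sample sequence, not real genome data.
my $seq = 'GGACTACTACTACTGGATATATCC';
my $hit = longest_tandem_repeat( $seq, 2, 4 );
printf "%s (%d chars)\n", $hit, length $hit;    # ACTACTACTACT (12 chars)
```

Note that a unit of 2 also matches plain single-character runs ('AAAA' is 'AA' twice), and that trying every start position is slow on a whole genome; it is only meant to show the idea.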
inclined to ignore it unless someone can convince me that this is really useful
I understand that totally. I ended up resorting to Inline C to get speed because every attempt to improve the performance of my Perl versions ended up missing things.
Shame though. Your technique is so very fast for a pure Perl solution that it would be a real coup if it could be generalised.