Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re: homemade blat: need help with reverse string.

by Cristoforo (Curate)
on Dec 15, 2015 at 17:48 UTC ( [id://1150400]=note: print w/replies, xml ) Need Help??


in reply to homemade blat: need help with reverse string.

This seems to be similar to what you asked here Question: homemade blat: insert query from keyboard to find an oligo in an hashtable.

If I understand your problem, this program would be a solution.

#!/usr/bin/perl use strict; use warnings; print "Insert query:\n"; chomp(my $query = uc <STDIN>); print "\n"; my $nomefile = $ARGV[0]; open my $f, '<', $nomefile or die "Unable to open $nomefile. $!"; chomp(my $id_line = <$f>); die "Input file $nomefile not in FASTA format\n" if substr($id_line,0,1) ne ">"; my $sequenza; while (my $line = <$f>) { chomp $line; if (substr($line, 0, 1) eq '>') { report_matches($id_line, $sequenza, $query); $sequenza = ''; $id_line = $line; } else { $sequenza .= uc $line; } # necessary to report last record if eof report_matches($id_line, $sequenza, $query) if eof; } close $f or die "Unable to close $nomefile. $!"; sub report_matches { my ($id, $sequenza, $query) = @_; my %pos; my $revcomp = reverse $query =~ tr/ATCG/TAGC/r; my $len_query = length $query; for my $i (0 .. length($sequenza)-$len_query) { my $piece = substr $sequenza, $i, $len_query; if ($piece eq $query) { push @{ $pos{q} }, $i; } elsif ($piece eq $revcomp) { push @{ $pos{rc} }, $i; } } print "For id: $id\n"; if ($pos{q}) { printf "QUERY: $query\tappears %d times\t position @{ $pos{q} +}\n", scalar @{ $pos{q} }; } if ($pos{rc}) { printf "REVCOMP: $revcomp\tappears %d times\tposition @{ $pos{ +rc} }\n", scalar @{ $pos{rc} }; } print "\n"; }
Some output from a fasta file.
C:\Old_Data\perlp>perl test3.pl fasta_dat.txt Insert query: ttttt For id: >NR_037701 1 QUERY: TTTTT appears 6 times position 615 680 1114 1374 1428 2322 REVCOMP: AAAAA appears 47 times position 1640 2010 3304 3305 3 +306 3307 3308 3309 3 310 3311 3312 3313 3314 3315 3316 3317 3318 3319 3320 3321 3322 3323 3 +324 3325 3326 3327 3 328 3329 3330 3331 3332 3333 3334 3335 3336 3337 3338 3339 3340 3341 3 +342 3343 3344 3345 3 346 3347 3348 For id: >NM_198399 1 QUERY: TTTTT appears 19 times position 182 908 920 1014 116 +9 1191 1192 1313 131 4 1315 1316 1317 1472 1671 1705 1706 1956 2389 2390 REVCOMP: AAAAA appears 20 times position 591 953 1051 1052 127 +5 2149 2364 2373 237 4 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 For id: >NR_026816 1 QUERY: TTTTT appears 3 times position 371 372 384 REVCOMP: AAAAA appears 3 times position 593 594 595 For id: >NR_027917 1 QUERY: TTTTT appears 1 times position 339 For id: >NR_002777 3 QUERY: TTTTT appears 2 times position 921 922 REVCOMP: AAAAA appears 25 times position 286 577 975 976 977 9 +78 979 980 981 982 9 83 984 985 986 987 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 For id: >NR_033769 1 QUERY: TTTTT appears 7 times position 75 1176 1177 1229 1230 1321 +1322 REVCOMP: AAAAA appears 6 times position 629 1590 1591 1592 1598 1609 For id: >NM_016326 3 QUERY: TTTTT appears 1 times position 368 REVCOMP: AAAAA appears 5 times position 321 337 338 339 340 For id: >NM_181641 2 QUERY: TTTTT appears 4 times position 255 256 324 527 REVCOMP: AAAAA appears 5 times position 480 496 497 498 499 For id: >NM_001144931 1 For id: >NR_029429 1 For id: >NR_026551 1 QUERY: TTTTT appears 1 times position 415 For id: >NM_181640 2 QUERY: TTTTT appears 1 times position 464 REVCOMP: AAAAA appears 5 times position 417 433 434 435 436 For id: >NM_016951 3 QUERY: TTTTT appears 4 times position 255 256 324 623 REVCOMP: AAAAA appears 5 times position 576 592 593 594 595 For id: >NR_002773 1 QUERY: TTTTT appears 1 times position 1377 REVCOMP: AAAAA appears 2 times position 1929 1937 For id: >NR_037806 1 QUERY: TTTTT appears 6 times position 693 837 838 839 840 1114 REVCOMP: AAAAA appears 3 times position 1292 1293 1322

Replies are listed 'Best First'.
Re^2: homemade blat: need help with reverse string.
by nikkk (Novice) on Dec 16, 2015 at 14:58 UTC
    Hello, Thank you for your kind response! I worked on it and this program does everything i need
    use strict; my @bases; my @list; my @List; my %oligo = (); my %pos = (); my $base; my $list; my $f; my $nomefile; my $line; my $piece; my $query; my $len; my $sequenza; my $i; my $j; my $rev; my $revcomp; my $oligo; my $length = length($sequenza); my $piececomp; print("Insert query:\n"); $query = <STDIN>; chomp($query); $len = length($query); $query = uc $query; @bases = qw(A G C T); #bases @list = qw(A G C T); #seed for ($i = 0; $i < ($len - 1); $i++) { foreach $base (@bases) { foreach $list (@list) { $oligo{$base.$list} = 1; } @List = keys %oligo; } @list = keys %oligo; %oligo = (); #print "@list xxx \n"; } %oligo = (); @List = sort @List; $nomefile = $ARGV[0]; open $f, "< $nomefile" or die "cannot open $nomefile: $!"; $line = <$f>; chomp($line); if(substr($line,0,1) ne ">") { print STDERR ("Input file $nomefile not in FASTA format\n"); exit; } $sequenza = <$f>; while($line = <$f>) { chomp($line); if(substr($line, 0, 1) ne ">") { $sequenza = $sequenza.$line; } } $sequenza = uc $sequenza; for($i = 0; $i < length($sequenza) - ($len - 1); $i++) # nell'ultimo f +rame di lettura non deve contare la len intera se no va 'fuori sequen +za' e non trova l'oligo corrispondente { $piece = substr($sequenza, $i, $len); if(exists $oligo{$piece}) { $oligo{$piece} = $oligo{$piece} + 1; push(@{$pos{$piece}} , $i ); } else { $oligo{$piece} = 1; push(@{$pos{$piece}} , $i ); } } foreach $piece (sort keys %oligo) { print("$piece\t$oligo{$piece}\t@{$pos{$piece}}\n\n"); } $rev= reverse $query; $rev=~ tr/ACTG/TGAC/; $revcomp = $rev; for($j = $length; $j < $length - $len; $j--) { $revcomp = substr($sequenza, $j, $len); if(exists $oligo{$revcomp}) { push(@{$pos{$revcomp}} , ($length - $j) - ($len - 1)); } } if(exists $oligo{$query} && !exists $oligo{$revcomp}) { print("$query\t compare $oligo{$query} volte\t in posizione @{$pos +{$query}} solo sul filamento positivo\n"); } else { if(exists $oligo{$query} && exists $oligo{$revcomp}) { print("$query\t compare $oligo{$query} volte\t in posizione @{ +$pos{$query}} sul filamento positivo\n\ne compare $oligo{$revcomp} vo +lte\t in posizione @{$pos{$revcomp}}\t sul filamento negativo\n"); } else { if(!exists $oligo{$query} && exists $oligo{$revcomp}) { print("$query compare $oligo{$revcomp} volte\t in posizion +e @{$pos{$revcomp}}\t sul filamento negativo\n"); } else { print("$query non compare nella sequenza\n"); } } }
    It tells me if an oligo appears only in positive strand, in positive and negative and only in negative strand by searching in an hash table i created using the length of the query written from keyboard. I'm searching an oligo on the negative strand by searching its reverse complementary on the positive strand. The sequence i'm working on begins with GATCAC and end with ACGATG. The program works fine with these sequence, but when i search their reverse complementary, the position on the negative strand are wrong (if i search the reverse complementary of the first oligo in the positive strand, it should be in the last position of the negative strand, instead the program tells me that it is in the first one). This is the problem! However you've done a great work, and it should help me with my problem!

      Looking at just $length and $len, the code shown has the following content:

      ... my $len; my $sequenza; ... my $j; ... my $length = length($sequenza); ... print "Insert query:\n"; $query = <STDIN>; chomp $query; $len = length($query); ... for($j = $length; $j < $length - $len; $j--) { ... } ...

      As you can see, when $length is initialized, $sequenza is undefined, so $length still has a numerical value of zero (as I pointed out above).

      However, even if $length were to be initialized correctly before the for loop, the loop condition still looks wrong to me. $j is initially assigned the value in $length, and is then decremented on each iteration. The stopping condition is $j < $length - $len, and neither $length nor $len is changed in the loop body. So if the condition is initially true, i.e., if $j is initially less than some fixed value, decrementing $j will never make the condition false (well, not until an integer underflow occurs, anyway), which means you effectively have an infinite loop.

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        I noticed looking up at the UCSC browser's BLAT that i was mistaking the evaluation of the position on the single strand! Sure you help was well accepted, thank you very much!
Re^2: homemade blat: need help with reverse string.
by nikkk (Novice) on Dec 19, 2015 at 17:05 UTC
    Indeed you're code was very very interersting, but i needed to create that hash table containing the oligos! However i finally came up on how to calculate the right positions of oligos on the negative strand thanks to your program (and looking up on how they were evalueted on the UCSC browser BLAT). Thank you for your kind help!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1150400]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (6)
As of 2024-04-23 07:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found