http://www.perlmonks.org?node_id=1149421

nikkk has asked for the wisdom of the Perl Monks concerning the following question:

i wrote a script that given a sequence extracts a 'n' dimension hashtable of oligos, and tells how many times and where a certain oligo is located. I want to insert from keyboard an oligo and extract the positions, but ican't write it down. Any suggestions? Thank you very much! That's my script, that when asked to insert query returns the name of the file i opened first....
use strict; my $f; my $nomefile; my $line; my %oligo = (); my %pos = (); my $piece; my $query; my $sequenza; my $i; $nomefile = $ARGV[0]; open $f, "< $nomefile" or die "cannot open $nomefile: $!"; $sequenza = <$f>; while($line = <$f>) { chomp($line); if(substr($line, 0, 1) ne ">") { $sequenza = $sequenza.$line; } } $sequenza = uc $sequenza; for($i = 0; $i < length($sequenza) - 4; $i++) { $piece = substr($sequenza, $i, 4); if(exists $oligo{$piece}) { $oligo{$piece} = $oligo{$piece} + 1; push(@{$pos{$piece}} , $i + 1); } else { $oligo{$piece} = 1; push(@{$pos{$piece}} , $i + 1); } } foreach $piece (sort keys %oligo) { print("$piece\t$oligo{$piece}\t@{$pos{$piece}}\n"); } print("Insert query:\n"); $query = <>; chomp($query); $query = uc $query; if(exists $oligo{$query}) { print("$query\t compare $oligo{$query} volte\t in posizione @{$pos{$q +uery}}\n"); } else { print("$query does not appear in the sequence"); }

Replies are listed 'Best First'.
Re: Question: homemade blat: insert query from keyboard to find an oligo in an hashtable
by Cristoforo (Curate) on Dec 05, 2015 at 02:37 UTC
    Hello nikkk. To answer your question, the problem is this line: $query = <>;. The diamond operator, <> opens a file supplied in the command line (contained in @ARGV, so $query now contains the first line from that file. There are 2 ways I can think of to avoid this.

    First, you can change that line to $query = <STDIN>; and the program will read what you type in instead of doing the magic of reading from the file contained in @ARGV.

    Second, you could empty the @ARGV array at the top of your program by saying $nomefile = shift @ARGV;. shift would remove the first (and in your case the only) filename from the command line and that would solve the problem when using the empty diamond <> operator later in the program.

    Even though the second method would solve your problem, it is probably still better to use <STDIN> when reading a response from the command line anyway. That way, you will avoid any problems in your program.

    I tried to understand what your program does and attempted to solve it using the Bio::SeqIO module which provides a good way to deal with fasta files. Documentation for BioPerl is here.

    My example program and sample fasta data file are below.

    I hope this helps answer your question. If you don't understand some of the code I wrote, ask and I'll try to explain. There are probably some new things there that you may not have seen yet.

      Thank you very much for your kind answer, it was very simple, i'm new in using Perl but i'm enjoying it very much!

      Can i ask a little enlightment please?

      It's possible to search for a query not the length used to create the index (for example a query >n or a multiple of n) without recreating the whole hashtable?

      And it seems that the position of the oligos found in the sequence are shifted by one, i know that we count from zero but i can't figure out how to change it...

      Thanks!
        It's possible to search for a query not the length used to create the index (for example a query >n or a multiple of n) without recreating the whole hashtable?
        This program does it.

        Update: Changed program so it finds matches in while loop - no need to store data. Process it as you read it.

        #!/usr/bin/perl use strict; use warnings; # 'shift' removes items from @ARGV my $nomefile = shift; open my $fh, '<', $nomefile or die "Unable to open $nomefile. $!"; print "Insert query:\n"; chomp(my $query = uc <STDIN>); my $sequenza; chomp(my $id = <$fh>); # reads first line ('should' be ID) print "\nResults for $nomefile\n\n"; while (my $line = <$fh>) { chomp $line; if (substr($line, 0, 1) eq '>') { find_matches($id, $sequenza, $query); $sequenza = ''; $id = $line; } else { $sequenza .= uc $line; } find_matches($id, $sequenza, $query) if eof; } close $fh or die "Unable to close $nomefile. $!"; sub find_matches { my ($id, $sequenza, $query) = @_; my @pos; while ($sequenza =~ /(?=$query)/g) { push @pos, $-[0] + 1; } if (@pos) { printf "%s compare %d volte in posizione @pos with ID %s\n", $query, scalar @pos, $id; } else { print "$query does not appear in the sequence with ID $id\n"; } }
        For 3 runs of the program (using the same data file I used in my first post).
        C:\Old_Data\perlp>perl t33.pl fasta.txt Insert query: ttag Results for fasta.txt TTAG compare 1 volte in posizione 108 with ID >chr1 TTAG does not appear in the sequence with ID >chrM C:\Old_Data\perlp>perl t33.pl fasta.txt Insert query: aaaaa Results for fasta.txt AAAAA compare 3 volte in posizione 59 130 131 with ID >chr1 AAAAA does not appear in the sequence with ID >chrM C:\Old_Data\perlp>perl t33.pl fasta.txt Insert query: tagcgat Results for fasta.txt TAGCGAT does not appear in the sequence with ID >chr1 TAGCGAT does not appear in the sequence with ID >chrM

         

        And it seems that the position of the oligos found in the sequence are shifted by one, i know that we count from zero but i can't figure out how to change it

        This code push @pos, $-[0] + 1; will give the position as though counting from 1 (instead of 0). It adds 1 to the offset of the beginning of the pattern match.

        The @- special variable can be found here Variables related to regular expressions. Scroll down to •@LAST_MATCH_START.

        It says

        $-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Re: Question: homemade blat: insert query from keyboard to find an oligo in an hashtable
by Old_Gray_Bear (Bishop) on Dec 04, 2015 at 23:11 UTC
    For those of us who are not Biologists, you should define your terms. Specifically, what is an "oligo"? And what do you mean when you say "I can't write it down"?

    Providing an small sample of your input data, an example query string, and an example of what you expect to see in the output will be very helpful.

    ----
    I Go Back to Sleep, Now.

    OGB

      Thank you for the advices, for the next time i'll be more clear!