Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Defining substring matches

by drhicks (Novice)
on Sep 21, 2013 at 00:12 UTC ( [id://1055088]=perlquestion: print w/replies, xml ) Need Help??

drhicks has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am writing a script to search for motifs in DNA sequence. A user of this program would input at the command line, their fasta file containing the DNA sequence string, and a particular motif to search for. I wish to use the substr() function to search for the particular motif. So far, this is easy enough to do when using A = A, T = T, C = C, and G = G. However, I would like to add the rest of the IUPAC code so that users may define motifs such as ATCNRC, where N = A, T, C, or G and R = A or G? The full IUPAC nucleic acid code is below.

A = '[A]'; C = '[C]'; G = '[G]'; T = '[T]'; R = '[AG]'; Y = '[CT]'; M = '[AC]'; K = '[GT]'; W = '[AT]'; S = '[GC]'; B = '[CGT]'; D = '[AGT]'; H = '[ACT]'; V = '[ACG]'; N = '[ACGT]';
Thanks for any help!

Replies are listed 'Best First'.
Re: Defining substring matches
by BrowserUk (Patriarch) on Sep 21, 2013 at 00:42 UTC
    I wish to use the substr() function to search for the particular motif.
    #

    substr is *not* designed for (nor capable of) searching for anything; so why are you specifying that particular function?

    You've defined your IUPAC codes in terms of regex character classes; so why are you eschewing the regex engine?

    Given your table, it is trivial to convert IUPAC codes into a regex and use the regex engine to search your fasta file:

    my %IUPAC = ( A => '[A]', C => '[C]', G => '[G]', T => '[T]', R => '[AG]', Y => '[CT]', M => '[AC]', K => '[GT]', W => '[AT]', S => '[GC]', B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]', N => '[ACGT]', ); my( $file, $motif ) = @ARGV; my $re = join '', map $IUPAC{ $_ }, split '', $motif; open FASTA, '<', $file or die $!; getc( FASTA ); ## discard first '>' until( eof( FASTA ) ) { chomp( my $id = <FASTA> ); ## read ident my $seq = do{ local $/ = '>'; <FASTA> }; $seq =~ tr[\n>][]d; while( $seq =~ m[($re)]g ) { printf "Found: '$1' at '$id':%d\n", $-[0]; } }

    NB: The above is untested code typed directly into my browser.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      until( eof( FASTA) {

      You have two left parentheses but only one right parenthesis.

      $seq =~ tr[\n][];

      You are replacing the newline character with a newline character?

        You are replacing the newline character with a newline character?

        Indeed, to count newline characters, but not save the count? Hmmm:)

        See above:
        NB: The above is untested code typed directly into my browser.

        Corrected.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      while( $seq =~ m[($re)]g ) { printf "Found: '$1' at '$id':%d\n", $-[0]; }

      A question of idle (and perhaps rather trivial) curiosity: In the quoted code, you use  $-[0] "offset of the start of the last successful match" (see  @- in perlvar). My reflexive choice would have been  $-[1] since capture group 1 is being matched. There's no difference in the behavior of the code since capture group 1 is all that's matched, but was there a particular reason you chose as you did?

        Um. I cannot remember ever using indexes above 0 with @- or @+. I'm sure I probably have at some point, but I don't remember doing so.

        When I first typed it, I didn't use capture brackets and only printed the id and position. I added the capture brackets and '$1' as an afterthought.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      This code works perfect for the job. It is so much simpler and does more than the script I had written using substr(). Thanks Derrick

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1055088]
Approved by Athanasius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (3)
As of 2024-04-26 01:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found