http://www.perlmonks.org?node_id=11113020

Sofie has asked for the wisdom of the Perl Monks concerning the following question:

Hi I am very new to perl and struggling with simple things... I am trying to check if an input DNA sequence only contains nucleotides. And if it doesn't I want to print out the position in the sequence where an invalid character was entered. This is as far as I have come:
#!/usr/bin/perl -w $DNA = <STDIN>; chomp ($DNA); @DNA = split ("", $DNA); $lengthseq = scalar @DNA; print "The length of the sequence is:\n", $lengthseq, "\n"; @nucleotideDNA = ""; #check if each element in array is nucleotide foreach $nucleotide (@DNA){ if ($nucleotide =~ /^[ATCG]+$/){ push @nucleotideDNA, $nucleotide; } else { push @nonvalid, $nucleotide; } }
But how can I print the position of the non valid character? Not sure if this makes any sense.. Thanks

Replies are listed 'Best First'.
Re: Find element in array
by tobyink (Canon) on Feb 16, 2020 at 12:46 UTC

    Instead of matching valid sequences, match invalid characters. Then use $-[0] to find the position of that match. (The @- array is documented in the "perlvar" manual page.)

    use strict; use warnings; while (my $sequence = <DATA>) { chomp $sequence; if ($sequence =~ /[^ATCG]/){ warn "Sequence '$sequence' has invalid character after " . $-[ +0]; } else { print "Valid sequence: '$sequence'\n"; } } __DATA__ TAAGAACAATAAGAACAA TAAGAACAATAAUAACAA TAAGAACAATAAGAACAA

    You don't need to split the sequence up into individual characters and process each one separately. That's slow.

      Addendum: $. aka $INPUT_LINE_NUMBER will also give the current line number when reading from a file handle.

      Update

      use strict; use warnings; while (my $sequence = <DATA>) { chomp $sequence; if ($sequence =~ /[^ATCG]/){ warn "Sequence '$sequence' in line $. has invalid character af +ter " . $-[0]; } else { print "Valid sequence: '$sequence'\n"; } } __DATA__ TAAGAACAATAAGAACAA TAAGAACAATAAUAACAA TAAGAACAATAAGAACAA

      Sequence 'TAAGAACAATAAUAACAA' in line 2 has invalid character after 12 at parse_dna.pl line 7, <DATA> line 2.
      

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        Take a look at that warning message:

        Sequence 'TAAGAACAATAAUAACAA' in line 2 has invalid character after 12 + at parse_dna.pl line 7, <DATA> line 2. ^^^^^^ + ^^^^^^

        The warn function already includes $. in its output, unless your message ends in "\n".

        I still can't get this to work. I am obviously missing something. So what I want to do is: Use an input DNA sequence and put in array. For each element in array check if the element is a nucleotide. If not a nucleotide print "not a valid character at position xx". In the end, print out the number of characters that were valid and the number of characters that were not valid, calculate the percentage of non-valid characters. Does this make sense? Thanks
      Thanks for your quick reply. I still don't get it to work, sorry I am really a beginner at this. The ultimate goal is to print out all the positions in the original array @DNA that has an invalid character. And then to count the number of characters that are invalid. How is the if statement looking or the invalid characters? How can I use the $-[0] to find the position of the invalid characters? Thanks again
Re: Find element in array
by johngg (Canon) on Feb 16, 2020 at 15:28 UTC

    Does this do what you want? There is no need to split the sequence into an array as pos will allow you to find where in a string a match has been made. Note that [^ACGT] is a negative character class, i.e. match anything that isn't A, C, G or T. Using capturing parentheses, ( ... ), and matching globally, m{ ... }g or / ... /g will advance along the sequence looking for invalid letters.

    I am opening a file that is held inside the script just to keep things tidy on my system but the code will work fine with STDIN. The code.

    use 5.026; use warnings; open my $dnaFH, q{<}, \ <<__EOD__ or die $!; TAAGAACAATAAGAACAAGAACAATAA GAACAATAAGXAATAAGAAXXAACAAGAACAATAA ACAATAAAAGAACAATAAGAA __EOD__ while ( my $sequence = <$dnaFH> ) { chomp $sequence; my $length = length $sequence; say qq{Sequence: $sequence -- Length $length}; if ( $sequence =~ m{^[ACGT]+$} ) { say q{ Sequence is GOOD!}; } else { my @badPosns; push @badPosns, pos $sequence while $sequence =~ m{(?x) (?= ( [^ACGT] ) )}g; my $nBad = scalar @badPosns; my $perc = sprintf q{%.2f}, $nBad / $length * 100; say qq{ Sequence is BAD at @badPosns}; say qq{ $nBad bad positions, $perc\% of total}; } } close $dnaFH or die $!;

    The output.

    Sequence: TAAGAACAATAAGAACAAGAACAATAA -- Length 27 Sequence is GOOD! Sequence: GAACAATAAGXAATAAGAAXXAACAAGAACAATAA -- Length 35 Sequence is BAD at 10 19 20 3 bad positions, 8.57% of total Sequence: ACAATAAAAGAACAATAAGAA -- Length 21 Sequence is GOOD!

    I hope this is helpful. Please ask further if you need more help.

    Update: There was a mistake in the code, I should have used a look-ahead assertion as without that pos gives the position after the match, not that of the match itself. Added extended syntax ((?x)) to make the regex clearer. My bad :-(

    Update 2: I should also have corrected the output, now done.

    Cheers,

    JohnGG

Re: Find element in array
by GrandFather (Saint) on Feb 16, 2020 at 20:30 UTC

    You have already been given the pieces you need, but they don't fit your hand and you haven't shown us what you have tried when you say "I still don't get it to work". The code below is a slightly more fully worked example using suggestions you have already been given:

    use strict; use warnings; my $inputLines = <<LINES; TAAGAACAATAAGAACAA TAAGAACA.TAAGAACAA TAAG-ACA.TAAGAA_AA LINES open my $inFile, '<', \$inputLines or die "Can't open file: $!\n"; while (my $line = <$inFile>) { chomp $line; next if $line !~ /([^ACGT])/; print "In line $.: $line\n"; my $offset = 0; while ($line =~ /([^ACGT])/) { print "Found $1 at ", $-[1] + $offset, "\n"; substr $line, 0, $-[1] + 1, ''; $offset += $-[1] + 1; } }

    Prints:

    In line 2: TAAGAACA.TAAGAACAA Found . at 8 In line 3: TAAG-ACA.TAAGAA_AA Found - at 4 Found . at 8 Found _ at 15

    This is a "Simple Self Contained Example". You can run the code without needing anything else. You should first copy this code (cut and paste is highly recommended) and check that it works yourself. Then play with it until you have some understanding of how it works. Then adapt it to you own needs.

    There are some important things there. Note the use of strictures (use strict; use warnings;). Always use strictures in your code! The my $inputLines = <<LINES; and following lines create a variable initialised with multiple lines of text. That is used in open my $inFile, '<', \$inputLines or die "Can't open file: $!\n"; as a file. You can replace \$inputLines with a file name to open a file instead.

    In while (my $line = <$inFile>) { you could instead use <STDIN> to read lines from the command line.

    You are already using a regular expression so we assume you know something about those. If you don't, ask. The new bit is that $-[$n] gives the 0 based position (index) of the $n'th match.

    substr is used to trim the line to the point of the matched character ready to find the next bad character.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Find element in array
by kcott (Archbishop) on Feb 17, 2020 at 07:04 UTC

    G'day Sofie,

    Welcome to the Monastery.

    "I am trying to check if an input DNA sequence only contains nucleotides."

    That's a good start: you've succinctly stated your main goal.

    "And if it doesn't I want to print out the position in the sequence where an invalid character was entered."

    Excellent: you a have a subtask; also succinctly stated.

    "From title: Find element in array"

    In my opinion, this is where you started to go wrong. You decided that you needed to split the entire sequence into individual characters and assign those to an array; then go back and iterate the entire array checking each individual character. DNA sequences can be exceptionally long — you may be well aware of this — and doing all this extra work is completely unnecesssary for your stated goals.

    Here's a script that does what you want. I've had to make some guesses about the output as you didn't specify that.

    #!/usr/bin/env perl use strict; use warnings; my $DNA = <STDIN>; chomp($DNA); my $lengthseq = length $DNA; print "The length of the sequence is: $lengthseq\n"; my (@nucleotideDNA, @nonvalid); for my $pos (0 .. $lengthseq - 1) { my $nucleotide = substr $DNA, $pos, 1; if ($nucleotide =~ /^[ACGT]$/) { push @nucleotideDNA, $pos+1 . ":\t$nucleotide"; } else { push @nonvalid, $pos+1 . ":\t$nucleotide"; } } print "*** nucleotideDNA ***\n"; print "$_\n" for @nucleotideDNA; print "*** nonvalid ***\n"; print "$_\n" for @nonvalid;

    Here's a sample run:

    $ ./pm_11113020_parse_dna.pl XACGTYTGCAZ The length of the sequence is: 11 *** nucleotideDNA *** 2: A 3: C 4: G 5: T 7: T 8: G 9: C 10: A *** nonvalid *** 1: X 6: Y 11: Z

    You may have noticed that I've structured my code in a similar way to yours. Let's look at the differences.

    • The shebang line, #!... on line one, can be written in various ways; you can read more about that in "perlrun". You'll note that I do not have the "-w" command switch at the end and I recommend that you don't use it either: see "perlrun: Command Switches: -w" for more about that.
    • Next you'll see I've used the "strict" and "warnings" pragmata. You should put those two lines at the top of all your code. See "perlintro: Safety net" for more about that.
    • The next couple of lines are almost identical except that I've used "my" to declare the $DNA variable. If you look down the code, you'll see I've declared all variables the same way. See "perlintro: Perl variable types" for more.
    • I've then skipped creation of the @DNA array, as already discussed; got the length using the "length" function; then printed the result. Note how I've interpolated $lengthseq into the print string.
    • Next, I've declared two array variables in one statement. There's no need to initialise an array; although, some people like to do that — if you do, use an empty list '()', not a zero-length string '""'.
    • Instead of looping through all of the elements of an array, I loop through a range of numbers using the range operator, '..'. See "perlop: Range Operators" for more on that.
    • I access each (potential) nucleotide using "substr".
    • My regex is almost identical to yours except I've omitted the '+': that indicates matching one or more characters and, in each iteration, there's only one character. Use of anchors, '^' and '$' in this case, is good; as a general rule, it will make your regexes more efficient.
    • I've pushed nucleotides onto arrays as you did. I also included the sequence position — note that's one more ($pos+1) than the string position ($pos). As already stated, I made some guesses here because you didn't say exactly what you wanted.
    • Lastly, a series a print statements just shows the results. You'll probably want something different here.
    "... I am very new to perl ..."

    That's fine, we all started knowing nothing about Perl. Note that Perl is the language and perl is the program.

    I recommend you read through "perlintro" and bookmark that page. There's no need to try and learn it all in one sitting; just get a general feel for what it has to offer. It is peppered with links to FAQs, tutorials and more detailed information. Refer back to it whenever the need arises.

    Finally, in case you had some genuine, but unstated, reason to use an array, you could have iterated it like this:

    for my $pos (0 .. $#DNA) { ... }

    Then accessed each element with $DNA[$pos] and reported the position with $pos+1 as I did.

    Using the range operator (..) is a standard way to do this: see "perlop: Range Operators" for details.

    I don't think that's what you wanted, or needed, here. You've at least learned how to do this in a more appropriate scenario at some other time.

    — Ken

Re: Find element in array
by LanX (Saint) on Feb 16, 2020 at 12:43 UTC
    Do you mean the current index of @DNA when looping?

    TIMTOWTDI!

    The traditional way with your code is to use an explicit counter:

    my $idx=-1; for my $nucleotide (@DNA){ $idx++; ... }

    But you can also apply while / each on arrays

    while ( my ($idx, $nucleotide) = each @DNA){ ... }

    (I tapped this message into a mobile, no guaranty for the code.)

    HTH! :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      es the current index of the @DNA. I want to have a warning for each position of an element in the array that contains an invalid character. And then count the number of invalid characters.

Re: Find element in array
by BillKSmith (Monsignor) on Feb 17, 2020 at 04:14 UTC
    Unlike the "C" language, Perl strings are much different from arrays. In Perl, you usually have a choice of which to use. String solutions are usually easier and almost always execute faster. Here is a string-only solution to your problem.
    use strict; use warnings; my $DNA = "ATATCCCGATCAGG3TT!GCA\n"; chomp $DNA; print "The length of the sequence is:\n", length($DNA), "\n"; my $nucleotideDNA = $DNA; #my $count = $nucleotideDNA =~ tr/ATCG]//c; # Remove and count invali +ds my $count = $nucleotideDNA =~ tr/ATCG//cd; # Remove and count invalid +s my $locations; # Find location of invalids in original string $locations .= "$-[0], " while ( $DNA =~ /[^ATCG]/g ); print "There are $count non-valid nucleotides at locations:\n$location +s \n"; OUTPUT: The length of the sequence is: 21 There are 2 non-valid nucleotides at locations: 14, 17,

    UPDATE: Modified one line of code to correct errors identified by AnomalousMonk (below) Original remains as comment.

    Bill
      my $count = $nucleotideDNA =~ tr/ATCG]//c;  # Remove and count invalids

      Sofie:   Note that while this  tr/// (see Quote-Like Operators in perlop) expression counts the number of characters that are not ATCG, it does not remove anything; the string is not changed (update: nor is there any need for change):

      c:\@Work\Perl\monks>perl -wMstrict -le "my $DNA = 'ATATCCCGATCAGG3TT!GCA'; ;; my $nucleotideDNA = $DNA; my $count = $nucleotideDNA =~ tr/ATCG//c; ;; print $DNA; print $nucleotideDNA; ;; print 'sequences are equal' if $DNA eq $nucleotideDNA; " ATATCCCGATCAGG3TT!GCA ATATCCCGATCAGG3TT!GCA sequences are equal
      Also note that there is a  ] character in the set |  tr/// search set that should not be there.

      However, I agree with the main point that BillKSmith is making: string operations with regexes or with operators like substr and index will tend to be significantly faster (update: and to consume significantly less memory) than equivalent array operations.


      Give a man a fish:  <%-{-{-{-<

Re: Find element in array
by clueless newbie (Curate) on Feb 17, 2020 at 00:51 UTC
    #!/usr/bin/env perl # https://www.perlmonks.org/?node_id=11113020 use Data::Dumper; $Data::Dumper::Sortkeys=1; # make it easier +to find "things" in the "dumps" # Yes we'll take all the help we can get use strict; use warnings; # Read the data one line at a time while (<DATA>) { # Same as while +($_=<DATA>) { # And get rid of the $INPUT_LINE_SEPARATOR (\n); chomp; # Same as chomp( +$_); warn Data::Dumper->Dump([\$_],[qw(*_)]),' '; # Let's see what + we have ... there shouldn't be a trailing \n (my $bad=$_)=~ tr/[ATCG]/ /; warn Data::Dumper->Dump([\$_],[qw(*_)]),' '; # So all the val +id ones are gone if ($bad !~ m{^\s+$}) { # Are there any +bad ones? print "line: $.\n" # Yes - so print + the offending line number print "$_\n"; # The offending +line print "$bad\n"; # The offending +character(s) in the line $bad=~ s{\w}{print 1+pos($bad),','}eg; # Look for a "ch +aracter" if you find one print its "location' print "\n"; }; }; __END__ TAAGAACAATAAGAACAA TAAGAACAATAAUAACAA TAYGAACAkTAAGAACzz

    Yields

    perl 3020a.pl 2> nul TAAGAACAATAAUAACAA U 13, TAYGAACAkTAAGAACzz Y k zz 3,9,17,18,

      This reply would be better if it actually compiled. Line 19 (the first print in the if body) is missing a semi-colon - obviously not the code you ran to produce the given output!

      It would also be better to use a manifest variable instead of the default variable in the while statement so both the intent and scope are clearer. Bundling an assignment and operation on a variable into one line probably isn't best practice in an example for someone new to Perl ((my $bad=$_)=~ tr/[ATCG]/ /;).

      Do you really expect $bad=~ s{\w}{print 1+pos($bad),','}eg; to make sense to an entry level Perl user, even with the comment?

      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

        mea culpa, mea culpa, mea maxima culpa!

Re: Find element in array
by Anonymous Monk on Feb 17, 2020 at 01:38 UTC
    The secret-sauce that you need to read more about are the /g and /c modifiers of regular expressions. If you are looking for characters in a string which are not valid, and can devise a regex which identifies them, then these modifiers will let you loop through the same string multiple times finding all the matches. Yes, it turns your original approach to the problem precisely upside-down. There are lots of good examples elsewhere on the Internet, including some particular to DNA strings.