Sofie has asked for the wisdom of the Perl Monks concerning the following question:
But how can I print the position of the non valid character? Not sure if this makes any sense.. Thanks

    #!/usr/bin/perl -w
    $DNA = <STDIN>;
    chomp ($DNA);
    @DNA = split ("", $DNA);
    $lengthseq = scalar @DNA;
    print "The length of the sequence is:\n", $lengthseq, "\n";
    @nucleotideDNA = "";
    #check if each element in array is nucleotide
    foreach $nucleotide (@DNA){
        if ($nucleotide =~ /^[ATCG]+$/){
            push @nucleotideDNA, $nucleotide;
        } else {
            push @nonvalid, $nucleotide;
        }
    }
Replies are listed 'Best First'.
Re: Find element in array
by tobyink (Canon) on Feb 16, 2020 at 12:46 UTC
Instead of matching valid sequences, match invalid characters. Then use $-[0] to find the position of that match. (The @- array is documented in the "perlvar" manual page.)
You don't need to split the sequence up into individual characters and process each one separately. That's slow.
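A minimal sketch of this suggestion follows. It is not tobyink's original snippet, which is not preserved here; the helper name first_invalid_pos and the sample sequence are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 0-based offset of the first character
# that is not A, T, C or G, or undef if the whole sequence is valid.
sub first_invalid_pos {
    my ($dna) = @_;
    # $-[0] holds the offset where the last successful match started.
    return $dna =~ /[^ATCG]/ ? $-[0] : undef;
}

my $dna = 'TAAGAACAATAAUAACAA';    # sample input with one invalid 'U'
my $pos = first_invalid_pos($dna);

if (defined $pos) {
    # Report a 1-based position, which is usually what a person expects.
    printf "Invalid character '%s' at position %d\n",
        substr($dna, $pos, 1), $pos + 1;
}
else {
    print "Sequence is valid\n";
}
```

For the sample sequence this prints "Invalid character 'U' at position 13". No array is built, so memory use stays flat even for long sequences.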
by LanX (Saint) on Feb 16, 2020 at 13:29 UTC
Update
Sequence 'TAAGAACAATAAUAACAA' in line 2 has invalid character after 12 at parse_dna.pl line 7, <DATA> line 2.
Cheers Rolf
by tobyink (Canon) on Feb 17, 2020 at 13:58 UTC
Take a look at that warning message:
The warn function already includes $. in its output, unless your message ends in "\n".
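The difference can be seen in a small sketch. The input text, the message wording and the use of a __WARN__ handler to capture the messages are assumptions made for illustration, not code from this thread.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them to STDERR, so both message
# styles can be compared side by side.
my @captured;
local $SIG{__WARN__} = sub { push @captured, $_[0] };

my $text = "ACGT\nACGU\n";    # made-up input: line 2 holds an invalid 'U'
open my $in, '<', \$text or die "Can't open in-memory file: $!\n";

while (my $line = <$in>) {
    chomp $line;
    next unless $line =~ /[^ATCG]/;
    warn "Invalid character in '$line'";      # no newline: perl appends script and input line info
    warn "Invalid character in '$line'\n";    # trailing newline: message is emitted exactly as written
}
close $in;

print for @captured;
```

The first message ends with the script location and the input line number taken from $., while the second contains only the text that was passed to warn.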
by LanX (Saint) on Feb 17, 2020 at 14:42 UTC
by Sofie (Acolyte) on Feb 16, 2020 at 14:56 UTC
by LanX (Saint) on Feb 16, 2020 at 18:01 UTC
by Sofie (Acolyte) on Feb 16, 2020 at 13:22 UTC
Re: Find element in array
by johngg (Canon) on Feb 16, 2020 at 15:28 UTC
Does this do what you want? There is no need to split the sequence into an array as pos will allow you to find where in a string a match has been made. Note that [^ACGT] is a negative character class, i.e. match anything that isn't A, C, G or T. Using capturing parentheses, ( ... ), and matching globally, m{ ... }g or / ... /g will advance along the sequence looking for invalid letters. I am opening a file that is held inside the script just to keep things tidy on my system but the code will work fine with STDIN. The code.
The output.
I hope this is helpful. Please ask further if you need more help.
Update: There was a mistake in the code. I should have used a look-ahead assertion, as without that pos gives the position after the match, not that of the match itself. Added extended syntax ((?x)) to make the regex clearer. My bad :-(
Update 2: I should also have corrected the output, now done.
Cheers, JohnGG
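Since the code from this reply is not preserved here, the following is only a sketch of the technique it describes: a global match with a look-ahead, read back with pos. The helper name invalid_positions and the sample sequence are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 0-based offsets of every character
# that is not A, C, G or T.
sub invalid_positions {
    my ($seq) = @_;
    my @positions;
    # (?= ... ) is a zero-width look-ahead, so after each match pos()
    # reports where the invalid letter starts, not the offset just past it.
    while ($seq =~ m{(?x) (?= ( [^ACGT] ) ) }g) {
        push @positions, pos($seq);
    }
    return @positions;
}

my $sequence = 'ACGUTXA';    # sample input with two invalid letters
for my $offset (invalid_positions($sequence)) {
    printf "Invalid letter '%s' at position %d\n",
        substr($sequence, $offset, 1), $offset + 1;
}
```

Because every match is zero-width, Perl's regex engine moves on by itself after each one, so the loop terminates without any manual adjustment of pos.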
Re: Find element in array
by GrandFather (Saint) on Feb 16, 2020 at 20:30 UTC
You have already been given the pieces you need, but they don't fit your hand and you haven't shown us what you have tried when you say "I still don't get it to work". The code below is a slightly more fully worked example using suggestions you have already been given:
Prints:
This is a "Simple Self Contained Example". You can run the code without needing anything else. You should first copy this code (cut and paste is highly recommended) and check that it works yourself. Then play with it until you have some understanding of how it works. Then adapt it to your own needs.
There are some important things there. Note the use of strictures (use strict; use warnings;). Always use strictures in your code!
The my $inputLines = <<LINES; and following lines create a variable initialised with multiple lines of text. That is used in open my $inFile, '<', \$inputLines or die "Can't open file: $!\n"; as a file. You can replace \$inputLines with a file name to open a file instead. In while (my $line = <$inFile>) { you could instead use <STDIN> to read lines from the command line.
You are already using a regular expression so we assume you know something about those. If you don't, ask. The new bit is that $-[$n] gives the 0 based position (index) of the start of the $n'th captured group, with $-[0] being the start of the whole match. substr is used to trim the line to the point of the matched character ready to find the next bad character.
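The pieces described above fit together roughly as in the sketch below. It is not GrandFather's original code, which is not preserved here: the sample lines, the @found array and the $offset bookkeeping are assumptions.

```perl
use strict;
use warnings;

# Made-up sample data: the second line contains two bad characters.
my $inputLines = <<LINES;
ATCGATCG
ATXGAYCG
LINES

# Open the string as if it were a file; swap \$inputLines for a file name
# to read real data.
open my $inFile, '<', \$inputLines or die "Can't open file: $!\n";

my @found;    # one [line number, character, 1-based position] entry per bad character
while (my $line = <$inFile>) {
    chomp $line;
    my $offset = 0;    # number of characters already trimmed from the front
    while ($line =~ /[^ATCG]/) {
        my $index = $-[0];    # 0-based start of the match in the trimmed line
        push @found, [$., substr($line, $index, 1), $offset + $index + 1];
        $offset += $index + 1;
        $line = substr $line, $index + 1;    # drop everything up to and including the bad character
    }
}
close $inFile;

printf "Line %d: bad character '%s' at position %d\n", @$_ for @found;
```

Trimming with substr means each search restarts just after the previous bad character, so $offset is needed to translate back to positions in the original line.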
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Find element in array
by kcott (Archbishop) on Feb 17, 2020 at 07:04 UTC
G'day Sofie, Welcome to the Monastery.
"I am trying to check if an input DNA sequence only contains nucleotides."
That's a good start: you've succinctly stated your main goal.
"And if it doesn't I want to print out the position in the sequence where an invalid character was entered."
Excellent: you have a subtask, also succinctly stated.
"From title: Find element in array"
In my opinion, this is where you started to go wrong. You decided that you needed to split the entire sequence into individual characters and assign those to an array, then go back and iterate the entire array checking each individual character. DNA sequences can be exceptionally long (you may be well aware of this) and doing all this extra work is completely unnecessary for your stated goals.
Here's a script that does what you want. I've had to make some guesses about the output as you didn't specify that.
Here's a sample run:
You may have noticed that I've structured my code in a similar way to yours. Let's look at the differences.
"... I am very new to perl ..." That's fine, we all started knowing nothing about Perl. Note that Perl is the language and perl is the program. I recommend you read through "perlintro" and bookmark that page. There's no need to try and learn it all in one sitting; just get a general feel for what it has to offer. It is peppered with links to FAQs, tutorials and more detailed information. Refer back to it whenever the need arises. Finally, in case you had some genuine, but unstated, reason to use an array, you could have iterated it like this:
Then accessed each element with $DNA[$pos] and reported the position with $pos+1 as I did. Using the range operator (..) is a standard way to do this: see "perlop: Range Operators" for details. I don't think that's what you wanted, or needed, here. At least you've now learned how to do this for a more appropriate scenario at some other time.
-- Ken
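The index-based iteration kcott describes can be sketched as below. The original snippet is not preserved here, so the sample sequence and the @bad array are assumptions; only $DNA[$pos] and $pos+1 come from the reply itself.

```perl
use strict;
use warnings;

my @DNA = split //, 'ACGUTA';    # sample sequence with one invalid 'U'

my @bad;    # 1-based positions of invalid characters
# 0 .. $#DNA generates every valid index of @DNA via the range operator.
for my $pos (0 .. $#DNA) {
    next if $DNA[$pos] =~ /^[ACGT]$/;
    push @bad, $pos + 1;
    print "Invalid character '$DNA[$pos]' at position ", $pos + 1, "\n";
}
```

For the sample sequence this reports the 'U' at position 4.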
Re: Find element in array
by LanX (Saint) on Feb 16, 2020 at 12:43 UTC
TIMTOWTDI! The traditional way with your code is to use an explicit counter:
But you can also apply while / each on arrays
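Both variants can be sketched as follows. This is a reconstruction under assumptions rather than LanX's original code; the sample sequence and variable names are made up, and each on an array needs Perl 5.12 or later.

```perl
use strict;
use warnings;

my @DNA = split //, 'ACGUTN';    # sample sequence with invalid 'U' and 'N'

# Variant 1: keep an explicit counter alongside foreach.
my $index = 0;
my @bad_counter;
foreach my $nucleotide (@DNA) {
    push @bad_counter, $index if $nucleotide !~ /^[ATCG]$/;
    $index++;
}

# Variant 2: each on an array returns (index, value) pairs.
my @bad_each;
while (my ($i, $nucleotide) = each @DNA) {
    push @bad_each, $i if $nucleotide !~ /^[ATCG]$/;
}

print "Counter variant, invalid at indices: @bad_counter\n";
print "each variant, invalid at indices: @bad_each\n";
print "Number of invalid characters: ", scalar @bad_each, "\n";
```

Both loops report the 0-based indices 3 and 5 for the sample sequence, and the size of either array gives the count of invalid characters.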
(I tapped this message into a mobile, no guarantee for the code.) HTH! :)
Cheers Rolf
by Sofie (Acolyte) on Feb 16, 2020 at 13:25 UTC
es the current index of the @DNA. I want to have a warning for each position of an element in the array that contains an invalid character. And then count the number of invalid characters.
Re: Find element in array
by BillKSmith (Monsignor) on Feb 17, 2020 at 04:14 UTC
UPDATE: Modified one line of code to correct errors identified by AnomalousMonk (below) Original remains as comment.
Bill
by AnomalousMonk (Archbishop) on Feb 17, 2020 at 06:23 UTC
my $count = $nucleotideDNA =~ tr/ATCG]//c; # Remove and count invalids
Sofie: Note that while this tr/// expression (see Quote-Like Operators in perlop) counts the number of characters that are not ATCG, it does not remove anything; the string is not changed (update: nor is there any need for change):
Also note that there is a ] character in the search list.
However, I agree with the main point that BillKSmith is making: string operations with regexes or with operators like substr and index will tend to be significantly faster (update: and to consume significantly less memory) than equivalent array operations.
Give a man a fish: <%-{-{-{-<
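The tr/// behaviour discussed in this reply can be checked with a short sketch. The sample string, which deliberately ends in a ] character, and the variable names other than $nucleotideDNA and $count are assumptions for illustration.

```perl
use strict;
use warnings;

my $nucleotideDNA = 'TAAGAACAATAAUAACAA]';    # sample: one 'U' and one stray ']'
my $before = $nucleotideDNA;

# With an empty replacement list and no /d, tr/// only counts.
# The /c flag complements the search list: "everything except A, T, C, G".
my $count = $nucleotideDNA =~ tr/ATCG//c;

# With a stray ']' in the search list, ']' is treated as a valid character
# and is therefore no longer counted.
my $count_typo = $nucleotideDNA =~ tr/ATCG]//c;

print "Invalid characters: $count\n";
print "Invalid characters with ']' in the list: $count_typo\n";
print "String unchanged: ", ($nucleotideDNA eq $before ? "yes" : "no"), "\n";
```

Here $count is 2 (the U and the ]), $count_typo is 1, and the string is the same before and after, which matches the point that tr/// in this form counts without removing anything.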
Re: Find element in array
by clueless newbie (Curate) on Feb 17, 2020 at 00:51 UTC
Yields
by GrandFather (Saint) on Feb 17, 2020 at 03:35 UTC
This reply would be better if it actually compiled. Line 19 (the first print in the if body) is missing a semi-colon - obviously not the code you ran to produce the given output! It would also be better to use a manifest variable instead of the default variable in the while statement so both the intent and scope are clearer. Bundling an assignment and operation on a variable into one line probably isn't best practice in an example for someone new to Perl ((my $bad=$_)=~ tr/[ATCG]/ /;). Do you really expect $bad=~ s{\w}{print 1+pos($bad),','}eg; to make sense to an entry level Perl user, even with the comment?
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
by clueless newbie (Curate) on Feb 17, 2020 at 15:12 UTC
mea culpa, mea culpa, mea maxima culpa!
Re: Find element in array
by Anonymous Monk on Feb 17, 2020 at 01:38 UTC