Your skill will accomplishwhat the force of many cannot PerlMonks

How can I get correct result in counting 3-letter words?

by supriyoch_2008 (Monk)
 on Apr 22, 2012 at 19:31 UTC Need Help??
supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks,

I am a beginner in perl programming. I prepared a perl program to count the number of single letters and 3-letter words from a scalar variable \$a. I got correct results in counting single letters but wrong results for 3-letter words. Each 3-letter word occurs only once in the string with a few spaces and 2 Ns but my program is showing inflated values for 3-letter words due to overlapping effect of letters. I have given the wrong results that I got from the perl program and also the correct results that I expect.

I look forward to your suggestions to correct the program.

```#!/usr/bin/perl -w
# Counting 3-letter words in a string
\$a="NTCA TCC TCGTCTTTCTTTTTATTGTACTATTAATAGTGCTGTTGATGGCTACTCCTGCTTCCA
+CAT
CAACAGCGACGCCGGCGTATAATCATTATGACAACCACGACTAACAATAAAAAGAGCAGTAGAAGGCCCC
+CGC
CTCACGTAGTCGTGGTTGCAGCCGCGGCTGACGATGAAGAGGGAGGCGGGGGT N";
# Remove whitespace
\$a=~ s/\s//g;# Line
# Count number of 3-letter word
\$b=length(\$a);
# Initialization
\$A=0;\$T=0;\$G=0;\$C=0;\$e=0;
\$TCA=0;\$TCC=0;\$TCG=0;\$TCT=0;\$TTC=0;\$TTT=0;\$TTA=0;\$TTG=0;\$TAC=0;\$TAT=0;
\$TAA=0;\$TAG=0;\$TGC=0;\$TGT=0;\$TGA=0;\$TGG=0;\$CTA=0;\$CTC=0;\$CTG=0;\$CTT=0;
\$CCA=0;\$CAT=0;\$CAA=0;\$CAG=0;\$CGA=0;\$CGC=0;\$CGG=0;\$CGT=0;\$ATA=0;\$ATC=0;
\$ATT=0;\$ATG=0;\$ACA=0;\$ACC=0;\$ACG=0;\$ACT=0;\$AAC=0;\$AAT=0;\$AAA=0;\$AAG=0;
\$AGC=0;\$AGT=0;\$AGA=0;\$AGG=0;\$CCC=0;\$CCG=0;\$CCT=0;\$CAC=0;\$GTA=0;\$GTC=0;
\$GTG=0;\$GTT=0;\$GCA=0;\$GCC=0;\$GCG=0;\$GCT=0;\$GAC=0;\$GAT=0;\$GAA=0;\$GAG=0;
\$GGA=0;\$GGC=0;\$GGG=0;\$GGT=0;
while(\$a=~ /A/ig){\$A++}
while(\$a=~ /T/ig){\$T++}
while(\$a=~ /G/ig){\$G++}
while(\$a=~ /C/ig){\$C++}
while(\$a=~ /[^ATGC]/ig){\$e++}
while(\$a=~ /(TCA)/ig){\$TCA++}
while(\$a=~ /(TCC)/ig){\$TCC++}
while(\$a=~ /(TCG)/ig){\$TCG++}
while(\$a=~ /(TCT)/ig){\$TCT++}
while(\$a=~ /(TTC)/ig){\$TTC++}
while(\$a=~ /(TTT)/ig){\$TTT++}
while(\$a=~ /(TTA)/ig){\$TTA++}
while(\$a=~ /(TTG)/ig){\$TTG++}
while(\$a=~ /(TAC)/ig){\$TAC++}
while(\$a=~ /(TAT)/ig){\$TAT++}
while(\$a=~ /(TAA)/ig){\$TAA++}
while(\$a=~ /(TAG)/ig){\$TAG++}
while(\$a=~ /(TGC)/ig){\$TGC++}
while(\$a=~ /(TGT)/ig){\$TGT++}
while(\$a=~ /(TGA)/ig){\$TGA++}
while(\$a=~ /(TGG)/ig){\$TGG++}
while(\$a=~ /(CTA)/ig){\$CTA++}
while(\$a=~ /(CTC)/ig){\$CTC++}
while(\$a=~ /(CTG)/ig){\$CTG++}
while(\$a=~ /(CTT)/ig){\$CTT++}
while(\$a=~ /(CCA)/ig){\$CCA++}
while(\$a=~ /(CAT)/ig){\$CAT++}
while(\$a=~ /(CAA)/ig){\$CAA++}
while(\$a=~ /(CAG)/ig){\$CAG++}
while(\$a=~ /(CGA)/ig){\$CGA++}
while(\$a=~ /(CGC)/ig){\$CGC++}
while(\$a=~ /(CGG)/ig){\$CGG++}
while(\$a=~ /(CGT)/ig){\$CGT++}
while(\$a=~ /(ATA)/ig){\$ATA++}
while(\$a=~ /(ATC)/ig){\$ATC++}
while(\$a=~ /(ATT)/ig){\$ATT++}
while(\$a=~ /(ATG)/ig){\$ATG++}
while(\$a=~ /(ACA)/ig){\$ACA++}
while(\$a=~ /(ACC)/ig){\$ACC++}
while(\$a=~ /(ACG)/ig){\$ACG++}
while(\$a=~ /(ACT)/ig){\$ACT++}
while(\$a=~ /(AAC)/ig){\$AAC++}
while(\$a=~ /(AAT)/ig){\$AAT++}
while(\$a=~ /(AAA)/ig){\$AAA++}
while(\$a=~ /(AAG)/ig){\$AAG++}
while(\$a=~ /(AGC)/ig){\$AGC++}
while(\$a=~ /(AGT)/ig){\$AGT++}
while(\$a=~ /(AGA)/ig){\$AGA++}
while(\$a=~ /(AGG)/ig){\$AGG++}
while(\$a=~ /(CCC)/ig){\$CCC++}
while(\$a=~ /(CCG)/ig){\$CCG++}
while(\$a=~ /(CCT)/ig){\$CCT++}
while(\$a=~ /(CAC)/ig){\$CAC++}
while(\$a=~ /(GTA)/ig){\$GTA++}
while(\$a=~ /(GTC)/ig){\$GTC++}
while(\$a=~ /(GTG)/ig){\$GTG++}
while(\$a=~ /(GTT)/ig){\$GTT++}
while(\$a=~ /(GCA)/ig){\$GCA++}
while(\$a=~ /(GCC)/ig){\$GCC++}
while(\$a=~ /(GCG)/ig){\$GCG++}
while(\$a=~ /(GCT)/ig){\$GCT++}
while(\$a=~ /(GAC)/ig){\$GAC++}
while(\$a=~ /(GAT)/ig){\$GAT++}
while(\$a=~ /(GAA)/ig){\$GAA++}
while(\$a=~ /(GAG)/ig){\$GAG++}
while(\$a=~ /(GGA)/ig){\$GGA++}
while(\$a=~ /(GGC)/ig){\$GGC++}
while(\$a=~ /(GGG)/ig){\$GGG++}
while(\$a=~ /(GGT)/ig){\$GGT++}
print"\n No. of single letters:\n
A=\$A; T=\$T; G=\$G; C=\$C; Error=\$e.\n
Length: \$b.\n
No. of 3-letter words:\n
TCA=\$TCA; TCC=\$TCC; TCG=\$TCG; TCT=\$TCT; TTC=\$TTC;\n
TTT=\$TTT; TTA=\$TTA; TTG=\$TTG; TAC=\$TAC; TAT=\$TAT;\n
TAA=\$TAA; TAG=\$TAG; TGC=\$TGC; TGT=\$TGT; TGA=\$TGA;\n
TGG=\$TGG; CTA=\$CTA; CTC=\$CTC; CTG=\$CTG; CTT=\$CTT;\n
CCA=\$CCA; CAT=\$CAT; CAA=\$CAA; CAG=\$CAG; CGA=\$CGA;\n
CGC=\$CGC; CGG=\$CGG; CGT=\$CGT; ATA=\$ATA; ATC=\$ATC;\n
ATT=\$ATT; ATG=\$ATG; ACA=\$ACA; ACC=\$ACC; ACG=\$ACG;\n
ACT=\$ACT; AAC=\$AAC; AAT=\$AAT; AAA=\$AAA; AAG=\$AAG;\n
AGC=\$AGC; AGT=\$AGT; AGA=\$AGA; AGG=\$AGG; CCC=\$CCC;\n
CCG=\$CCG; CCT=\$CCT; CAC=\$CAC; GTA=\$GTA; GTC=\$GTC;\n
GTG=\$GTG; GTT=\$GTT; GCA=\$GCA; GCC=\$GCC; GCG=\$GCG;\n
GCT=\$GCT; GAC=\$GAC; GAT=\$GAT; GAA=\$GAA; GAG=\$GAG;\n
GGA=\$GGA; GGC=\$GGC; GGG=\$GGG; GGT=\$GGT.\n";
exit;
```Results obtained (wrong):

Microsoft Windows [Version 6.1.7600]

C:\Users\ >cd desktop

C:\Users\ \Desktop>wolf.pl

No. of single letters:

A=48; T=48; G=48; C=48; Error=2.

Length: 194.

No. of 3-letter words:

TCA=4; TCC=3; TCG=2; TCT=2; TTC=2;

TTT=2; TTA=3; TTG=3; TAC=2; TAT=4;

TAA=4; TAG=3; TGC=3; TGT=2; TGA=4;

TGG=2; CTA=3; CTC=3; CTG=3; CTT=3;

CCA=2; CAT=3; CAA=3; CAG=3; CGA=3;

CGC=3; CGG=3; CGT=4; ATA=3; ATC=3;

ATT=3; ATG=3; ACA=4; ACC=1; ACG=4;

ACT=3; AAC=3; AAT=3; AAA=1; AAG=3;

AGC=3; AGT=3; AGA=3; AGG=3; CCC=1;

CCG=3; CCT=3; CAC=3; GTA=4; GTC=2;

GTG=2; GTT=2; GCA=2; GCC=4; GCG=4;

GCT=4; GAC=4; GAT=2; GAA=2; GAG=3;

GGA=1; GGC=5; GGG=2; GGT=2.
```Correct Results should be like this:

No. of single letters:
A=48; T=48; G=48; C=48; Error=2.
Length: 194.

No. of 3-letter words:

TCA=1; TCC=1; TCG=1; TCT=1; TTC=1;

TTT=1; TTA=1; TTG=1; TAC=1; TAT=1;

TAA=1; TAG=1; TGC=1; TGT=1; TGA=1;

TGG=1; CTA=1; CTC=1; CTG=1; CTT=1;

CCA=1; CAT=1; CAA=1; CAG=1; CGA=1;

CGC=1; CGG=1; CGT=1; ATA=1; ATC=1;

ATT=1; ATG=1; ACA=1; ACC=1; ACG=1;

ACT=1; AAC=1; AAT=1; AAA=1; AAG=1;

AGC=1; AGT=1; AGA=1; AGG=1; CCC=1;

CCG=1; CCT=1; CAC=1; GTA=1; GTC=1;

GTG=1; GTT=1; GCA=1; GCC=1; GCG=1;

GCT=1; GAC=1; GAT=1; GAA=1; GAG=1;

GGA=1; GGC=1; GGG=1; GGT=1.

Replies are listed 'Best First'.
Re: How can I get correct result in counting 3-letter words?
by ikegami (Pope) on Apr 22, 2012 at 20:02 UTC

You did not constrain where any of the triplets could match. You could replace

```while (\$seq =~ /GCT/ig) { \$GCT++; }

with

```while (\$seq =~ /\G(?:...)*?GCT/sig) { \$GCT++; }

or with

```while (\$seq =~ /\G(...)/sg) { \$GCT++ if uc(\$1) eq 'GCT'; }

but lets take one further and use

```while (\$seq =~ /\G(...)/sg) { \$counts{uc(\$1)}++; }

```my %counts;
++counts{uc(\$_)} for \$seq =~ /.../sg;

for       my \$l1 (qw( T C A G )) {
for    my \$l2 (qw( T C A G )) {
for my \$l3 (qw( T C A G )) {
my \$k = "\$l1\$l2\$l3";
my \$v = \$counts{\$k} || 0;
print("\$k=\$v;");
}
print("\n");
}
}
That does not complile (missing \$ sigil).

did you mean the second line to read

```++\$counts{uc(\$_)} for \$a =~ /[TCAG]{3}/sg;
With that, I get the counts the OP expects.

All great truths begin as blasphemies.
― George Bernard Shaw, writer, Nobel laureate (1856-1950)

I prefer the function "variations_with_repetition" in the module Algorithm::Combinatorics rather than the do-it-yourself approach of generating the \$k's.

In my benchmarks, substr is about twice as fast as a /..../g regex for getting the next X characters:

```cail:~/work/perl/monks\$ cat 966488.pl
#!/usr/bin/env perl
use Modern::Perl;
use Benchmark qw(:all);

my \$string = '';
for (1..1_000_000){  # make a million-char string
\$string .= qw(A C G T)[rand(4)];
}
cmpthese( 100, {
'regex'     => \&regex,
'substring' => \&substring,
});

sub substring {
my \$str = \$string;
my %h;
while(length(\$str) % 3){    # snip to 3-letter boundary
substr(\$str,-1, 1, '');
}
while(\$_ = substr(\$str,0,3,'')){
\$h{\$_}++;
}
}

sub regex {
my \$str = \$string;
my %h;
for (\$str =~ /.../g){
\$h{\$_}++;
}
}
cail:~/work/perl/monks\$ perl 966488.pl
Rate     regex substring
regex     5.78/s        --      -49%
substring 11.4/s       97%        --

Of course, if you only want to match certain letters, then you're back to a regex. But in that case, I might still try stripping out all the stuff I don't want with tr//, followed by substr to break it into pieces.

Aaron B.
My Woefully Neglected Blog, where I occasionally mention Perl.

Humm, you didn't actually use my code. (Not that the results would be visibly different.)

Your suggestion to remove the offending letters is broken if said letters can appear anywhere but the beginning and end of the string. "AAAGNTTT" should give "AAA", "TTT", but you're algorithm would give "AAA" and "GTT".

Re: How can I get correct result in counting 3-letter words?
by 2teez (Vicar) on Apr 23, 2012 at 01:45 UTC

Hash to the rescue.

However, some 3 letter words occur more than onces, it shows correctly!
```#!/usr/bin/perl
use warnings;
use strict;

my \$a="NTCA TCC TCGTCTTTCTTTTTATTGTACTATTAATAGTGCTGTTGATGGCTACTCCTGCTT
+CCACAT
CAACAGCGACGCCGGCGTATAATCATTATGACAACCACGACTAACAATAAAAAGAGCAGTAGAAGGCCCC
+CGC
CTCACGTAGTCGTGGTTGCAGCCGCGGCTGACGATGAAGAGGGAGGCGGGGGT N";

\$a=~s/\s+?//g;
my %has_dat;
foreach my \$alpha(split//,\$a){
\$has_dat{\$alpha}++ if \$alpha eq  ('A'|'T'|'G'|'C') or \$alpha ne
+ ('A'|'T'|'G'|'C');
}
print \$_,'=',\$has_dat{\$_},' ' for keys %has_dat;

print "\nLength: ",length(\$a),\$/;

%has_dat=();
while(\$a=~m/\G(\w{3})/gc){
\$has_dat{\$1}++;
}
my \$counter=0;
foreach (sort keys %has_dat){
if(\$counter==5){print "\n";\$counter=0;}
print \$_,'=',\$has_dat{\$_},' ';++\$counter;
}
Re: How can I get correct result in counting 3-letter words?
by johngg (Abbot) on Apr 23, 2012 at 10:28 UTC

As 2teez points out, some three-letter words do occur more than once and a hash for counts rather than individual scalars is the way to go. You could use the glob function for initialization, saving a lot of fiddly, error-prone typing.

```my %threeLtrCts =
map { \$_ => 0 }
glob q{{A,T,G,C}{A,T,G,C}{A,T,G,C}};

Cheers,

JohnGG

Create A New User
Node Status?
node history
Node Type: perlquestion [id://966488]
Approved by Happy-the-monk
help
Chatterbox?
and cookies bake in the oven...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (3)
As of 2018-05-26 04:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
World peace can best be achieved by:

Results (192 votes). Check out past polls.

Notices?