Using grep in a scalar context

by newbie1991 (Acolyte)
on Feb 06, 2013 at 12:15 UTC (#1017401=perlquestion)
newbie1991 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks! I'm trying to check whether the current array matches a preset pair and to store the number of occurrences of that pair in one element of a matrix. I've tried the following grep statement and it's not working. I think I know why (evaluating to true or false will not increment the counter each time, just replace it), but can someone help me out with what to do instead?

$FMat[0][0]=grep(/AA/, @dipept);

Essentially I want the [0][0] element to count how many times AA appears in my input sequence. I have already segmented it up into pairs (hence the @dipept, it only has 2 elements). As always, I'm just starting out and all help is appreciated. Thankyouu. :) PS : I'm a little shaky with hashes so if you suggest hashes please do expand on it a little.

Replies are listed 'Best First'.
Re: Using grep in a scalar context
by choroba (Chancellor) on Feb 06, 2013 at 12:52 UTC
    Works for me:
    perl -E '@dipept = qw/XX AA BB CC DDAAD AAAAAAA/; $FMat[0][0] = grep(/AA/, @dipept); say $FMat[0][0]'
    3
    The documentation of grep clearly states:
    In scalar context, returns the number of times the expression was true.
    What input do you have? What output do you expect?
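To make the documented behaviour concrete, here is a minimal sketch of the same grep in list and in scalar context. The helper name `count_matching_elements` is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: how many elements match /AA/ at least once.
sub count_matching_elements {
    my @items = @_;
    my $count = grep { /AA/ } @items;   # scalar assignment, so grep returns a count
    return $count;
}

my @dipept = qw(XX AA BB CC DDAAD AAAAAAA);

my @hits  = grep { /AA/ } @dipept;              # list context: the matching elements
my $count = count_matching_elements(@dipept);   # scalar context: how many matched

print "matching elements: @hits\n";   # AA DDAAD AAAAAAA
print "count: $count\n";              # 3
```

Note that this counts elements, not occurrences: 'AAAAAAA' contributes 1 no matter how many times AA appears inside it.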
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      My input is a string of amino acid bases, essentially it looks like MHDLKNDHASDRWT and I am counting pair occurrences throughout. MH HD DL LK KN ND are all accepted and should be counted.
Re: Using grep in a scalar context
by AnomalousMonk (Chancellor) on Feb 06, 2013 at 13:16 UTC

    You say the approach you are using is "not working", leaving us to guess what data you are using and what result you expect.

    My guesses are that either you want to count the number of strings in a dataset (i.e., an array) in which a pattern occurs at least once, or you want to count the total occurrences of a pattern in all strings in a dataset.

    The code you posted seems to serve for the first purpose. A variation using map seems to take care of the other. (Note that the pattern occurs twice in 'zAAzAAz'.) In neither case is the original dataset changed.

    >perl -wMstrict -le
    "my @FMat;
     my @dipept = qw(xAAx yAAy xyzzy zAAzAAz zzzz);
     ;;
     $FMat[0][0] = grep(/AA/, @dipept);
     print qq{grepped: $FMat[0][0]};
     ;;
     $FMat[0][0] = map /AA/g, @dipept;
     print qq{mapped: $FMat[0][0]};
     ;;
     print qq{@dipept};
    "
    grepped: 3
    mapped: 4
    xAAx yAAy xyzzy zAAzAAz zzzz

    Update: Do we need to consider the question of overlapping versus non-overlapping pattern matches? I.e., How many matches are there in 'xAAAx'?
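One way to see the difference for 'xAAAx' is sketched below. Both sub names are hypothetical; the only change between them is wrapping the pattern in a zero-width look-ahead so that a match consumes no characters.

```perl
use strict;
use warnings;

# Non-overlapping: each match consumes both characters of the pair.
sub count_non_overlapping {
    my ($string, $pair) = @_;
    my $count = () = $string =~ /\Q$pair\E/g;      # list assignment in scalar context yields the match count
    return $count;
}

# Overlapping: the look-ahead matches without consuming, so the
# regex engine retries from the very next character.
sub count_overlapping {
    my ($string, $pair) = @_;
    my $count = () = $string =~ /(?=\Q$pair\E)/g;
    return $count;
}

print count_non_overlapping('xAAAx', 'AA'), "\n";   # 1
print count_overlapping('xAAAx', 'AA'), "\n";       # 2
```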

      I'm counting total occurrences of a pattern in the dataset. Input in array format is M H D L N with each element being one letter. Output should count how many times MH, HD, DL, etc. appear. The input is MUCH longer (it's a amino acid sequence). And yes, overlaps are considered. AAA has 2 matches.
        I am still not sure what your input is, but I hope you might find one of the following two solutions helpful:
        use Data::Dumper;

        my $string = 'MHDLKNDHASDRWT';
        my %count_string;
        $count_string{$_}++ for $string =~ /(?=(..))/g; # Uses look-ahead to only progress by one character.

        my @array = split //, $string;
        my %count_array;
        $count_array{ join q(), @array[$_, $_ + 1] }++ for 0 .. $#array - 1;

        print Dumper \%count_string, \%count_array;
        Note: /AA/ matches the capital letter A followed by the capital letter A. It does not stand for "anything" in regular expressions.

        Updated: Added the hashes.

        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Using grep in a scalar context
by tobyink (Abbot) on Feb 06, 2013 at 13:49 UTC

    Are you maybe looking for something like this?

    use v5.12;
    use List::Util qw(sum);

    my @dipept = qw( AABBCC AABBAA );
    my $count = sum map { scalar(my @r = /AA/g) } @dipept;
    say $count;

    (Outputs 3, because "AA" occurs three times in @dipept; once in the first string, twice in the second.)

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name

      That expansive map statement is rather extravagant, given ...

      my $count = scalar map { /AA/g } @dipept;

      ... would produce the same result. Am I missing something?

        is rather extravagant ... would produce the same result. Am I missing something?

        Not really, although, a clarity argument could be made

        map is mostly used for generating lists, so sometimes it feels like the wrong tool for counting ( is the use of map in a void context deprecated ? , What's wrong with using grep or map in a void context? )

        tobyink improves clarity by generating a list of counts and adding/summing them (makes an array of matches, scalar array is count), map made a list, feels good :)

        you rely on generating a list of matches ( m//g) , and that map in scalar context returns a count

        Is there a performance advantage? Penalty? Was map-in-scalar as expensive as map-in-void (before perl v5.8.1)?
        It doesn't really matter as the reason for using map over foreach is clarity/brevity/tradition. The basic intent echoed in all the manuals and books, map for transforming lists, grep for filtering lists, foreach(for) for counting (iterating).

        Compare the "three"

        my @dipept = qw( AABBCC AABBAA );
        use v5.12;
        use List::Util qw(sum);

        my $count1 = sum map { scalar(my @r = /AA/g) } @dipept;
        my $count2 = scalar map { /AA/g } @dipept;

        my $count3 = 0;
        $count3 += /AA/g for @dipept;

        my $count5 = 0;
        map { $count5 += /AA/g } @dipept;

        my $count6 = 0;
        grep { $count6 += /AA/g } @dipept;

        If you're thinking in terms of map and grep, thinking perlishly, then you write it like it makes the most sense to you, feels most natural, reads instantly and effortlessly (reads like breathing), needs no thought ...

        so what you're missing, is tobyink's brains

        I simply didn't think about doing it that way. I could probably count the number of times I've used map in a scalar context on my fingers.

        That said, given that the original question was about gene sequences, which can be pretty long strings, it seems preferable to avoid the creation of a temporary array containing all matches. Whether that's an important concern or not depends on many factors (typical number of matches expected; typical length of matched strings; etc).
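A minimal sketch of counting without building any list of matches, assuming non-overlapping occurrences are wanted: a scalar-context `m//g` in a `while` loop walks the string one match at a time. The sub name `count_occurrences` is hypothetical.

```perl
use strict;
use warnings;

# Count every non-overlapping occurrence of $pattern across all strings,
# without materialising a list of the matched substrings.
sub count_occurrences {
    my ($pattern, @strings) = @_;
    my $total = 0;
    for my $string (@strings) {
        # In scalar context, m//g resumes from pos($string) on each call
        # and returns false (resetting pos) once there are no more matches.
        $total++ while $string =~ /$pattern/g;
    }
    return $total;
}

print count_occurrences(qr/AA/, qw(AABBCC AABBAA)), "\n";   # 3
```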

        package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
Re: Using grep in a scalar context
by perlhappy (Novice) on Feb 06, 2013 at 16:12 UTC

    Hi newbie1991,

    I have a good idea of what you want to do and the data you're dealing with (I do quite a lot of bioinformatics-based work).

    Anyway, I've written two scripts for you to look at. I've kept the code simple and commented, so you should be OK with it. Over a whole chromosome this might not be as fast as it could be, but it should be OK.

    Firstly, you won't want to split the sequence into an array of pairs unless you are absolutely sure you aren't going to miss a count when an acid occurs an odd number of times in a row. For example, FAAAD would be split into FA-AA-D or F-AA-AD, so you would only count AA once, whereas it actually contains 2 pairs.
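The FAAAD case can be checked with a short sketch. The two subs below are hypothetical names: one cuts the string into fixed two-character chunks, the other slides a two-character window one position at a time.

```perl
use strict;
use warnings;

# Fixed-width chunks: FAAAD becomes FA, AA, D.
sub chunk_pairs {
    my ($sequence) = @_;
    return unpack '(A2)*', $sequence;
}

# Sliding window: FAAAD becomes FA, AA, AA, AD.
sub sliding_pairs {
    my ($sequence) = @_;
    return map { substr $sequence, $_, 2 } 0 .. length($sequence) - 2;
}

my @chunks = chunk_pairs('FAAAD');
my @window = sliding_pairs('FAAAD');

print "chunks: @chunks (AA seen ", scalar(grep { $_ eq 'AA' } @chunks), " time)\n";
print "window: @window (AA seen ", scalar(grep { $_ eq 'AA' } @window), " times)\n";
```

Only the sliding window sees both AA pairs; the chunked split sees one regardless of where the first cut falls.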

    This is the first script. It finds only AA pairs and counts them. The sequence is ASDTDAAFRASEQSAAAFDG (it's in the code), so the number of AAs should be 3.

    #!/usr/bin/perl -w
    use strict;

    my $string  = "ASDTDAAFRASEQSAAAFDG";
    my $counter = 0;

    # Slide a two-character window along the sequence, one position at a time.
    for (my $i = 0; $i < length($string) - 1; $i++) {
        my $amino = substr($string, $i, 2);
        $counter++ if ($amino eq "AA");
        #print "$amino\n";
    }
    print "number of AA matches = $counter\n";

    This is the second script; it's a bit more complicated. Instead of only counting the number of AAs, it counts all pairs. It creates the hash entries on the fly, and if it encounters a pair it has already seen it just increments the value. Just to note: for the 22 possible amino acids, the number of possible pairs is much higher (22 x 22 = 484).

    #!/usr/bin/perl -w
    use strict;

    my $string = "ASDTDAAFRASEQSAAAFDG";
    my %acids;

    # Count every overlapping two-character window, keyed by the pair itself.
    for (my $i = 0; $i < length($string) - 1; $i++) {
        my $amino = substr($string, $i, 2);
        if (exists $acids{$amino}) {
            $acids{$amino}++;
        } else {
            $acids{$amino} = 1;
        }
        #print "$amino\n";
    }

    print "These are the occurrence of all amino acid pairs including overlaps as separate counts\n";
    foreach my $amino (keys %acids) {
        print "$amino\t$acids{$amino}\n";
    }

    I hope one of these does what you want.

      I have no bioinformatic background, but I'd like to offer a couple of comments on your code, specifically the version that counts overlapping letter pairs (would 'digrams' be an appropriate term for these?).

      my %acids;
      for(my $i = 0; $i < length($string)-1; $i++){
          my $amino = substr($string, $i, 2);
          if(exists $acids{$amino}){
              $acids{$amino}++;
          }else{
              $acids{$amino} = 1;
          }
          #print "$amino\n";
      }

      Because it is not necessary to check for the existence of a hash key before incrementing its value (due to autovivification), the body of this for-loop can be reduced to a single statement:
          ++$acids{ substr $string, $i, 2 }
      This will almost certainly yield a speed benefit.

      Alternatively, in 5.10+ versions of Perl, the entire for-loop can be replaced by a single regex (tested):
          $string =~ m{ (?= (..) (?{ ++$pairs2{$^N} }) (*FAIL)) }xms;
      This may or may not increase speed; you will have to Benchmark this for yourself. The alternate regex
          m{ (?= .. (?{ ++$pairs2{${^MATCH}} }) (*FAIL)) }xmsp
      also works (note the additional  /p regex modifier) and may be slightly faster because no capturing group is used. Again, Benchmark-ing will tell the tale.
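A minimal Benchmark sketch for two of the approaches discussed here (the substr loop and the plain look-ahead capture) might look like the following. The sub names, the test sequence and the iteration count are arbitrary assumptions; the code-block regexes are left out because closing over lexical variables inside `(?{ ... })` needs extra care on older perls.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# substr-based sliding window, relying on autovivification.
sub pairs_substr {
    my ($string) = @_;
    my %pairs;
    ++$pairs{ substr $string, $_, 2 } for 0 .. length($string) - 2;
    return \%pairs;
}

# Look-ahead capture: every position contributes one two-character capture.
sub pairs_lookahead {
    my ($string) = @_;
    my %pairs;
    $pairs{$_}++ for $string =~ /(?=(..))/g;
    return \%pairs;
}

# Arbitrary test data: the thread's sample sequence repeated 50 times.
my $sequence = 'ASDTDAAFRASEQSAAAFDG' x 50;

# 500 iterations is an arbitrary, small count; use a negative number
# (minimum CPU seconds) for more reliable timings.
cmpthese(500, {
    substr    => sub { pairs_substr($sequence) },
    lookahead => sub { pairs_lookahead($sequence) },
});
```

The relative speed will depend on the perl version and the sequence length, so the table this prints is only meaningful for the machine and data it was run on.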

      >perl -wMstrict -le
      "use Test::More tests => 2;
       use Data::Dump;
       ;;
       my $string = 'ABCCCDEAB';
       ;;
       my %pairs1;
       $pairs1{$_}++ for $string =~ /(?=(..))/g;
       ;;
       local our %pairs2;
       $string =~ m{ (?= .. (?{ ++$pairs2{${^MATCH}} }) (*FAIL)) }xmsp;
       ;;
       my %pairs3;
       for (my $i = 0; $i < length($string) - 1; ++$i) {
           ++$pairs3{ substr $string, $i, 2 }
       }
       ;;
       dd \%pairs1, \%pairs2, \%pairs3;
       is_deeply \%pairs1, \%pairs2, '1 & 2, same results';
       is_deeply \%pairs1, \%pairs3, '1 & 3, same results';
      "
      1..2
      (
        { AB => 2, BC => 1, CC => 2, CD => 1, DE => 1, EA => 1 },
        { AB => 2, BC => 1, CC => 2, CD => 1, DE => 1, EA => 1 },
        { AB => 2, BC => 1, CC => 2, CD => 1, DE => 1, EA => 1 },
      )
      ok 1 - 1 & 2, same results
      ok 2 - 1 & 3, same results
      So I went back to the start after posting this and essentially came up with your first option, I'm glad you recommended it as well because I was worried that it was childish/too roundabout. Thanks so much everyone. :)
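For anyone wanting the counts in a matrix like the $FMat of the original question, here is one possible sketch. It is an assumption about the intended layout, not something stated in the thread: rows are the first letter of a pair, columns the second, and the alphabet passed in decides the indices. The sub name `pair_matrix` and the tiny four-letter alphabet are hypothetical.

```perl
use strict;
use warnings;

# Build a matrix where $matrix->[i][j] counts how often letter i is
# immediately followed by letter j (overlapping pairs included).
sub pair_matrix {
    my ($string, @alphabet) = @_;

    my %index;
    @index{@alphabet} = 0 .. $#alphabet;                   # letter => row/column number

    my @matrix = map { [ (0) x @alphabet ] } @alphabet;    # square matrix of zeros

    for my $i (0 .. length($string) - 2) {
        my ($first, $second) = split //, substr($string, $i, 2);
        # Pairs containing a letter outside the alphabet are skipped.
        next unless exists $index{$first} && exists $index{$second};
        $matrix[ $index{$first} ][ $index{$second} ]++;
    }
    return \@matrix;
}

my @alphabet = qw(A D F S);   # hypothetical, deliberately tiny alphabet
my $FMat = pair_matrix('ASDTDAAFRASEQSAAAFDG', @alphabet);

print "AA count: $FMat->[0][0]\n";   # AA count: 3
```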
Re: Using grep in a scalar context
by spandox (Novice) on Feb 06, 2013 at 13:38 UTC
    $FMat[0][0]=scalar(grep(/AA/, @dipept));
      Scalar context is created by the type of the lvalue in assignments. Adding scalar here does nothing - the scalar context is already there.
      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        :) it is unambiguous, and hurts nothing
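A small sketch of how the left-hand side decides the context may help here. The sub name `context_demo` is hypothetical; it just returns what each kind of assignment receives from the same grep.

```perl
use strict;
use warnings;

# Hypothetical helper: run the same grep against different lvalues.
sub context_demo {
    my @dipept = @_;
    my @FMat;

    my $count    = grep { /AA/ } @dipept;            # scalar lvalue: number of matches
    my ($first)  = grep { /AA/ } @dipept;            # list lvalue: first matching element
    $FMat[0][0]  = grep { /AA/ } @dipept;            # an array element is a scalar lvalue too
    my $explicit = scalar(grep { /AA/ } @dipept);    # same value, context spelled out

    return ($count, $first, $FMat[0][0], $explicit);
}

print join(' ', context_demo(qw(xAAx yAAy xyzzy))), "\n";   # 2 xAAx 2 2
```

The parentheses around `$first` are what turn the assignment into a list assignment; without them it would receive the count.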
