PerlMonks  

Howto count elements within an interval

by lomSpace (Scribe)
on Nov 04, 2010 at 22:00 UTC (#869565)
lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I need to count the elements that fall within each 100k range along a chromosome. I understand
how to parse the file and get the columns that I want to work on. What I don't get
is how to count the numeric values that fall in each 100k range. Each value in column
two is to be counted within its 100k range, and I repeat this until I have recorded
the number of column-two values per 100k range.

#!/usr/bin/perl -w
# examine the value. When the value reaches 100k
# print out the count of the word "gene" in the keys for 1..100k range.
# Then increment and perform the same operation for the range 101k..200k.
# Do this until you reach the final 100k increment.
use strict;

#Open the file. Use indexes 2 and 4.
open( my $in, "chr8.txt" );
open( my $out, ">/Users/mgavibrathwaite/Desktop/genecoord.txt");
my %genes_per_100k;
while(<DATA>) {
    next if /\#\#.+/;
    chomp;
    my @fields = split /\t/;
    my ($genes,$gene_end) = ($fields[2], $fields[4]);
    #print $out "$genes\t$gene_end\n";
    #=cut
    if($genes =~ /gene/){
        print $out "$genes\t$gene_end\n";
        #$genes_per_100k{$gene_end};
    }
}
close($in);
close($out);
__DATA__
gene 3936
gene 7591
gene 13082
gene 23200
gene 32518
gene 45123
gene 57330
gene 62384
gene 66839
gene 71715
gene 83427
gene 90948
gene 87510
gene 96042
gene 106380
gene 108247
gene 109395
gene 120121
gene 138410
gene 143225
gene 147455
gene 152452
gene 155580
gene 158939
gene 163483
gene 167583
gene 178450
gene 181546
gene 184301
gene 193505
gene 190880
gene 199431
gene 202844

Re: Howto count elements within an interval
by Anonymous Monk on Nov 04, 2010 at 22:16 UTC

    my $bucketIndex = int(($position-1) / 1e5);
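A small sketch of the same idea with the bucket width passed in as a parameter, to show what happens at the range boundaries. The name `bucket_index` and the `$width` argument are illustrative assumptions, not something from the thread. Subtracting 1 first means position 100000 still lands in bucket 0, which matches the 1..100k ranges described in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is not from the thread): map a 1-based
# position to a zero-based bucket holding $width positions.
sub bucket_index {
    my ($position, $width) = @_;
    return int( ($position - 1) / $width );
}

# Check the edges of the first two 100k ranges plus one sample value.
for my $position (1, 100_000, 100_001, 202_844) {
    printf "%d -> bucket %d\n", $position, bucket_index($position, 100_000);
}
```

If the ranges are meant to be zero-based instead (0..99999, 100000..199999, and so on), drop the `- 1`.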

Re: Howto count elements within an interval
by JavaFan (Canon) on Nov 04, 2010 at 22:21 UTC
    Something like:
    use YAML;

    my %count_per_100k;
    while (<DATA>) {
        my ($text, $count) = split;
        next unless $text =~ /gene/;
        $count_per_100k{int($count / 100_000)}++;
    }
    print Dump \%count_per_100k;
    __DATA__
    ...
    Output:
    ---
    0: 14
    1: 18
    2: 1
    At least, I think that's your question. Perhaps you want to do something different.
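If YAML is not available, the same counting hash can be printed with core Perl only. This is a rough sketch rather than anything posted in the thread: the helper name `count_per_bucket`, the range labels, and the handful of positions (picked from the question's sample data) are all assumptions for illustration. The numeric sort matters because hash keys are strings, so a plain `sort` would put bucket 10 before bucket 2.

```perl
use strict;
use warnings;

# Hypothetical helper: count positions per bucket of $width, using the
# same int($position / $width) hash key as the reply above.
sub count_per_bucket {
    my ($width, @positions) = @_;
    my %count;
    $count{ int($_ / $width) }++ for @positions;
    return \%count;
}

# A few positions taken from the question's sample data.
my @positions = (3936, 96042, 106380, 199431, 202844);
my $count     = count_per_bucket(100_000, @positions);

# Sort bucket numbers numerically and print each as a start-end range.
for my $bucket (sort { $a <=> $b } keys %$count) {
    printf "%d-%d: %d\n",
        $bucket * 100_000, ($bucket + 1) * 100_000 - 1, $count->{$bucket};
}
```

Buckets that contain no values never get a hash key, so they are simply absent from the output.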
      JavaFan,
      That is the answer to my questions. Also, any advice on getting a better
      grip on the power of hashes?

      Thanks!
      LomSpace
Re: Howto count elements within an interval
by aquarium (Curate) on Nov 05, 2010 at 01:08 UTC
    So every number you read adds to the count for its 100k bucket? E.g. seeing 202844 on input adds to the count for the bucket covering the 200k to 300k range.
    You could either write this the old-fashioned way, testing how many times you can take 100k away from the number before what is left drops below 100k, OR the much more fun way: pre-process the numbers by rounding down to the nearest 100k, then drop all the right-hand zeros, and count these (now simple) integers. Every number below 100k becomes an empty string, which evaluates to zero in Perl numeric context, so all of those go to the first bucket, and so on.
    the hardest line to type correctly is: stty erase ^H
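The two approaches described above can be sketched side by side. Both sub names are made up for illustration, and the "old fashioned" version is written as repeated subtraction, which is only one reading of the description. The second version rounds down to a multiple of the width and then divides the zeros away arithmetically instead of stripping them as a string.

```perl
use strict;
use warnings;

# "Old fashioned" way (hypothetical name): count how many whole widths
# can be taken off the position before less than one width is left.
sub bucket_by_subtraction {
    my ($position, $width) = @_;
    my $bucket = 0;
    while ($position >= $width) {
        $position -= $width;
        $bucket++;
    }
    return $bucket;
}

# "Fun" way (hypothetical name): round down to the nearest multiple of
# the width, then divide out the width, which drops the trailing zeros.
sub bucket_by_rounding {
    my ($position, $width) = @_;
    my $rounded = $position - ($position % $width);
    return $rounded / $width;
}

for my $position (3936, 100_000, 202_844) {
    printf "%d -> %d / %d\n", $position,
        bucket_by_subtraction($position, 100_000),
        bucket_by_rounding($position, 100_000);
}
```

Both return the same bucket number for non-negative integer positions. The subtraction loop does work proportional to the bucket number, so the arithmetic version is the better fit for large coordinates.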
      Just for fun:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my %cnt100kRange;
      while (<DATA>) {
          chomp;
          my ($text, $count) = split;
          next unless $text =~ /gene/;
          $cnt100kRange{ substr( sprintf("%06d",$count),0,1) }++;
      }
      foreach my $range (sort keys %cnt100kRange) {
          print "$range: $cnt100kRange{$range}\n";
      }
      __DATA__
      gene 3936
      gene 7591
      gene 13082
      gene 23200
      gene 32518
      gene 45123
      gene 57330
      gene 62384
      gene 66839
      gene 71715
      gene 83427
      gene 90948
      gene 87510
      gene 96042
      gene 106380
      gene 108247
      gene 109395
      gene 120121
      gene 138410
      gene 143225
      gene 147455
      gene 152452
      gene 155580
      gene 158939
      gene 163483
      gene 167583
      gene 178450
      gene 181546
      gene 184301
      gene 193505
      gene 190880
      gene 199431
      gene 202844
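One caveat with the string trick: taking only the first character of a six-digit string assumes every position is below 1,000,000. A possible variation, sketched here under that observation, drops the last five digits instead, so longer coordinates still map to the right 100k block. The name `bucket_key` is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical variation on the string trick: zero-pad to at least six
# digits, then cut off the last five, so the key is the number of whole
# 100k blocks below the position, however many digits it has.
sub bucket_key {
    my ($position) = @_;
    return substr( sprintf("%06d", $position), 0, -5 );
}

for my $position (3936, 202_844, 1_234_567) {
    printf "%d -> %s\n", $position, bucket_key($position);
}
```

Once keys can have more than one digit, sort them numerically (`sort { $a <=> $b }`) rather than with the default string sort.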
        Bibliophile,
        That is the answer to my questions. Also, any advice on getting a better
        grip on the power of hashes?

        Thanks!
        LomSpace
