Grig has asked for the wisdom of the Perl Monks concerning the following question:
Hello dear perl monks!
It might be a rather simple question, but it is quite complicated for me, as I only started studying Perl this morning. I am in fact a molecular biologist.
The question itself is the following:
I have a huge number of scored distances between two types of genes. A small part of it looks like this:
2483
2490
2494
2496
2500
2501
2508
2517
2518
2519
2527
2530
2541
2541
2542
2555
2557
2561
2562
2565
2572
2572
2575
2582
2585
2588
2589
2597
2598
2603
2604
2608
2611
2620
2632
2642
2643
2645
2647
2649
2651
2659
2661
2664
2667
2669
2670
2673
2675
2677
I would like to build several distributions of these lengths with different bins. So, what would be the proper way to divide these data into intervals and to count the number of elements in each interval using Perl?
I will be very grateful if you could help.
Re: build a distribution
by roboticus (Chancellor) on Aug 07, 2010 at 13:11 UTC
Grig:
Sorry, your requirements aren't clear enough for a simple answer. First, I can't tell what your data file looks like because you didn't use code tags (<c>insert code here</c>). I can't tell if it's a single line of numbers, or a single number per line, or doubles, triples, ...
Second, you don't specify what distribution(s) you're interested in, nor how to partition your bins. I'm not even certain of whether you're trying to generate some fake data for testing, or process data in some way.
While I could make various guesses, it's doubtful that it would be helpful to you, and many monks here don't want to spend time on something that won't be of any use. Update your node a bit, clarify your question and requirements, and you should get some helpful results. It would be best if you try to code something up, and show us where you're having trouble. The more effort you put into your question, the better we can assist.
...roboticus
3
3
5
7
8
8
12
13
15
16
20
25
34
34
31
38
40
40
the actual output should be something like this:
0-10 6 items
10-20 5 items
20-30 1 item
30-40 6 items
Thank you once more.
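# Assumption: the data file (one number per line) is the first argument
my $FileName = shift or die "usage: $0 datafile\n";
my %bins;
open my $INF, '<', $FileName or die $!;
while (<$INF>) {
    chomp;
    $bins{get_bin($_)}++;
}
# Sort on each bin's numeric lower bound; a plain string sort
# would put "100-110" before "20-30"
printf "%-9s %u items\n", $_, $bins{$_}
    for sort { ($a =~ /^(\d+)/)[0] <=> ($b =~ /^(\d+)/)[0] } keys %bins;
sub get_bin {
    # Determine the name of the bin to put the value into
    my $val = shift;
    # Round down to a multiple of 10 to get the lower bound
    my $bin_min = 10 * int($val / 10);
    my $bin_max = $bin_min + 10;
    return "$bin_min-$bin_max";
}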
You'll want to wrap in some error checking, testing, as well as any options you want...
...roboticus
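The expected output in the question counts 20 in the 10-20 bin and 40 in the 30-40 bin, which suggests intervals that are closed on the right, (lo, hi]. A minimal sketch of that convention follows, assuming non-negative values; the helper name bin_counts_right_closed is made up for illustration and is not part of the code above.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: count values into bins of $width, where each bin
# is the right-closed interval (lo, hi]. Assumes non-negative values.
sub bin_counts_right_closed {
    my ($width, @values) = @_;
    my %count;
    for my $v (@values) {
        # ceil($v / $width) - 1 is the bin index; clamp so 0 falls in the first bin
        my $idx = ceil($v / $width) - 1;
        $idx = 0 if $idx < 0;
        $count{ $idx * $width }++;    # key is the bin's lower bound
    }
    return %count;
}

my @sample = (3, 3, 5, 7, 8, 8, 12, 13, 15, 16, 20, 25, 34, 34, 31, 38, 40, 40);
my %count  = bin_counts_right_closed(10, @sample);
for my $lo (sort { $a <=> $b } keys %count) {
    printf "%d-%d %d %s\n", $lo, $lo + 10, $count{$lo},
        $count{$lo} == 1 ? 'item' : 'items';
}
```

With the sample data from the question this prints the four rows of the expected output (6, 5, 1 and 6 items). Binning with int($val / 10) instead gives half-open intervals [lo, hi), which would move 20 into 20-30 and the two 40s into a 40-50 bin.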
Re: build a distribution
by BrowserUk (Patriarch) on Aug 07, 2010 at 17:11 UTC
You might find something like this useful. It plots line graphs of the sums at each interval, which might help you select the best one for your purpose. The graph produced(*) from random data is pretty uninspiring, but it serves its purpose.
Large; you'll need to scroll or scale
#! perl -slw
use strict;
use GD;
use List::Util qw[ sum ];
sub rgb2n { unpack 'N', pack 'CCCC', 0, @_ }
## Gen some data
my %counts;
++$counts{ int( rand( 5e3 ) + rand( 5e3 ) ) } for 1 .. 1e6;
my @keys = sort{ $a <=> $b } keys %counts;
my $gd = GD::Image->new( 10000, 2000, 1 );
for my $step ( reverse 1.. 10 ) {
    my $start = 0;
    my $last = 0;
    my $clr = 2**24 / $step;
    for( my $end = $start+$step-1; $end < $keys[ -1 ]; $end += $step ) {
        my $sum = sum( @counts{ grep( defined $counts{ $_ }, $start..$end ) } ) // 0;
        # printf "%4d - %4d : %d\n", $start, $end, $sum;
        $gd->line( $start, 2000-$last, $end, 2000-$sum, $clr ) if $last;
        $start = $end + 1;
        $last = $sum;
    }
}
open IMG, '>:raw', 'junk14.png' or die $!;
print IMG $gd->png;
close IMG;
## Display the graph in the default image viewer
system 'junk14.png';
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Very elegant! I ran the code on WinXP ActiveState perl and initially got a black image, but I called rgb2n on $clr and then it worked.
SSF
I called rgb2n on $clr and then it worked.
Hm. That is weird. And I mean re-ally weird!
$clr is (already) a number in the range 0 .. 2**24: my $clr = 2**24 / $step;, where $step iterates from 10 .. 1.
rgb2n() expects input of 3 numbers in the range 0 .. 255, which it then converts to a number in the range 0 .. 2**24. So, besides the crapload of warnings that should have been issued, half the lines on the graph would have been drawn in black on black:
c:\test>p1
[0] Perl> sub rgb2n { unpack 'N', pack 'CCCC', 0, @_ };;
[0] Perl> print rgb2n( 2**24 / $_ ) for reverse 1 .. 10;;
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
10027008
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
13041664
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
0
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
4784128
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
11141120
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
3342336
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
0
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
5570560
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
0
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
0
The only explanation I can hazard for this is that you have a very (very(very)) old version of GD installed that only supports 8-bit color.
Like pre-v2.0. The libgd history doesn't list dates, but from memory, that must have been at least 5 years ago. I provided a patch for 2.3x, and that was at least 4 years ago.
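For reference, here is a minimal sketch of what rgb2n does when it is given the three components it expects. The inverse helper n2rgb is hypothetical, added only to make the round trip visible; it is not part of the posted script.

```perl
use strict;
use warnings;

# Same packing trick as rgb2n in the thread: a zero byte followed by three
# 0..255 components, read back as one big-endian 32-bit integer (0x00RRGGBB).
sub rgb2n { unpack 'N', pack 'CCCC', 0, @_ }

# Hypothetical inverse: split a 24-bit value back into (R, G, B).
sub n2rgb { unpack 'xCCC', pack 'N', shift }

printf "%06x\n", rgb2n(255, 128, 0);           # prints ff8000
print join(',', n2rgb(0x123456)), "\n";        # prints 18,52,86
```

Passing a single large number, as in the transcript above, fills only the red byte (modulo 256, hence the warnings) and leaves green and blue at 0.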
Re: build a distribution
by jethro (Monsignor) on Aug 07, 2010 at 13:16 UTC
Depends. Does your data still fit into memory? Do you need the distributions more than once, or do you just stuff them into some statistics package or chart generator? Do you need the individual numbers in a bin or only the count? Do you know the different distributions you want beforehand, or do you want to change the lengths interactively? Do you need more than one distribution simultaneously, or is only the last distribution relevant?
The following code deals with the simplest case:
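sub finddistribution {
    my ($length, $numref) = @_;
    my @counts;
    foreach my $num (@$numref) {
        # int() makes the truncation to a bin index explicit
        $counts[ int($num / $length) ]++;
    }
    return @counts;    # bins that received no values are undef, not 0
}
...
my @dezimaldistri = finddistribution(10, \@nums);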
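The array returned by finddistribution is indexed by bin number, so for values in the thousands most leading entries are undef. A minimal sketch of turning it into labeled rows follows; distribution_rows is a hypothetical helper, and the input is a short excerpt of the data from the question.

```perl
use strict;
use warnings;

# Same idea as finddistribution above: index = int(value / bin length).
sub finddistribution {
    my ($length, $numref) = @_;
    my @counts;
    $counts[ int($_ / $length) ]++ for @$numref;
    return @counts;
}

# Hypothetical formatter: one "lo-hi count" row per non-empty bin.
sub distribution_rows {
    my ($length, @counts) = @_;
    return map  { sprintf '%d-%d %d', $_ * $length, ($_ + 1) * $length, $counts[$_] }
           grep { defined $counts[$_] } 0 .. $#counts;
}

my @nums = (2483, 2490, 2494, 2496, 2500, 2501, 2508);
print "$_\n" for distribution_rows(10, finddistribution(10, \@nums));
```

Skipping the undef entries keeps the output to the bins that actually hold data; printing them as 0 instead would show the gaps in the distribution.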
Dear jethro,
I'll try to answer your questions:
1) I indeed need the distributions more than once.
2) The individual numbers are not required; only the count matters.
3) Unfortunately, I don't know all the interval lengths beforehand. First I would like to get a more general distribution with a fairly large interval, just to see the whole picture, and then divide it into smaller bins afterwards.
4) I would prefer to get only one distribution at a time, analyse it carefully, and then set another bin length if necessary.
Thank you!
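Given those answers (counts only, a coarse view first, then finer bins), one possible approach is to load the numbers once and re-bin them with each width in turn. This is only a sketch: the helper name histogram, the widths 100 and 50, and the short excerpt of distances are chosen for illustration, and non-negative values are assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: histogram of @$values with half-open bins [lo, lo + width),
# keyed by each bin's lower bound. Assumes non-negative values.
sub histogram {
    my ($width, $values) = @_;
    my %count;
    $count{ $width * int($_ / $width) }++ for @$values;
    return \%count;
}

# A short excerpt of the distances from the question
my @distances = (2483, 2490, 2517, 2541, 2541, 2598, 2603, 2677);

# Coarse bins first, then finer ones, over the same in-memory data
for my $width (100, 50) {
    my $h = histogram($width, \@distances);
    print "bin width $width\n";
    printf "  %d-%d %d\n", $_, $_ + $width, $h->{$_}
        for sort { $a <=> $b } keys %$h;
}
```

Because only the counts are kept, each pass needs memory proportional to the number of non-empty bins rather than to the size of the data.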
Re: build a distribution
by Anonymous Monk on Aug 10, 2010 at 04:46 UTC