build a distribution

Grig has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: build a distribution by roboticus (Chancellor) on Aug 07, 2010 at 13:11 UTC
Grig: Sorry, your requirements aren't clear enough for a simple answer. First, I can't tell what your data file looks like because you didn't use code tags (<c>`insert code here`</c>). I can't tell if it's a single line of numbers, or a single number per line, or doubles, triples, ... Second, you don't specify what distribution(s) you're interested in, nor how to partition your bins. I'm not even certain of whether you're trying to generate some fake data for testing, or process data in some way. While I could make various guesses, it's doubtful that it would be helpful to you, and many monks here don't want to spend time on something that won't be of any use. Update your node a bit, clarify your question and requirements, and you should get some helpful results. It would be best if you try to code something up, and show us where you're having trouble. The more effort you put into your question, the better we can assist. ...roboticus
Re^2: build a distribution by Grig (Novice) on Aug 07, 2010 at 14:11 UTC
Dear roboticus, Thank you for your helpful remarks. I have inserted code tags. I'll try to clarify my task. I would like to separate the scored lengths into certain intervals. For example, for the following data I would like to count the number of items that are less than 10, then the number of items between 10 and 20, 20 and 30, 30 and 40, and so on. Actually I need to build several distributions with different degrees of detail, so the interval length need not be 10; it might be 2, 6, 12 and so on.
`3 3 5 7 8 8 12 13 15 16 20 25 34 34 31 38 40 40`
The actual output should be something like this:
`0-10 6 items
10-20 5 items
20-30 1 item
30-40 6 items`
Thank you once more.
Re^3: build a distribution by roboticus (Chancellor) on Aug 07, 2010 at 14:28 UTC
Grig: OK, then the way I'd approach the task would be something like this:
`my %bins;
open my $INF, '<', $FileName or die $!;
while (<$INF>) {
    chomp;
    $bins{get_bin($_)}++;
}
printf "%-6.6s %u items\n", $_, $bins{$_} for sort keys %bins;

sub get_bin {
    # Determine the name of the bin to put the value into
    my $val = shift;
    my $bin_min = 10 * int($val / 10);   # lower edge of the bin holding $val
    my $bin_max = $bin_min + 10;
    return "$bin_min-$bin_max";
}`
You'll want to wrap in some error checking, testing, as well as any options you want... ...roboticus
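A note for later readers: the same idea can take the bin width as a parameter, which is what the question about intervals of 2, 6, 12 and so on needs. The following is only a minimal sketch under assumptions; `bin_floor` is a hypothetical helper name, the data is the sample posted earlier in the thread, and values are assumed to be non-negative. Keying the hash on the numeric lower edge also lets the bins be sorted numerically rather than as strings.

```perl
use strict;
use warnings;

# Hypothetical helper: lower edge of the half-open bin [lo, lo + width)
# that holds $val. Assumes $val >= 0 and $width > 0.
sub bin_floor {
    my ($val, $width) = @_;
    return $width * int($val / $width);
}

# Sample data from the thread; the width is a free parameter.
my @data  = qw(3 3 5 7 8 8 12 13 15 16 20 25 34 34 31 38 40 40);
my $width = 10;

my %bins;
$bins{ bin_floor($_, $width) }++ for @data;

# Sort on the numeric lower edge so that e.g. 100 comes after 20.
for my $lo (sort { $a <=> $b } keys %bins) {
    printf "%d-%d %d items\n", $lo, $lo + $width, $bins{$lo};
}
```

With half-open bins a value that sits exactly on a boundary (20 or 40 in the sample) is counted in the upper bin, so the counts differ from the sample output in the question, which appears to count boundary values in the lower bin. Which convention is wanted is a choice for the original poster.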
Re^4: build a distribution by Grig (Novice) on Aug 07, 2010 at 18:29 UTC
Re^5: build a distribution by roboticus (Chancellor) on Aug 07, 2010 at 20:38 UTC
Re^3: build a distribution by toolic (Bishop) on Aug 07, 2010 at 14:29 UTC
It looks like you want a histogram generator. There are several available here at the Monastery. A Super Search, where title contains "histogram", yields (click on the link, then hit 'Search'): ?node_id=3989;HIT=histogram;re=N Perhaps you could adapt the code from one of the following: "histogram", "Simple Text Histogram".
Re: build a distribution by BrowserUk (Patriarch) on Aug 07, 2010 at 17:11 UTC
You might find something like this useful. It plots line graphs of the sums at each interval, which might help you select the best one for your purpose. The graphs produced using random data are pretty uninspiring, but they serve their purpose. (The image is large; you'll need to scroll or scale it.)
`#! perl -slw
use strict;
use GD;
use List::Util qw[ sum ];

sub rgb2n { unpack 'N', pack 'CCCC', 0, @_ }

## Gen some data
my %counts;
++$counts{ int( rand( 5e3 ) + rand( 5e3 ) ) } for 1 .. 1e6;
my @keys = sort{ $a <=> $b } keys %counts;

my $gd = GD::Image->new( 10000, 2000, 1 );

for my $step ( reverse 1.. 10 ) {
    my $start = 0;
    my $last = 0;
    my $clr = 2**24 / $step;
    for( my $end = $start+$step-1; $end < $keys[ -1 ]; $end += $step ) {
        my $sum = sum( @counts{ grep( defined $counts{ $_ }, $start..$end ) } ) // 0;
#        printf "%4d - %4d : %d\n", $start, $end, $sum;
        $gd->line( $start, 2000-$last, $end, 2000-$sum, $clr ) if $last;
        $start = $end + 1;
        $last = $sum;
    }
}

open IMG, '>:raw', 'junk14.png' or die $!;
print IMG $gd->png;
close IMG;

## Display the graph in the default image viewer
system 'junk14.png';`
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy
Re^2: build a distribution by sflitman (Hermit) on Aug 07, 2010 at 22:31 UTC
Very elegant! I ran the code on WinXP ActiveState perl and initially got a black image, but I called rgb2n on $clr and then it worked. SSF
Re^3: build a distribution by BrowserUk (Patriarch) on Aug 07, 2010 at 23:20 UTC
"I called rgb2n on $clr and then it worked."
Hm. That is weird. And I mean really weird! `$clr` is (already) a number in the range 0 .. 2**24: `my $clr = 2**24 / $step;`, where $step iterates from 10 .. 1. `rgb2n()` expects input of 3 numbers in the range 0 .. 255, which it then converts to a number in the range 0 .. 2**24. So, besides the crapload of warnings that should have been issued, half the lines on the graph would have been drawn in black on black:
`c:\test>p1
[0] Perl> sub rgb2n { unpack 'N', pack 'CCCC', 0, @_ } ;;
[0] Perl> print rgb2n( 2**24 / $_ ) for reverse 1 .. 10;;
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
10027008
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
13041664
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
0
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
4784128
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
11141120
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
3342336
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
0
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
5570560
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
0
Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
0`
The only possible explanation I can hazard for this is that you have a very (very(very)) old version of GD installed that only supports 8-bit color? Like pre-v2.0. The libgd history doesn't list dates, but from memory, it must be at least 5 years ago. I provided a patch for 2.3x, and that was at least 4 years ago.
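For readers following the colour discussion, here is a small sketch of what `rgb2n` from the post above does when it is given the three components it expects. It assumes nothing beyond core Perl; the colour names in the output are only labels chosen for illustration. `pack 'CCCC'` writes four bytes (a leading zero, then red, green, blue), and `unpack 'N'` reads them back as one big-endian 32-bit integer, so the result has the form 0x00RRGGBB.

```perl
use strict;
use warnings;

# Same one-liner as in the thread: three 0..255 components in,
# one integer of the form 0x00RRGGBB out.
sub rgb2n { unpack 'N', pack 'CCCC', 0, @_ }

my %sample = (
    red   => [ 255,   0,   0 ],
    green => [   0, 255,   0 ],
    blue  => [   0,   0, 255 ],
);

for my $name (sort keys %sample) {
    printf "%-5s 0x%06X\n", $name, rgb2n( @{ $sample{$name} } );
}
```

Passing a single large number instead of three components is a different matter: the `C` format keeps only the low 8 bits of it (with a warning under `-w`), which is consistent with the wrapped values shown in the session above.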
Re: build a distribution by jethro (Monsignor) on Aug 07, 2010 at 13:16 UTC
Depends. Does your data still fit into memory? Do you need the distributions more than once or do you just stuff them into some statistics package or chart generator? Do you need the individual numbers in a bin or only the count? Do you know the different distributions you want beforehand or do you want to change the lengths interactively? Do you need more than one distribution simultaneously or is only the last distribution relevant? The following code deals with the simplest case:
`sub finddistribution {
    my ($length, $numref)= @_;
    my @counts;
    foreach my $num (@$numref) {
        # the array index is truncated to an integer, so this is the bin number
        $counts[$num/$length]++;
    }
    return @counts;
}
...
my @dezimaldistri= finddistribution(10,\@nums);`
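As a rough illustration of how the returned array might be used, the sketch below runs the same function over several bin lengths, coarse first and then finer, and prints each bin with its range. It is only a sketch under assumptions: the data is the sample from earlier in the thread, the lengths 20, 10 and 5 are arbitrary, values are assumed non-negative, and the `//` operator needs Perl 5.10 or later. The one detail worth noticing is that bins which never received a value are `undef` in the array, so they are defaulted to 0 when printing.

```perl
use strict;
use warnings;

# Same approach as finddistribution above: index = int(value / bin length).
sub finddistribution {
    my ($length, $numref) = @_;
    my @counts;
    $counts[ int($_ / $length) ]++ for @$numref;
    return @counts;
}

my @nums = qw(3 3 5 7 8 8 12 13 15 16 20 25 34 34 31 38 40 40);

# Several bin lengths in turn (these particular lengths are arbitrary).
for my $length (20, 10, 5) {
    my @counts = finddistribution($length, \@nums);
    print "bin length $length\n";
    for my $i (0 .. $#counts) {
        # Bins that received no values are undef, so default them to 0.
        printf "  %2d-%-2d %d\n",
            $i * $length, ($i + 1) * $length, $counts[$i] // 0;
    }
}
```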
Re^2: build a distribution by Grig (Novice) on Aug 07, 2010 at 14:28 UTC
Dear jethro, I'll try to answer your questions: 1) I do indeed need the distributions more than once. 2) The individual numbers are not required; only the count matters. 3) Unfortunately I don't know all the interval lengths beforehand. First of all I would like to get a more general distribution with quite a large interval, just to see the whole picture, and then divide it into smaller bins afterwards. 4) I would prefer to get only one distribution at a time, analyse it carefully, and then set another bin length if necessary. Thank you!
Re: build a distribution by Anonymous Monk on Aug 10, 2010 at 04:46 UTC
May I suggest Math::GSL, in particular the Histogram module. It seems like what you want: http://search.cpan.org/~leto/Math-GSL-0.22/lib/Math/GSL/Histogram.pm

