PerlMonks  

build a distribution

by Grig (Novice)
on Aug 07, 2010 at 11:58 UTC

Grig has asked for the wisdom of the Perl Monks concerning the following question:

Hello dear perl monks!

It might be a rather simple question, but it is quite complicated for me as I started to study Perl only this morning. In fact I am a molecular biologist. The question itself is the following:

I have a huge number of scored distances between two types of genes. A small part of it looks like this:

2483 2490 2494 2496 2500 2501 2508 2517 2518 2519 2527 2530 2541 2541 2542 2555 2557 2561 2562 2565 2572 2572 2575 2582 2585 2588 2589 2597 2598 2603 2604 2608 2611 2620 2632 2642 2643 2645 2647 2649 2651 2659 2661 2664 2667 2669 2670 2673 2675 2677

I would like to build several distributions of these lengths with different bins. So, what would be the proper way to divide these data into certain intervals and to count the number of elements in each interval using Perl?

I will be very grateful if you could help.

Replies are listed 'Best First'.
Re: build a distribution
by roboticus (Chancellor) on Aug 07, 2010 at 13:11 UTC

    Grig:

    Sorry, your requirements aren't clear enough for a simple answer. First, I can't tell what your data file looks like because you didn't use code tags (<c>insert code here</c>). I can't tell if it's a single line of numbers, or a single number per line, or doubles, triples, ...

    Second, you don't specify what distribution(s) you're interested in, nor how to partition your bins. I'm not even certain of whether you're trying to generate some fake data for testing, or process data in some way.

    While I could make various guesses, it's doubtful that it would be helpful to you, and many monks here don't want to spend time on something that won't be of any use. Update your node a bit, clarify your question and requirements, and you should get some helpful results. It would be best if you try to code something up, and show us where you're having trouble. The more effort you put into your question, the better we can assist.

    ...roboticus

      Dear roboticus,

      Thank you for your helpful remarks. I have inserted code tags.

      I'll try to clarify my task. I would like to separate the scored lengths into certain intervals. For example, for the following data I would like to count the number of items that are less than 10, then the number of items between 10 and 20, 20 and 30, 30 and 40 and so on. Actually I need to build several distributions with different levels of detail. So besides 10, the interval length might be 2, 6, 12 and so on.
      3 3 5 7 8 8 12 13 15 16 20 25 34 34 31 38 40 40

      the actual output should be something like this:

      0-10 6 items
      10-20 5 items
      20-30 1 item
      30-40 6 items
      Thank you once more.

        Grig:

        OK, then the way I'd approach the task would be something like this:

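        my %bins;
        open my $INF, '<', $FileName or die $!;
        while (<$INF>) {
            chomp;
            # Assumes one value per line
            $bins{get_bin($_)}++;
        }

        # Sort the bins numerically by their lower bound
        printf "%-9s %u items\n", $_, $bins{$_}
            for sort { (split /-/, $a)[0] <=> (split /-/, $b)[0] } keys %bins;

        sub get_bin {
            # Determine the name of the bin to put the value into
            my $val = shift;
            # Round down to a multiple of 10, so that 25 lands in "20-30"
            my $bin_min = 10 * int($val / 10);
            my $bin_max = $bin_min + 10;
            return "$bin_min-$bin_max";
        }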

        You'll want to wrap in some error checking, testing, as well as any options you want...

        ...roboticus
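The approach above can be generalised to the other interval lengths mentioned in the question (2, 6, 12 and so on). What follows is only a minimal sketch: `bin_counts` is a hypothetical helper name that does not appear in the thread, the values are assumed to be non-negative, and a value that sits exactly on a boundary (such as 20) is counted in the bin that starts there.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count values per bin of a given width.
# Returns a hash that maps each bin's lower bound to the number of values in it.
# Assumes non-negative values, because int() truncates toward zero.
sub bin_counts {
    my ($width, @values) = @_;
    die "bin width must be positive\n" unless $width > 0;
    my %count;
    $count{ $width * int($_ / $width) }++ for @values;
    return %count;
}

# Sample data taken from the question
my @data = qw(3 3 5 7 8 8 12 13 15 16 20 25 34 34 31 38 40 40);

for my $width (10, 6) {
    my %count = bin_counts($width, @data);
    print "bin width $width\n";
    # Numeric sort keeps 100-110 after 20-30
    for my $lo (sort { $a <=> $b } keys %count) {
        printf "%d-%d %d items\n", $lo, $lo + $width, $count{$lo};
    }
}
```

With a bin width of 10, the sample data gives 6, 4, 2, 4 and 2 items for the bins starting at 0, 10, 20, 30 and 40. Keying the hash on the numeric lower bound, rather than on a "min-max" string, is a design choice that makes the numeric sort straightforward.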

Re: build a distribution
by BrowserUk (Patriarch) on Aug 07, 2010 at 17:11 UTC

    You might find something like this useful. It plots line graphs of the sums at each interval, which might help you select the best one for your purpose. The graphs produced(*) using random data are pretty uninspiring, but they serve their purpose.

    Large; you'll need to scroll or scale

    #! perl -slw
    use strict;
    use GD;
    use List::Util qw[ sum ];

    sub rgb2n { unpack 'N', pack 'CCCC', 0, @_ }

    ## Gen some data
    my %counts;
    ++$counts{ int( rand( 5e3 ) + rand( 5e3 ) ) } for 1 .. 1e6;

    my @keys = sort{ $a <=> $b } keys %counts;

    my $gd = GD::Image->new( 10000, 2000, 1 );

    for my $step ( reverse 1.. 10 ) {
        my $start = 0;
        my $last = 0;
        my $clr = 2**24 / $step;
        for( my $end = $start+$step-1; $end < $keys[ -1 ]; $end += $step ) {
            my $sum = sum( @counts{ grep( defined $counts{ $_ }, $start..$end ) } ) // 0;
    #        printf "%4d - %4d : %d\n", $start, $end, $sum;
            $gd->line( $start, 2000-$last, $end, 2000-$sum, $clr ) if $last;
            $start = $end + 1;
            $last = $sum;
        }
    }

    open IMG, '>:raw', 'junk14.png' or die $!;
    print IMG $gd->png;
    close IMG;

    ## Display the graph in the default image viewer
    system 'junk14.png';

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Very elegant! I ran the code on WinXP ActiveState perl and initially got a black image, but I called rgb2n on $clr and then it worked.

      SSF

        I called rgb2n on $clr and then it worked.

        Hm. That is weird. And I mean re-ally weird!

        $clr is (already) a number in the range 0 .. 2**24: my $clr = 2**24 / $step;, where $step iterates from 10 .. 1.

        rgb2n() expects input of 3 numbers in the range 0 .. 255, which it then converts to a number in the range 0 .. 2**24. So, besides the crapload of warnings that should have been issued, half the lines on the graph would have been drawn in black on black:

        c:\test>p1
        [0] Perl> sub rgb2n { unpack 'N', pack 'CCCC', 0, @_ } ;;

        [0] Perl> print rgb2n( 2**24 / $_ ) for reverse 1 .. 10;;
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        10027008
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        13041664
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        0
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        4784128
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        11141120
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        3342336
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        0
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        5570560
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        0
        Character in 'C' format wrapped in pack at (eval 6) line 1, <STDIN> line 3.
        0

        The only explanation I can hazard for this is that you have a very (very(very)) old version of GD installed that only supports 8-bit color?

        Like pre-v2.0. The libgd history doesn't list dates, but from memory, it must be at least 5 years ago. I provided a patch for 2.3x, and that was at least 4 years ago.
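To make the intended use of rgb2n() concrete, here is a small sketch that reuses the one-line helper from the script above and feeds it three channel values in the range 0 .. 255. The particular colours are illustrative choices, not something taken from the thread. The last call passes 153, which is the low byte of int(2**24 / 10), as the red channel; this reproduces the first value in the transcript above.

```perl
use strict;
use warnings;

# Same helper as in the script above: pack a zero byte plus R, G, B into
# four bytes, then read them back as one big-endian 32-bit integer.
sub rgb2n { unpack 'N', pack 'CCCC', 0, @_ }

# Illustrative colours; each channel must already be in 0 .. 255.
for my $rgb ([255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 255], [153, 0, 0]) {
    my $n = rgb2n(@$rgb);
    printf "rgb(%3d,%3d,%3d) -> %8d (0x%06X)\n", @$rgb, $n, $n;
}
```

Pure red comes out as 16711680 (0xFF0000) and white as 16777215 (0xFFFFFF), so every result fits in 24 bits. The [153, 0, 0] case yields 10027008: when a single large number is passed instead of three channels, only its low 8 bits survive and they end up in the red position.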


Re: build a distribution
by jethro (Monsignor) on Aug 07, 2010 at 13:16 UTC

    Depends. Does your data still fit into memory? Do you need the distributions more than once or do you just stuff them into some statistics package or chart generator? Do you need the individual numbers in a bin or only the count? Do you know the different distributions you want beforehand or do you want to change the lengths interactively? Do you need more than one distribution simultaneously or is only the last distribution relevant?

    The following code deals with the simplest case:

    sub finddistribution {
        my ($length, $numref)= @_;
        my @counts;

        foreach my $num (@$numref) {
            $counts[$num/$length]++;
        }
        return @counts;
    }

    ...

    my @dezimaldistri= finddistribution(10,\@nums);
      Dear jethro,

      I'll try to answer your questions:

      1) I indeed need the distributions more than once.

      2) The individual numbers are not required, only the count matters.

      3) Unfortunately I don't know all the interval lengths beforehand. First of all I would like to get a more general distribution with a quite large interval, just to see the whole picture, and then divide it into smaller bins afterwards.

      4) I would prefer to get only one distribution at a time, to analyse it carefully and then set another bin length if it is necessary.

      Thank you!
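Given these answers (counts only, one distribution at a time, coarse bins first and finer ones later), one possible workflow is to read the numbers into memory once and re-bin them as often as needed. The sketch below is an assumption-laden illustration rather than anything posted in the thread: `read_values` and `distribution` are hypothetical names, the counting step follows the style of finddistribution above, the input is assumed to be non-negative numbers separated by whitespace, and an in-memory string stands in for the real data file.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every whitespace-separated number from a
# filehandle, whether the file holds one value per line or many.
sub read_values {
    my ($fh) = @_;
    my @values;
    while (my $line = <$fh>) {
        push @values, grep { /^\d+(?:\.\d+)?$/ } split ' ', $line;
    }
    return @values;
}

# Counting step in the style of finddistribution: the array index is the
# bin number, and bins that received no value are reported as 0.
sub distribution {
    my ($length, $numref) = @_;
    my @counts;
    $counts[ int($_ / $length) ]++ for @$numref;
    return map { defined $_ ? $_ : 0 } @counts;
}

# In-memory stand-in for the real data file (sample values from the thread).
my $sample = "3 3 5 7 8 8 12 13 15\n16 20 25 34 34 31 38 40 40\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @nums = read_values($fh);
close $fh;

# Coarse view first, then finer bins, without re-reading the data.
for my $length (20, 10, 2) {
    my @counts = distribution($length, \@nums);
    print "bin length $length\n";
    printf "%d-%d %d items\n", $_ * $length, ($_ + 1) * $length, $counts[$_]
        for 0 .. $#counts;
}
```

The list of bin lengths is hard-coded here for brevity; it could just as well come from the command line or from a prompt, so that each distribution is inspected before the next length is chosen.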
Re: build a distribution
by Anonymous Monk on Aug 10, 2010 at 04:46 UTC
