
Re^2: extract values from a field that are consecutive and within one standard deviation of each other

by Myrddin Wyllt (Hermit)
on Jul 15, 2015 at 16:56 UTC


in reply to Re: extract values from a field that are consecutive and within one standard deviation of each other
in thread extract values from a field that are consecutive and within one standard deviation of each other

This produces blocks in which each pair of consecutive values is within range, but not necessarily blocks in which all values are within range of each other. For example, the second block in the output ranges from 1 to 2 against a required maximum deviation of 0.5.

To produce the desired result, you need to keep track of the maximum and minimum values within the candidate set, and test against these.
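As a small illustration of the difference, the sketch below runs both tests on the values 1, 1.5, 1.9 and 2 from the fourth column of the sample data. The variable names are my own and not taken from either post; it only shows that a pairwise check can pass where a maximum/minimum check fails.

```perl
use strict;
use warnings;
use List::Util qw(min max);

my $deviation = 0.5;                 # same tolerance as in the discussion
my @values    = (1, 1.5, 1.9, 2);    # column 4 of four consecutive sample rows

# Check 1: compare each value with its immediate predecessor only
my $pairwise_ok = 1;
for my $i (1 .. $#values) {
    $pairwise_ok = 0 if abs($values[$i] - $values[$i - 1]) > $deviation;
}

# Check 2: compare the spread of the whole set (maximum minus minimum)
my $spread_ok = (max(@values) - min(@values)) <= $deviation ? 1 : 0;

print "pairwise check: $pairwise_ok\n";   # prints "pairwise check: 1"
print "spread check:   $spread_ok\n";     # prints "spread check:   0"
```

The pairwise check accepts the set because no single step exceeds 0.5, while the spread check rejects it because 2 - 1 is greater than 0.5.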

Also, if you drop the whole candidate set as soon as the latest value doesn't fit, you will miss sequences that start in the middle of the current set and exclude its early values. Instead, add the latest value to the end and then chop values off the beginning until you are back to a qualifying set. That set may be just the latest value, but it could be larger.
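The trimming step on its own can be sketched as follows. This is a simplified, iterative version that works on plain numbers rather than whole records; slide_window is a hypothetical helper name, and recomputing the maximum and minimum with List::Util on every pass is a convenience for the sketch, not a tuned approach.

```perl
use strict;
use warnings;
use List::Util qw(min max);

my $deviation = 0.5;

# Hypothetical helper: append $new to @$window, then drop values from the
# front until (max - min) of what remains is within $deviation.
# A one-element window has a spread of 0, so the loop always terminates.
sub slide_window {
    my ($window, $new) = @_;
    push @$window, $new;
    shift @$window while max(@$window) - min(@$window) > $deviation;
    return $window;
}

my @window = (30.5, 30.0, 30.0, 30.5);
slide_window(\@window, 30.6);
print "@window\n";    # prints "30.5 30.6"
```

Here 30.6 does not fit with the 30.0 values, so the front of the window is dropped until only 30.5 and 30.6 remain, and later values can then extend that shorter set.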

The following code shows one way to do this. It uses the same modified block size and deviation as tangent's example, and adds some more data to illustrate overlapping qualifying sets:

use strict;
use warnings;

# Set block size and deviation
my $block_size = 4;
my $deviation  = 0.5;

# Initialise the candidates, maximum and minimum values
my $candidates = [[split(' ', <DATA>)]];
my $maxval = my $minval = $candidates->[0][3];

# Loop through the rest of the DATA
while (<DATA>) {
    my $current = [split];
    if (within_range($current->[3], $maxval, $minval)) {
        push(@$candidates, $current);
        $maxval = $current->[3] if $maxval < $current->[3];
        $minval = $current->[3] if $minval > $current->[3];
    }
    elsif (@$candidates >= $block_size) {
        print_block($candidates);
        push(@$candidates, $current);
        ($candidates, $maxval, $minval) = trim_candidates($candidates);
    }
    else {
        push(@$candidates, $current);
        ($candidates, $maxval, $minval) = trim_candidates($candidates);
    }
}

# deal with stragglers
if ( @$candidates >= $block_size ) {
    print_block( $candidates );
}

sub within_range {
    my ($testval, $testmax, $testmin) = @_;
    return 0 if $testmax - $testval > $deviation;
    return 0 if $testval - $testmin > $deviation;
    return 1;
}

sub print_block {
    my ( $lines ) = @_;
    print "BLOCK\n";
    for my $line ( @$lines ) {
        print join(' ', @$line ), "\n";
    }
}

sub trim_candidates {
    my $worklist = shift;
    # drop the first entry
    shift @$worklist;
    # Check if the remaining worklist qualifies
    my $workmax = my $workmin = $worklist->[0][3];
    foreach my $item (@$worklist) {
        return trim_candidates($worklist)
            unless within_range($item->[3], $workmax, $workmin);
        $workmax = $item->[3] if $workmax < $item->[3];
        $workmin = $item->[3] if $workmin > $item->[3];
    }
    return ($worklist, $workmax, $workmin);
}

0;

__DATA__
1 10492 rs55998931 0.272727272727273 0.4375
1 13418 . 0.25 0.0625
1 13752 . 0.153846153846154 0.25
1 13813 . 0.0357142857142857 0.2
1 13838 . 0.0357142857142857 0.2
1 14907 rs79585140 0.5 0.555555555555556
1 14930 rs75454623 0.535714285714286 0.611111111111111
1 14933 rs199856693 0.0357142857142857 0.0555555555555556
1 14948 rs201855936 0.107142857142857 0
1 10492 rs55998931 1 0.4375
1 10492 rs55998931 1.5 0.4375
1 10492 rs55998931 1.9 0.4375
1 10492 rs55998931 2 0.4375
1 10492 rs55998931 2.6 0.4375
1 13418 blah 20.0 blah
1 13418 blah 20.1 blah
1 13418 blah 20.2 blah
1 13418 blah 20.3 blah
1 13418 blah 20.4 blah
1 13418 blah 20.5 blah
1 13418 blah 20.6 blah
1 13418 blah 20.7 blah
1 13418 blah 30.5 blah
1 13418 blah 30.0 blah
1 13418 blah 30.0 blah
1 13418 blah 30.5 blah
1 13418 blah 30.6 blah
1 13418 blah 30.9 blah
1 13418 blah 30.6 blah
1 13418 blah 30.9 blah

Output:

BLOCK
1 10492 rs55998931 0.272727272727273 0.4375
1 13418 . 0.25 0.0625
1 13752 . 0.153846153846154 0.25
1 13813 . 0.0357142857142857 0.2
1 13838 . 0.0357142857142857 0.2
1 14907 rs79585140 0.5 0.555555555555556
BLOCK
1 13418 blah 20.0 blah
1 13418 blah 20.1 blah
1 13418 blah 20.2 blah
1 13418 blah 20.3 blah
1 13418 blah 20.4 blah
1 13418 blah 20.5 blah
BLOCK
1 13418 blah 20.1 blah
1 13418 blah 20.2 blah
1 13418 blah 20.3 blah
1 13418 blah 20.4 blah
1 13418 blah 20.5 blah
1 13418 blah 20.6 blah
BLOCK
1 13418 blah 20.2 blah
1 13418 blah 20.3 blah
1 13418 blah 20.4 blah
1 13418 blah 20.5 blah
1 13418 blah 20.6 blah
1 13418 blah 20.7 blah
BLOCK
1 13418 blah 30.5 blah
1 13418 blah 30.0 blah
1 13418 blah 30.0 blah
1 13418 blah 30.5 blah
BLOCK
1 13418 blah 30.5 blah
1 13418 blah 30.6 blah
1 13418 blah 30.9 blah
1 13418 blah 30.6 blah
1 13418 blah 30.9 blah

Replies are listed 'Best First'.
Re^3: extract values from a field that are consecutive and within one standard deviation of each other
by tangent (Parson) on Jul 16, 2015 at 14:31 UTC
    Nicely done!
