PerlMonks  

How to best eliminate values in a list that are outliers

by Anonymous Monk
on Nov 09, 2015 at 21:25 UTC (#1147296=perlquestion)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I want to take a list of numbers, identify the "global min" of each set, and then print the difference between each item in the set and that "global min". I can do that part, but I am struggling to code a way to exclude outliers.

The following example data is not a problem:

A 4
A 4
A 3
A 2
B 1
B 5
B 6

Here the "global min" values for "A" and "B" are 2 and 1 respectively, and I can use them to find the difference for each instance (line) of A or B.
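The procedure described above could be sketched as follows. This is only an illustration, assuming the input arrives as one "label value" pair per line; the variable names are made up for the example, and only the core List::Util module is used.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(min);

# Sample input from the question: one "label value" pair per line (assumed layout).
my @lines = ('A 4', 'A 4', 'A 3', 'A 2', 'B 1', 'B 5', 'B 6');

# Split each line into [label, value], keeping the original row order.
my @rows = map { [ split ' ' ] } @lines;

# Collect the values seen for each label.
my %values_for;
push @{ $values_for{ $_->[0] } }, $_->[1] for @rows;

# The "global min" of each label is the smallest value in its group.
my %global_min;
$global_min{$_} = min @{ $values_for{$_} } for keys %values_for;

# Print every row together with its difference from the group's minimum.
for my $row (@rows) {
    my ($label, $value) = @$row;
    printf "%s %s diff=%s\n", $label, $value, $value - $global_min{$label};
}
```

No outlier handling is attempted here; it only covers the part of the task that already works.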

The following data shows two situations that cause me problems:

C 1
C 80000
C 2
C 4
C 1200
D .1
D 1500
D 1700
D 2100
D 3200

In C, the global min is fine, but outlier 80,000 will skew results and should be removed. In D, .1 will be set as the "global min" but it is a mistake and 1500 should be set as the true global min.

Any and all thoughts would be very appreciated!! Thanks

Replies are listed 'Best First'.
Re: How to best eliminate values in a list that are outliers
by Athanasius (Bishop) on Nov 10, 2015 at 03:56 UTC
    In D, .1 will be set as the "global min" but it is a mistake and 1500 should be set as the true global min.

    To elaborate on ww’s response:

    As LanX says, the standard test for outliers is Grubbs’s test. There is an online calculator for Grubbs’s test at http://www.graphpad.com/quickcalcs/grubbs2/, and entering your sample data for series “D”, with Alpha set to either 0.05 or 0.01, produces the following result:

    Row   Value     Z       Significant Outlier?
    1        0.1    1.471   Furthest from the rest, but not a significant outlier (P > 0.05).
    2     1500.0    0.173
    3     1700.0    0.000
    4     2100.0    0.346
    5     3200.0    1.298

    So, why do you identify 0.1 as an outlier?
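The Z column above can be reproduced in a few lines of Perl. This is a minimal sketch, not the calculator's own code: Z is taken as the absolute distance from the mean divided by the sample standard deviation (n - 1 in the denominator). The critical value 1.715 is an assumption, quoted from commonly published Grubbs tables for N = 5 at two-sided alpha = 0.05, and should be checked before relying on it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(sum max);

# Z score of each value: |x - mean| / sample standard deviation (n - 1).
sub z_scores {
    my @x    = @_;
    my $n    = @x;
    my $mean = sum(@x) / $n;
    my $sd   = sqrt( sum( map { ($_ - $mean)**2 } @x ) / ($n - 1) );
    return map { abs($_ - $mean) / $sd } @x;
}

my @d = (0.1, 1500, 1700, 2100, 3200);    # series "D" from the question
my @z = z_scores(@d);

# ASSUMPTION: tabulated Grubbs critical value for N = 5, two-sided alpha = 0.05.
my $g_crit = 1.715;

# Grubbs's test only examines the single most extreme value.
my $z_max = max @z;
for my $i (0 .. $#d) {
    my $note = $z[$i] == $z_max
        ? ($z[$i] > $g_crit ? 'significant outlier' : 'furthest from the rest, not significant')
        : '';
    printf "%d  %8.1f  %.3f  %s\n", $i + 1, $d[$i], $z[$i], $note;
}
```

With only five points the largest possible Z is (N - 1) / sqrt(N), about 1.789, so the test has very little room to flag anything.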

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Presumably because his data does not represent a quantity that is approximately normally distributed.

      Transforming D with f = 1 / x yields:

      Row   Value                    Z                        Significant Outlier?
      1     10.000000000000000000    1.78885438120805620000   Significant outlier. P < 0.05
      2      0.000666666666666667    0.44717876261411860000
      3      0.000588235294117647    0.44719630129821775000
      4      0.000476190476190476    0.44722135656121640000
      5      0.000312500000000000    0.44725796073450363000

      Transforming D with f = log x gives:

      Row   Value               Z                    Significant Outlier?
      1     -2.30258509299405   1.7851028840345886   Significant outlier. P < 0.05
      2      7.31322038709030   0.3777123990128823
      3      7.43838353004431   0.4058644616784443
      4      7.64969262371151   0.4533927252416877
      5      8.07090608878782   0.5481332981015742
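The two transformed tables can be recomputed with a short script. This is a sketch under the same assumptions as Grubbs's Z statistic (distance from the mean over the sample standard deviation); the transform names and the `%z_max` hash are illustrative only. Both transforms require strictly positive data, which holds for series D.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(sum max);

# Z score of each value: |x - mean| / sample standard deviation (n - 1).
sub z_scores {
    my @x    = @_;
    my $n    = @x;
    my $mean = sum(@x) / $n;
    my $sd   = sqrt( sum( map { ($_ - $mean)**2 } @x ) / ($n - 1) );
    return map { abs($_ - $mean) / $sd } @x;
}

my @d = (0.1, 1500, 1700, 2100, 3200);    # series "D" from the question

# The two transforms tried above; both need x > 0.
my %transform = (
    'f = 1 / x' => sub { 1 / $_[0] },
    'f = log x' => sub { log $_[0] },    # natural logarithm
);

my %z_max;    # largest Z per transform, the value Grubbs's test looks at
for my $name (sort keys %transform) {
    my @t = map { $transform{$name}->($_) } @d;
    my @z = z_scores(@t);
    $z_max{$name} = max @z;
    print "$name\n";
    printf "  %-22.15g Z = %.4f\n", $t[$_], $z[$_] for 0 .. $#t;
}
```

Whether either transform is appropriate depends on what the numbers measure, which the question does not say.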

Re: How to best eliminate values in a list that are outliers
by karlgoethebier (Monsignor) on Nov 09, 2015 at 22:16 UTC
    "...all thoughts would be very appreciated"

    Doesn't Statistics::Descriptive provide something to handle outliers?

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

Re: How to best eliminate values in a list that are outliers
by ww (Archbishop) on Nov 10, 2015 at 03:02 UTC
    "In D, .1 will be set as the "global min" but it is a mistake and 1500 should be set as the true global min."

    How do you know (or how and when did you decide) that 0.1 is an error? How should we know that? Please see "I know what I mean. Why don't you?"

    In the same vein [ more or less... :-) ] how much will you allow your data set to vary (range) before wanting a value treated as an outlier and removed? IOW, is the judgement statistical-arithmetic or is it subjective?


    ++$anecdote ne $data

Re: How to best eliminate values in a list that are outliers
by LanX (Archbishop) on Nov 09, 2015 at 21:31 UTC
      Thanks for the reply! I understand the theory behind what I want to do, but I am not very good at coding, so I was looking more for recommendations on which Perl commands or modules I can use to do this.

        If you have a good understanding of the theory behind detecting the outliers you should be able to describe a series of steps to detect them. Example data helps, but it is the criteria used to make the decision that needs to be described.

        Once you have the series of steps used to make the decision (pretend you are following the steps by hand using pen and paper), you should then be able to code the solution, or at least get some help coding it.

        Premature optimization is the root of all job security

        > but I am not very good at coding

        Alas .... unfortunately none of us is! :(

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: How to best eliminate values in a list that are outliers
by kevbot (Priest) on Nov 10, 2015 at 03:22 UTC
    Here is a blog post that may be helpful: Finding Outliers in Numerical Data. The article focuses on packages/implementations in R; however, the article provides some good background information on 4 different ways to identify outliers.
      Here is an attempt at implementing the Hampel identifier method that is mentioned in the article. However, reliable identification of outliers is problematic with so few datapoints.
      #!/usr/bin/env perl
      use strict;
      use warnings;
      use Statistics::Descriptive;
      use List::Util qw/min max/;

      my $stat = Statistics::Descriptive::Full->new();

      #my @data = (4, 4, 3, 2); # "A" data
      #my @data = (1, 5, 6); # "B" data
      #my @data = (1, 80000, 2, 4, 1200); # "C" data
      my @data = (0.1, 1500, 1700, 2100, 3200); # "D" data

      print "Starting data: ", join(", ", @data), "\n\n";

      $stat->add_data(@data);

      # References
      # http://exploringdatablog.blogspot.com/2013/02/finding-outliers-in-numerical-data.html
      # https://en.wikipedia.org/wiki/Median_absolute_deviation

      my $median = $stat->median();
      my @abs_res = map { abs($median - $_) } @data;

      my $abs_res_stat = Statistics::Descriptive::Full->new();
      $abs_res_stat->add_data(@abs_res);
      my $MAD = $abs_res_stat->median();

      my $t = 3;
      my $lower_limit = $median-$t*$MAD;
      my $upper_limit = $median+$t*$MAD;

      print " Median: $median\n";
      print " MAD: $MAD\n";
      print " t: $t\n\n";
      print "Lower limit: $lower_limit\n";
      print "Upper Limit: $upper_limit\n\n";

      my @filtered_data;
      foreach my $datum (@data) {
          my $is_outlier = (($datum < $lower_limit) or ($datum > $upper_limit)) ? 1 : 0;
          unless($is_outlier) { push @filtered_data, $datum };
      }

      print "Filtered data: ", join(", ", @filtered_data), "\n\n";
      print "Minimum value of filtered data is: ", min(@filtered_data), "\n";

      exit;
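As a rough follow-up, the same median/MAD rule can be wrapped in a function and applied to all four example groups, followed by the difference from each group's filtered minimum. This is a sketch only: `median` and `hampel_filter` are hypothetical helper names, the threshold t = 3 is carried over from the script above, and only core List::Util is used instead of Statistics::Descriptive.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(min);

# Median of a list of numbers (mean of the two middle values for even counts).
sub median {
    my @s   = sort { $a <=> $b } @_;
    my $mid = int(@s / 2);
    return @s % 2 ? $s[$mid] : ($s[$mid - 1] + $s[$mid]) / 2;
}

# Keep the values within $t median absolute deviations (MAD) of the median.
sub hampel_filter {
    my ($t, @data) = @_;
    my $med = median(@data);
    my $mad = median(map { abs($_ - $med) } @data);
    return grep { abs($_ - $med) <= $t * $mad } @data;
}

# The four example groups from the question.
my %data = (
    A => [4, 4, 3, 2],
    B => [1, 5, 6],
    C => [1, 80000, 2, 4, 1200],
    D => [0.1, 1500, 1700, 2100, 3200],
);

my %kept;    # values that survive the filter, per group
for my $group (sort keys %data) {
    my @kept = hampel_filter(3, @{ $data{$group} });
    $kept{$group} = \@kept;
    my $min = min @kept;
    print "$group: kept (", join(", ", @kept), "), min $min, diffs (",
        join(", ", map { $_ - $min } @kept), ")\n";
}
```

With t = 3 this keeps 1, 2, 4 for C and 1500, 1700, 2100 for D, so 1200 and 3200 are dropped along with 80000 and 0.1. It also drops the 1 from B, which the question treats as valid, so the caveat about small samples applies.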

Node Type: perlquestion [id://1147296]
Approved by philipbailey