comparing numbers from previous lines in a file?

by coding1227 (Novice)
on Nov 22, 2013 at 01:13 UTC

coding1227 has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone. I am trying to learn more about Perl while working with some weather data. I have an ASCII file (shown below) that has 10 lines, each with several columns. What I would like is to have Perl find an "anomalous" value by comparing a field with the values from the last 3 lines (for the same field, of course). For instance, if we look at the data below:

A15 26.62 765 27.30 4.3
A11 26.63 763 27.28 4.2
A12 26.68 767 27.29 4.3
A16 26.64 768 27.30 4.2
A11 26.62 761 27.31 4.1
A15 26.62 765 27.30 4.3
A15 26.63 763 27.28 4.2
A16 26.68 767 2.29 4.3
A17 26.64 768 27.30 4.2
A18 26.62 761 27.31 4.1

We see here that on the 8th line, the value in column 4 is "off" with respect to the previous lines (all of which are approx. 27.3 degrees Celsius). So, how can I compare the value in column 4 of a line with the values from the previous 3 lines in order to detect these "bad" data values? Assume that any value that drifts more than +/- 0.5 deg C is "bad data". I am not sure how to proceed. Do I need to load the data into an array? Do I need to read the file line by line? Or maybe something else? Any help, suggestions, and especially examples would be really helpful. Thanks in advance.

Sorry... I forgot to mention that this is what I have in my script so far:
my $file = "$WORK_DIR/tmpdata.asc";
open my $info, '<', $file or die "Could not open $file: $!";
while ( <$info> ) {
    my @cols = split ' ';    # split the current line on whitespace
    print "@cols\n";
}
close $info;

Re: comparing numbers from previous lines in a file?
by GrandFather (Saint) on Nov 22, 2013 at 04:59 UTC

    The following code checks every data column by identifying the value least like the others among the most recent few valid lines (lines containing values already flagged as bad are dropped from the history). Bad values are reported along with their line numbers.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $window  = 3;
    my @datCols = qw(temp1 press temp2 rain);
    my @allCols = (@datCols, qw(line code));
    my %limits  = (temp1 => 0.5, press => 10, temp2 => 0.5, rain => 0.2);
    my @history;

    while (<DATA>) {
        my %row;
        (my $code, @row{@datCols}) = split " ";
        $row{code} = $code;
        $row{line} = $.;

        if (@history > 2) {
            for my $col (@datCols) {
                # Find the value least like the others among the recent
                # valid rows plus the current row
                my ($outlier, $avg) = outlier($col, @history, \%row);
                next
                    if abs($outlier->{$col} - $avg)
                        < $limits{$col} * ($window - 1);
                print "$col == $outlier->{$col} (c.f. $avg) in line $outlier->{line}\n";
                $outlier->{$col} = undef;
            }
        }

        push @history, \%row;
        # Toss out any bad data rows
        @history = grep {
            my $row = $_;
            !grep {!defined $row->{$_}} @datCols
        } @history;
        shift @history if @history > 3;
    }

    # Return the entry furthest from the mean of the given field, along
    # with the mean recalculated with that entry excluded
    sub outlier {
        my $field = shift @_;
        my @list = sort {$a->{$field} <=> $b->{$field}}
            grep {defined $_->{$field}} @_;
        my $sum;
        my $outlier;

        return if !@list;

        $sum += $_->{$field} for @list;
        $sum /= @list;

        if (abs($sum - $list[0]->{$field}) > abs($sum - $list[-1]->{$field})) {
            $outlier = $list[0];
        }
        else {
            $outlier = $list[-1];
        }

        $sum = ($sum * @list - $outlier->{$field}) / (@list - 1);
        return $outlier, $sum;
    }

    __DATA__
    A15 26.62 765 27.30 4.3
    A11 26.63 763 27.28 4.2
    A12 26.68 767 27.1 4.3
    A16 26.64 768 27.0 4.2
    A11 26.62 761 26.8 4.1
    A15 26.62 765 26.6 4.3
    A15 26.63 763 2.4 4.2
    A16 26.68 767 26.1 4.3
    A17 26.64 768 25.9 4.2
    A18 26.62 761 25.7 4.1

    Prints:

    temp2 == 2.4 (c.f. 26.8) in line 7
    True laziness is hard work
      ++ GrandFather for a much improved approach! And likewise to all who tackled this one.

      But -- I think -- even the averaging technique demonstrated is still sensitive to data variability (i.e., variances smaller than the extreme example in the OP, already critiqued in responses above) and to ordering. Consider outcomes like
      temp2 == 25.29 (c.f. 27.53) in line 1
      for data where one line's variation is outside the OP's 0.5 C range but close to another line whose data departs from the OP's limit by an even smaller amount:

      __DATA__
      A16 26.68 767 25.29 4.3   # T is outside 0.5 allowed
      A16 26.68 767 28.01 4.3   # T is also outside the permitted range
      A15 26.62 765 27.30 4.3
      A11 26.63 763 27.28 4.2
      A12 26.68 767 27.29 4.3
      A16 26.64 768 27.30 4.2
      A11 26.62 761 27.31 4.1
      A15 26.62 765 27.30 4.3
      A15 26.63 763 27.28 4.2
      A17 26.64 768 27.30 4.2
      A18 26.62 761 27.31 4.1

      Further, assume data whose variance is only marginally outside the spec:

      __DATA__
      A16 26.68 767 26.79 4.3
      A16 26.68 767 27.81 4.3
      A15 26.62 765 27.30 4.3
      A11 26.63 763 27.28 4.2
      A12 26.68 767 27.29 4.3
      A16 26.64 768 27.30 4.2
      A11 26.62 761 27.31 4.1
      A15 26.62 765 27.30 4.3
      A15 26.63 763 27.28 4.2
      A17 26.64 768 27.30 4.2
      A18 26.62 761 27.31 4.1

      In the case above, the code gives no indication that both line 1 and line 2 of the data depart from the OP's 27.3 C by more than 0.5.

      -----------------

      Not tested, but I suspect the OP's window = 3 is too narrow; for complete reliability (with data close to but outside the spec, and with certain orderings of the out-of-spec items in the overall data), one might need to average over the overall data, or at least over a window which takes in a majority of the items *1, rather than over a narrower window (see the sketch at the end of this post).

      *1 Even this -- I speculate -- would rely on the anomalous data being scattered through the set, rather than grouped, and on there being few enough out-of-spec items to minimize the impact of the "bad data" on whatever range of items is used for the window.

      --------

      Yes, there's lotsa handwaving above -- but defining such terms as 'minimize', 'narrower', 'close-to-but...' and so forth would entail (for /me, anyway) hours of thought (hard) and testing (consuming time I don't have today). OTOH, I do hope the above isn't utterly worthless because I hit a blind spot earlier.
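
      FWIW, here's a rough, untested sketch of the "average over the overall data" idea, using the median of the whole column instead of a mean (a median shrugs off a few corrupted lines, where a short window or running mean may not). This is just my own illustration, not code from any reply here:

      #!/usr/bin/env perl
      use strict;
      use warnings;

      # Sketch only: flag column-4 values more than 0.5 away from the
      # median of the whole column.
      my $limit = 0.5;
      my (@rows, @temps);
      while (<DATA>) {
          my @cols = split ' ';
          push @rows,  [$., $_];
          push @temps, $cols[3];
      }
      my @sorted = sort { $a <=> $b } @temps;
      my $median = @sorted % 2
          ? $sorted[$#sorted / 2]
          : ($sorted[@sorted / 2 - 1] + $sorted[@sorted / 2]) / 2;
      for my $i (0 .. $#rows) {
          my ($line, $text) = @{$rows[$i]};
          print "Suspect (c.f. median $median) at line $line: $text"
              if abs($temps[$i] - $median) > $limit;
      }

      __DATA__
      A15 26.62 765 27.30 4.3
      A16 26.68 767 2.29 4.3
      A11 26.63 763 27.28 4.2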

        Hi Kenosis, GrandFather, tangent, kcott, and ww!

        Thank you so much for the warm welcome, as well as your invaluable insights. Indeed, my apologies for not being more detailed about the problem in my previous message. Reading through your posts and insights made me realize that maybe I underestimated the "problem". :)

        So... what it boils down to is the following. I have some weather data (air temperature) transmitted wirelessly from point A to point B. The "problem" is that sometimes a character (or characters) is missing from a line (perhaps due to interference?), causing the numbers to read differently from what they should be in reality.

        Hence, focusing only on field #4 for this example: sometimes the real data collected could look like Ex1 below, but might actually be received as Ex2 due to dropped characters:

        Ex1:

        A15 26.62 765 27.30 4.3
        A11 26.63 763 27.28 4.2
        A12 26.68 767 27.29 4.3
        A16 26.64 768 27.30 4.2
        A11 26.62 761 27.31 4.1
        A15 26.62 765 27.30 4.3
        A15 26.63 763 27.28 4.2
        A16 26.68 767 2.29 4.3
        A17 26.64 768 27.30 4.2
        A18 26.62 761 27.31 4.1

        Ex2:

        A15 26.62 765 2.30 4.3
        A11 26.63 763 27.8 4.2
        A12 26.68 767 27.29 4.3
        A16 26.64 768 27.30 4.2
        A11 26.62 761 7.31 4.1
        A15 26.62 765 27.30 4.3
        A15 26.63 763 27.28 4.2
        A16 26.68 767 2.29 4.3
        A17 26.64 768 27.30 4.2
        A18 26.62 761 27.31 4.1

        There are a variety of factors to tackle, including checking that each line has the correct number of fields, and that each field has a value within a certain range of valid data (for example, in the case of temperature this would go from 0 through 45; note that I've already written a small piece of code that takes care of this).

        As GrandFather and others very insightfully mentioned, if I were to use the "solution" approach I proposed initially, one very real potential problem would be an anomalous data value at the start of the transmitted data, with not enough preceding lines to compare it against.

        One thing to consider is that in reality I have 30 lines of data, and typically fewer than 5 lines have character dropouts in them, if any at all (in my example I just included 10 for simplicity's sake). Therefore, I wonder whether taking an average, a frequency test, or something like that would do the trick.

        What do you guys think?

        Ahhh... as I was writing this I just thought that perhaps one additional check would be to ensure that column #4 has exactly 2 decimal places, since the data is always transmitted with 2 decimal places! A rough sketch of that idea follows.
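
        For example, something like this (an untested sketch of my own; the five-field count and the 0-45 temperature range are just what I described above):

        use strict;
        use warnings;

        # Untested sketch: reject a line unless it has exactly 5 fields,
        # the temperature has exactly two decimal places, and the value
        # is within the plausible 0-45 range. Note that a dropped leading
        # digit (27.29 -> 2.29) still has two decimals and stays in range,
        # so a comparison against neighbouring lines is still needed too.
        while (<DATA>) {
            my @cols = split ' ';
            if (@cols != 5
                or $cols[3] !~ /\A\d+\.\d{2}\z/
                or $cols[3] < 0 or $cols[3] > 45) {
                print "Malformed: $_";
                next;
            }
            print;    # line passed the format checks
        }

        __DATA__
        A15 26.62 765 27.30 4.3
        A16 26.68 767 2.29 4.3
        A15 26.62 765 27.3 4.2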

        PS: kcott... hehe, you are right. My apologies for my initial XY approach :)

Re: comparing numbers from previous lines in a file?
by kcott (Archbishop) on Nov 22, 2013 at 03:27 UTC

    G'day coding1227,

    Welcome to the monastery.

    First off, I need to say that this looks like an XY Problem. You've decided upon a solution, presented test data that fits that solution and asked how to code the solution. What you haven't done is fully explained the problem; in particular, you haven't said why you've chosen "the previous 3 lines" and what's special about the first three lines that guarantees that they will always contain "good" data values.

    If there's simply an omission in your post, and the solution is valid, then this logic does what you want:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use List::Util 'sum';

    my $allowable_drift = 0.5;
    my @test_data;

    while (<DATA>) {
        my $test_field = (split)[3];
        if (@test_data < 3) {
            print;
            push @test_data, $test_field;
            next;
        }
        my $avg = sum(@test_data) / 3;
        if (abs($avg - $test_field) <= $allowable_drift) {
            print;
            shift @test_data;
            push @test_data, $test_field;
        }
        else {
            print "Outlier: $_";
        }
    }

    __DATA__
    A15 26.62 765 27.30 4.3
    A11 26.63 763 27.28 4.2
    A12 26.68 767 27.29 4.3
    A16 26.64 768 27.30 4.2
    A11 26.62 761 27.31 4.1
    A15 26.62 765 27.30 4.3
    A15 26.63 763 27.28 4.2
    A16 26.68 767 2.29 4.3
    A17 26.64 768 27.30 4.2
    A18 26.62 761 27.31 4.1

    Output:

    A15 26.62 765 27.30 4.3
    A11 26.63 763 27.28 4.2
    A12 26.68 767 27.29 4.3
    A16 26.64 768 27.30 4.2
    A11 26.62 761 27.31 4.1
    A15 26.62 765 27.30 4.3
    A15 26.63 763 27.28 4.2
    Outlier: A16 26.68 767 2.29 4.3
    A17 26.64 768 27.30 4.2
    A18 26.62 761 27.31 4.1

    However, with this data (record 8 moved to record 2):

    __DATA__
    A15 26.62 765 27.30 4.3
    A16 26.68 767 2.29 4.3
    A11 26.63 763 27.28 4.2
    A12 26.68 767 27.29 4.3
    A16 26.64 768 27.30 4.2
    A11 26.62 761 27.31 4.1
    A15 26.62 765 27.30 4.3
    A15 26.63 763 27.28 4.2
    A17 26.64 768 27.30 4.2
    A18 26.62 761 27.31 4.1

    You get this output:

    A15 26.62 765 27.30 4.3
    A16 26.68 767 2.29 4.3
    A11 26.63 763 27.28 4.2
    Outlier: A12 26.68 767 27.29 4.3
    Outlier: A16 26.64 768 27.30 4.2
    Outlier: A11 26.62 761 27.31 4.1
    Outlier: A15 26.62 765 27.30 4.3
    Outlier: A15 26.63 763 27.28 4.2
    Outlier: A17 26.64 768 27.30 4.2
    Outlier: A18 26.62 761 27.31 4.1

    I'll leave you to provide an update on the problem if that's appropriate. :-)

    -- Ken

Re: comparing numbers from previous lines in a file?
by Kenosis (Priest) on Nov 22, 2013 at 02:17 UTC

    What if you kept a running average of col4 and checked whether the current value of col4 differs from that average by more than the allowed delta?

    use strict;
    use warnings;

    my $delta = .5;
    my ( $avg, $sum, $i );

    while (<DATA>) {
        $i++;
        my @cols = split;
        if ( $i == 1 ) {
            $avg = $cols[3];
            $sum = $avg;
            next;
        }
        if ( abs( $cols[3] - $avg ) > $delta ) {
            print "Warning, Will Robinson: $_";
            $i--;
            next;
        }
        $sum += $cols[3];
        $avg = $sum / $i;
    }

    __DATA__
    A15 26.62 765 27.30 4.3
    A11 26.63 763 27.28 4.2
    A12 26.68 767 27.29 4.3
    A16 26.64 768 27.30 4.2
    A11 26.62 761 27.31 4.1
    A15 26.62 765 27.30 4.3
    A15 26.63 763 27.28 4.2
    A16 26.68 767 2.29 4.3
    A17 26.64 768 27.30 4.2
    A18 26.62 761 27.31 4.1

    Output:

    Warning, Will Robinson: A16 26.68 767 2.29 4.3

    Hope this helps!

      It's harder than that. Consider what happens if you have data that is trending down, but where each step is well within the 0.5 criterion:

      use strict;
      use warnings;

      my $delta = .5;
      my ($avg, $sum, $i);

      while (<DATA>) {
          $i++;
          my @cols = split;
          if ($i == 1) {
              $avg = $cols[3];
              $sum = $avg;
              next;
          }
          if (abs($cols[3] - $avg) > $delta) {
              print "Warning, Will Robinson: $_";
              $i--;
              next;
          }
          $sum += $cols[3];
          $avg = $sum / $i;
      }

      __DATA__
      A15 26.62 765 27.30 4.3
      A11 26.63 763 27.28 4.2
      A12 26.68 767 27.1 4.3
      A16 26.64 768 27.0 4.2
      A11 26.62 761 26.8 4.1
      A15 26.62 765 26.6 4.3
      A15 26.63 763 26.3 4.2
      A16 26.68 767 26.0 4.3
      A17 26.64 768 25.7 4.2
      A18 26.62 761 25.4 4.1

      Prints:

      Warning, Will Robinson: A15 26.63 763 26.3 4.2
      Warning, Will Robinson: A16 26.68 767 26.0 4.3
      Warning, Will Robinson: A17 26.64 768 25.7 4.2
      Warning, Will Robinson: A18 26.62 761 25.4 4.1
      True laziness is hard work

        Yes, good point. Had wondered whether I'd oversimplified the issue...

        Edit: On second thought, it might not be "harder than that". The OP did say "all of which are approx. 27.3 degrees Celsius", so it depends entirely on the OP's data set. There may never be an instance of such a downward trend in the values -- especially given the way the OP characterized the data.

Re: comparing numbers from previous lines in a file?
by tangent (Parson) on Nov 22, 2013 at 02:41 UTC
    This is a bit trickier than it may seem at first sight. What do you do with the first three lines, since you don't have three previous lines to compare them with? And what if there is more than one "bad" value?

    Perhaps a better approach would be to find the "mode" of all the values, i.e. the most common value, and then compare each value with that.

    Update: You can ignore all of this, "mode" is not the way to go
