PerlMonks  

How would you code this?

by BrowserUk (Pope)
on Apr 06, 2016 at 18:48 UTC ( #1159749=perlquestion )

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

The problem I'm dealing with is the removal of test equipment artifacts from experimentally derived datasets. In layman's terms, the sensing circuit has a tendency to oscillate a little occasionally; which introduces inflections into the data which must be removed before the data can be used for the next part of the processing which requires monotonic data.

This is a case where a picture is worth a thousand words: http://oi63.tinypic.com/2z72xbr.jpg. The blue is (a small subset of) the raw data and the red is the desired output.

This code performs the cleanup. (It runs as a pipe filter, and I've adapted it to use __DATA__ for the small subset shown in the graphic for posting purposes.):

The basic mechanism is to compare the points as successive pairs and look for a negative slope. When a negative slope is found, *both* points are removed from the dataset.

The complication is that pairwise comparisons have to continue with the next point against the *previous* unremoved point.

To (I hope) clarify, in the following:

H
G
F
E
D
C
B
A
  • Point A is passed through
  • Points A & B are compared, positive slope: pass B through.
  • Points B & C are compared, positive slope: pass C through.
  • Points C & D are compared, negative slope: block D and remove C.
  • Points B & E are compared, positive slope: pass E through.
  • Points E & F are compared, negative slope: block F and remove E.
  • Points B & G are compared, positive slope: pass G through.
  • Points G & H are compared, positive slope: pass H through.

Resultant dataset: A, B, G, H.

The above code works; but is there a better way of coding it?

I'm not going to define "better"; I'd simply like to see a range of alternative codings.

(Please note: this is not about filtering or smoothing or interpolating the data; just removing spurious values.)
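To make the A..H walkthrough concrete, here is a minimal stack-based sketch of the mechanism described above (illustrative only: the toy (x, y) points and the sign-product slope test are assumptions, not the original filter):

```perl
#! perl -slw
use strict;

# Sketch of the mechanism described above (a stack of kept points):
# a negative slope between the current point and the last *kept* point
# removes both -- pop the kept one, drop the current one -- so the next
# comparison is automatically against the previous unremoved point.
sub remove_inflections {
    my @kept;
    for my $p ( @_ ) {
        if ( @kept
          and ( $p->[1] - $kept[-1][1] ) * ( $p->[0] - $kept[-1][0] ) < 0 ) {
            pop @kept;          # remove the previous point (C, E above) ...
        }                       # ... and block the current one (D, F)
        else {
            push @kept, $p;
        }
    }
    return @kept;
}

# The A..H walkthrough as made-up (x, y) pairs: C->D and E->F slope down.
my @points = ( [1,1], [2,2], [3,3], [2,4], [4,5], [3,6], [5,7], [6,8] );
print join ' ', @$_ for remove_inflections( @points );   # A, B, G, H
```

Using the product of the deltas rather than a quotient avoids a division by zero when successive x values repeat.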

The output from the sample code:


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: How would you code this?
by kennethk (Abbot) on Apr 06, 2016 at 20:14 UTC
    For the sample data, your y-values are monotonic. If this is generally true, there's no point in including them in the comparison.
    my $modified = reduce {
        my( $x1, $y1, $x2, $y2 ) = ( @{ @{ $a }[ -1 ] }, @$b );
        $x2 < $x1 ? pop @{ $a } : push @{ $a }, $b;
        $a;
    } [ shift @points ], @points;
    Assuming you necessarily hold all data in memory, you could also use splice in a while, which feels more natural to me:
    my $i = 1;
    while ($i < @points) {
        if ($points[$i-1][0] > $points[$i][0]) {
            splice @points, $i-1, 2;
            --$i or $i = 1;
        }
        else {
            $i++;
        }
    }
    or just a streaming push/pop, which is algorithmically what you've done, but reads better, I think.
    my $modified = [];
    for my $point (@points) {
        if (@$modified and $modified->[-1][0] > $point->[0]) {
            pop @$modified;
        }
        else {
            push @$modified, $point;
        }
    }
    I'd like there to be a more streaming solution here, but there's no way to tell if the first point should have been accepted until you are past halfway in the list, and you don't a priori know how long the list is.

    Golfed:

    my @m; @m&&$m[-1][0]>$_->[0]?pop @m:push @m,$_ for @points;
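As a sanity check that the golfed line and the push/pop loop above keep the same points, both can be run over toy data (the eight points below are made up to mimic the A..H example, with two dips):

```perl
use strict;
use warnings;

# Toy data with two negative-slope dips (C-D and E-F analogues).
my @points = ( [1,1], [2,2], [3,3], [2,4], [4,5], [3,6], [5,7], [6,8] );

# Long form: push/pop on x, as in the third variant above.
my @long;
for my $point (@points) {
    if (@long and $long[-1][0] > $point->[0]) { pop @long }
    else                                      { push @long, $point }
}

# Golfed form.
my @m;
@m && $m[-1][0] > $_->[0] ? pop @m : push @m, $_ for @points;

# Both keep the same four survivors.
print join( ' ', @$_ ), "\n" for @m;
```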

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      I like your 3rd version best; and that was going to be my choice before graff threw his spanner into my works.

      Now I've more research to do; and I'll look at adapting it once I decide which way I'm going to jump.

      Thanks a lot for your input.


Re: How would you code this?
by graff (Chancellor) on Apr 07, 2016 at 06:14 UTC
    First, I have to wonder if the "desired" output you showed is "definitively correct" (in some sense), or whether it's simply "good enough to move along" (and other possible outputs - discarding a slightly different set of data rows - might be just as good, if not better).

    Second, I'm curious about this distinctive property of your data sample: there is exact repetition of particular X values in the regions where certain data rows need to be discarded. Is this simply an artifact of how you created this data sample? If not, then would it happen to be a stable, reliable property of the actual data? If it is stable, this is something an algorithm could exploit. For example:

    (a) if the current x is lower than the previous x:
        (a.1) if the current x has been seen before, discard the current point
        (a.2) else discard the previous point.
    (b) elsif the current x is identical to the prev. x (due to a.1 on prev. iteration)
        (b.1) discard previous point
    That logic will discard a different set of points than your "desired" output, but the selected points will be monotonic (if that property of repeating x values holds true).

    If your pipeline streams get really huge, you might need to worry about not keeping track of too many "previously seen x values", but perhaps that's not a concern.

    Here's the perl code for my approach (minus the input __DATA__):

    #! perl -slw
    use strict;

    printf "%s", scalar <DATA>;

    my @points = map[ split ' ', $_ ], <DATA>;
    my @modified;
    my %xseen;

    push @modified, shift @points;
    $xseen{$modified[-1][0]} = undef;

    while ( @points ) {
        if ( $points[0][0] < $modified[-1][0] ) {
            if ( exists( $xseen{$points[0][0]} )) {
                shift @points;
                next;
            }
            else {
                pop @modified;
            }
        }
        elsif ( $points[0][0] eq $modified[-1][0] ) {
            pop @modified;
        }
        push @modified, shift @points;
        $xseen{$modified[-1][0]} = undef;
    }

    print join ' ', @$_ for @modified;
    And below is the output, pasted next to your "desired" output - mine is on the left side, and yours is on the right. Note that my code outputs 40 points compared to your 35 (I have the corresponding x's aligned so you can see the differences more easily). Also, there's one point where I kept the same x value as you, but selected a different y value at that x. (As a general rule, other things being relatively equal, I'd expect it might be desirable to discard fewer data points - but I suppose that depends on the nature of your test equipment.)

    (updated to fix the wording in the 2nd paragraph and pseudo-code)

      Graff. You've certainly given me considerable food for thought. And a bunch of extra work to do. So "Thank you" and "thaaaanks a bunch" :)

      Here is a composite zoom on the 5 differences between your output and mine.

      • Top left: Your algorithm picks two points to the left of the two mine picks.

        I think the ideal would be to pick my first point, your second point and the point slap bang in between all four points; but I see no algorithm that would make that choice without hard coding a special case.

        In the end, I think either chosen pair is equally valid.

      • Top right: Your algorithm retains one extra point that happens to fall almost exactly on the line between the two points mine retains.

        I concur with you that discarding less is a good thing; in this case, I think the difference would be negligible.

      • Bottom left: Yours retains two extra points.

        The upper extra point is much the same as the previous; negligible difference beyond keeping an extra point.

        For the lower extra point, the difference is bigger and more significant.

      • Lower right (inset): Two extra points; the difference is about the same as the lower point above.

        I'm undecided if these differences are important; but they are extra points retained, which I like.

      One of, if not the, primary goals of using a discrete filter algorithm, rather than a continuous fitting or smoothing algorithm, is the desire to retain as much of the actual data as possible. Continuous algorithms like moving average, 3-point median and Loess all have effects upon all the points, not just those at and around the discontinuities; often inventing new points, and subtly shifting existing points well away from the discontinuities; and they also don't guarantee to remove all inflections.
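For contrast, even the simplest continuous smoother shows the objection in miniature: a hypothetical 3-point centred moving average (a sketch, not any of the named production filters) rewrites every interior point, monotonic or not:

```perl
use strict;
use warnings;

# 3-point centred moving average on y; endpoints pass through unchanged.
# Every interior point gets an invented y value, even where the data
# was already clean and monotonic.
sub moving_average3 {
    my @p = @_;
    my @out = ( $p[0] );
    for my $i ( 1 .. $#p - 1 ) {
        push @out, [ $p[$i][0],
                     ( $p[$i-1][1] + $p[$i][1] + $p[$i+1][1] ) / 3 ];
    }
    push @out, $p[-1];
    return @out;
}

my @clean    = ( [1,1], [2,2], [3,4], [4,5] );   # already monotonic
my @smoothed = moving_average3( @clean );
# y=2 becomes (1+2+4)/3 and y=4 becomes (2+4+5)/3: original data is lost.
```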

      Upshot: I'm going to have to run your algorithm against mine on a few much larger samples and check to ensure that there are no significant downsides to your algorithm. If not, I will be dumping mine in favour of yours.

        I noticed that some x values are repeated up to five times, and that the selection of a trajectory through such a region (depending on what the algorithm does) could vary between:
        c          c          c
            vs   b     vs
          b                 b
        a          a          a
        (i.e. after dropping some number of points from the input series, the middle point either falls midway between the other two, or creates an "elbow" close to the previous or following point).

        I thought about working out a way to maintain a stack, such that from the first (bottom) to the last (top) stack element, there would be at most three distinct x values (that seems to be the extent of the "oscillation" you're seeing).

        Once a fourth x value appears in the next point, review the stack and output the points that comprise the smoothest trajectory. But that's a lot of work for a potentially insignificant gain in "accuracy".

        UPDATE: As for your "primary" goal of "preserving the original data" as much as possible, as opposed to creating a smoothed sequence with adjustments affecting all data points: given that the input appears to be quantized, any desire to preserve that original data is actually going to preserve the nature of the measuring device, rather than the nature of the thing being measured. Certainly there's value in that, but depending on what your downstream processes are supposed to do with the data, you might rather let those processes get a "fully smoothed" version of the data, because this represents a presumably reasonable fiction, as opposed to the quantized, jagged fiction being created by your measuring device.

        Another update: obviously, by preserving the original data - but not necessarily using it as-is for certain downstream processes - you'll get to do comparisons if/when you use a different device (or a different configuration of the current device).

Re: How would you code this?
by Marshall (Canon) on Apr 06, 2016 at 21:37 UTC
    Here's my go at it. I did a compare file and it looks like this produces an identical output to your code.
    #! perl -slw
    use strict;
    use List::Util qw[ reduce ];
    use Data::Dump qw[ pp ];

    printf "%s", scalar <DATA>;

    my @points = map[ split ' ', $_ ], <DATA>;

    my $i = 0;
    while ($i < @points-1) {
        my ($x1,$y1) = @{$points[$i]};
        my ($x2,$y2) = @{$points[$i+1]};
        last unless defined $x2;
        if ( ($y2 - $y1 ) / ( $x2 - $x1 ) < 0 ) {
            splice @points, $i, 2;  # remove both points
            $i -= 2;                # backup an extra point to restart
            $i = 0 if $i < 0;
        }
        $i++;
    }
    print join ' ', @$_ for @points;
    __DATA__
    X Y
    0.0036759 0.4018006
    0.0036962 0.4074616
    0.0037064 0.4124646
    0.0036962 0.4170003
    0.0037064 0.4216399
    0.0037166 0.4351084
    0.0037064 0.4396787
    0.0037166 0.4438854
    0.0037268 0.4518142
    0.0037166 0.4593275
    0.0037268 0.4628417
    0.0037370 0.4730210
    0.0037268 0.4764141
    0.0037370 0.4832176
    0.0037268 0.4864203
    0.0037370 0.4894152
    0.0037471 0.4952320
    0.0037675 0.4979326
    0.0037879 0.5014988
    0.0038082 0.5057747
    0.0038184 0.5166984
    0.0038286 0.5332830
    0.0038184 0.5387016
    0.0038082 0.5500408
    0.0038184 0.5558402
    0.0038286 0.5613627
    0.0038388 0.6026338
    0.0038591 0.6216075
    0.0038490 0.6257104
    0.0038693 0.6343489
    0.0038490 0.6382094
    0.0038693 0.6420872
    0.0038490 0.6455669
    0.0038693 0.6536515
    0.0038591 0.6673797
    0.0038693 0.6709805
    0.0038795 0.6808655
    0.0038897 0.6866130
    0.0039202 0.6981425
    0.0039406 0.7057251
    0.0039610 0.7105550
    0.0039508 0.7161121
    0.0039610 0.7216518
    0.0039712 0.7329564
    0.0039610 0.7381326
    0.0039712 0.7433434
    0.0039813 0.7482426
    0.0039712 0.7529860
    0.0039813 0.7577987
    0.0039712 0.7622998
    0.0039915 0.7713018
    0.0040119 0.7911064
    0.0039915 0.7945342
    0.0040017 0.7981869
    0.0040119 0.8014242
    0.0040221 0.8074314
    0.0040119 0.8101147
    0.0040221 0.8130404
    0.0040119 0.8156025
    0.0040221 0.8210730
    0.0040119 0.8237044
    0.0040221 0.8264223
    0.0040526 0.8290191
    0.0040628 0.8323083
    0.0040933 0.8361688
    0.0041035 0.8409814
    0.0041239 0.8466942
    0.0041137 0.8527880
    0.0041239 0.8591068
    0.0041341 0.864785
    0.0041239 0.8703767
    0.0041443 0.8760895
    0.0041544 0.8858707
    0.0041646 0.8995296
    0.0041748 0.9034420
    0.0041646 0.9072853
    0.0041748 0.9111804
    0.0041646 0.9148159
    0.0041748 0.9187110
    0.0041850 0.9221387
    0.0041748 0.9254799
    0.0041850 0.9288903
    0.0041748 0.9322834
    0.0041850 0.9357977
    0.0041952 0.9453537
    Output:
    X Y
    0.0036759 0.4018006
    0.0036962 0.4074616
    0.0037064 0.4216399
    0.0037166 0.4438854
    0.0037268 0.4628417
    0.0037370 0.4894152
    0.0037471 0.4952320
    0.0037675 0.4979326
    0.0037879 0.5014988
    0.0038082 0.5057747
    0.0038184 0.5558402
    0.0038286 0.5613627
    0.0038388 0.6026338
    0.0038693 0.6709805
    0.0038795 0.6808655
    0.0038897 0.6866130
    0.0039202 0.6981425
    0.0039406 0.7057251
    0.0039610 0.7216518
    0.0039712 0.7433434
    0.0039915 0.7713018
    0.0040017 0.7981869
    0.0040119 0.8014242
    0.0040221 0.8264223
    0.0040526 0.8290191
    0.0040628 0.8323083
    0.0040933 0.8361688
    0.0041035 0.8409814
    0.0041239 0.8591068
    0.0041443 0.8760895
    0.0041544 0.8858707
    0.0041646 0.8995296
    0.0041748 0.9187110
    0.0041850 0.9357977
    0.0041952 0.9453537

      I concur that this produces the same output. My main objections to it are: a) it uses more decision points to achieve it; b) it messes with the input array (curable).

      Thanks for your input.


        I made a couple of slight revisions (eliminated the "last" statement and fixed a boundary condition issue at index 0). This still produces the same output.

        Appreciated your thanks. I leave "what is better" up to you.

        #! perl -slw
        use strict;

        printf "%s", scalar <DATA>;

        my @points = map[ split ' ', $_ ], <DATA>;

        my $i = 0;
        while ($i < @points-2)    # quit when there aren't 2 points left
        {
            my ($x1,$y1) = @{$points[$i]};
            my ($x2,$y2) = @{$points[$i+1]};
            if ( ($y2 - $y1 ) / ( $x2 - $x1 ) < 0 ) {
                splice @points, $i, 2;  # remove both points
                $i--;                   # backup a point to restart
                $i = 0 if $i < 0;       # boundary condition at index 0
            }
            else {
                $i++;
            }
        }
        print join ' ', @$_ for @points;
Re: How would you code this?
by kennethk (Abbot) on Apr 07, 2016 at 00:03 UTC
    Fun, kinda silly, and doesn't actually work on your trial set. Failures happen around points with large deviations from the smoothed concept.
    my @also = sort { $a->[0] <=> $b->[0] } @points;
    for my $i (reverse 0 .. $#points) {
        splice @also, $i, 1 if $points[$i] ne $also[$i];
    }
    More generally, I feel like there is a merge-sort-like solution in there somewhere, but can't quite seem to distill it in my brain. Not that it would be practical, since push-pop is already O(N), but that's not the point, right?

    Streaming buffer w/ STDIN/STDOUT:

    my $frame = 10;
    my @buffer;
    $, = ' ';
    $\ = "\n";
    while (<>) {
        my @row = split;
        if (@buffer and ($buffer[-1][0] < $row[0]) ^ ($buffer[-1][1] < $row[1])) {
            pop @buffer;
        }
        else {
            print @{shift @buffer} if $frame < push @buffer, \@row;
        }
    }
    print @$_ for @buffer;
    Note the ridiculous micro-optimization for comparison.


Re: How would you code this?
by Anonymous Monk on Apr 06, 2016 at 20:48 UTC

    The pipe filter also uses reduce?

    The idea that comes to mind is a sliding window, like @ten below, applied before your program's reduce.

    ahoy(\@points);

    sub ahoy {
        my( $points ) = @_;
        my @ten;
        for my $point ( @$points ){
            push @ten, $point;
            if( @ten > 1 and $ten[-1][0] < $ten[-2][0] ){   ## backing up? ditch it
                splice @ten, -2;                            ## pop pop @ten;
            }
            if( @ten > 9 ){
                print join ' ', @$_ for splice @ten, 0, 8;  ## shift 8
            }
        }
        print join ' ', @$_ for @ten;
        print '####';
    }

      As graff pointed out, algorithmically, there is no way to predict how big a window would be required.

      Thanks for the input.


Re: How would you code this?
by Laurent_R (Canon) on Apr 07, 2016 at 15:42 UTC
    This may seem to be a very naive approach, but what about simply removing lines where X is going down compared to previous values? Something like this:
    use strict;
    use warnings;

    my $previous_x = 0;
    while (<DATA>) {
        my ($x, $y) = split;
        next if $x < $previous_x;
        $previous_x = $x;
        print;
    }
    __DATA__
    0.0036759 0.4018006
    0.0036962 0.4074616
    0.0037064 0.4124646
    [rest of input data omitted for brevity]
    Output:
    0.0036759 0.4018006
    0.0036962 0.4074616
    0.0037064 0.4124646
    0.0037064 0.4216399
    0.0037166 0.4351084
    0.0037166 0.4438854
    0.0037268 0.4518142
    0.0037268 0.4628417
    0.0037370 0.4730210
    0.0037370 0.4832176
    0.0037370 0.4894152
    0.0037471 0.4952320
    0.0037675 0.4979326
    0.0037879 0.5014988
    0.0038082 0.5057747
    0.0038184 0.5166984
    0.0038286 0.5332830
    0.0038286 0.5613627
    0.0038388 0.6026338
    0.0038591 0.6216075
    0.0038693 0.6343489
    0.0038693 0.6420872
    0.0038693 0.6536515
    0.0038693 0.6709805
    0.0038795 0.6808655
    0.0038897 0.6866130
    0.0039202 0.6981425
    0.0039406 0.7057251
    0.0039610 0.7105550
    0.0039610 0.7216518
    0.0039712 0.7329564
    0.0039712 0.7433434
    0.0039813 0.7482426
    0.0039813 0.7577987
    0.0039915 0.7713018
    0.0040119 0.7911064
    0.0040119 0.8014242
    0.0040221 0.8074314
    0.0040221 0.8130404
    0.0040221 0.8210730
    0.0040221 0.8264223
    0.0040526 0.8290191
    0.0040628 0.8323083
    0.0040933 0.8361688
    0.0041035 0.8409814
    0.0041239 0.8466942
    0.0041239 0.8591068
    0.0041341 0.864785
    0.0041443 0.8760895
    0.0041544 0.8858707
    0.0041646 0.8995296
    0.0041748 0.9034420
    0.0041748 0.9111804
    0.0041748 0.9187110
    0.0041850 0.9221387
    0.0041850 0.9288903
    0.0041850 0.9357977
    0.0041952 0.9453537
    You might also change the relevant code line to:
    next if $x <= $previous_x;
    if you want to remove duplicate X values.
      what about simply removing lines where X is going down compared to previous values?

      That doesn't achieve the requirement of strict monotonicity (is that a real word? :). I.e. it produces runs of points with different Y-values and the same X-value:

      03 0.0037064 0.4124646
      04 0.0037064 0.4216399
      05 0.0037166 0.4351084
      06 0.0037166 0.4438854
      07 0.0037268 0.4518142
      08 0.0037268 0.4628417
      09 0.0037370 0.4730210
      10 0.0037370 0.4832176
      11 0.0037370 0.4894152

        Yeah, I realized that, but did not know whether it was important. But that's why I also said that you could remove duplicate X values if needed:
        You might also change the relevant code line to:
        next if $x <= $previous_x;
        if you want to remove duplicate X values.

Node Type: perlquestion [id://1159749]
Approved by stevieb