snippet
lin0
<p>A common practice in [http://en.wikipedia.org/wiki/Machine_learning|machine learning] is to [http://en.wikipedia.org/wiki/Preprocessing|preprocess] the data before building a model. One popular preprocessing technique is data normalization. The variant shown here, often called standardization, rescales each variable so that it has a [http://en.wikipedia.org/wiki/Mean|mean] of 0 and a [http://en.wikipedia.org/wiki/Standard_deviation|standard deviation] of 1. Putting all variables on a comparable scale helps keep the numerical computation efficient and precise.</p>
<p>In this snippet, I show how to normalize data using the [http://pdl.perl.org/|Perl Data Language] (PDL). The input is a piddle (defined below) in which each column represents a variable and each row represents a pattern. Because the first dimension of a piddle indexes the columns, its dimensions are (number of variables, number of patterns). The function returns three piddles: the normalized data, in which each variable has a [http://en.wikipedia.org/wiki/Mean|mean] of 0 and a [http://en.wikipedia.org/wiki/Standard_deviation|standard deviation] of 1, followed by the per-variable mean and standard deviation of the input.</p>
<p>What are Piddles?</p>
<p>Piddles are the array data structure provided by the [http://pdl.perl.org/|Perl Data Language]. As indicated in [id://598007]:</p>
<blockquote><i>Piddles are numerical arrays stored in column major order (meaning that the fastest varying dimension represent the columns following computational convention rather than the rows as mathematicians prefer). Even though, piddles look like Perl arrays, they are not. Unlike Perl arrays, piddles are stored in consecutive memory locations facilitating the passing of piddles to the C and FORTRAN code that handles the element by element arithmetic. One more thing to note about piddles is that they are referenced with a leading $</i></blockquote>
<p>Cheers,</p>
<p>[lin0]</p>
<CODE>
#!/usr/bin/perl
use warnings;
use strict;
use PDL;
use PDL::NiceSlice;
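# ================================
# normalize
# ( $output_data, $mean_of_input, $stdev_of_input ) =
#     normalize( $input_data )
#
# processes $input_data so that each variable (column)
# of $output_data has 0 mean and 1 stdev
#
# $output_data = ( $input_data - $mean_of_input ) / $stdev_of_input
# ================================
sub normalize {
    my ( $input_data ) = @_;

    # statsover works along the first dimension, so swap the
    # dimensions to get one set of statistics per variable
    my ( $mean, $stdev, $median, $min, $max, $adev )
        = $input_data->xchg(0,1)->statsover();

    # constant variables have a stdev of 0; replace it with a
    # tiny value to avoid dividing by zero
    my $idx = which( $stdev == 0 );
    $stdev( $idx ) .= 1e-10;

    # the first dimension holds the variables, the second the patterns
    my ( $number_of_dimensions, $number_of_patterns )
        = $input_data->dims();

    # repeat the per-variable vectors once per pattern, then
    # normalize element by element
    my $output_data
        = ( $input_data - $mean->dummy(1, $number_of_patterns) )
        / $stdev->dummy(1, $number_of_patterns);

    return ( $output_data, $mean, $stdev );
}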
</CODE>
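<p>The following is a minimal usage sketch, not part of the original snippet. The toy data and the variable <tt>$restored</tt> are made up for illustration. So that it runs on its own, it inlines the same computation instead of calling <tt>normalize</tt>, and it assumes that PDL's implicit broadcasting can stand in for the explicit <tt>dummy</tt> calls. It also shows one reason to keep the returned mean and standard deviation: they let you undo the transformation.</p>
<CODE>
#!/usr/bin/perl
use warnings;
use strict;
use PDL;

# Hypothetical toy data: 2 variables (columns) x 3 patterns (rows)
my $input_data = pdl( [ 1, 10 ], [ 2, 20 ], [ 3, 30 ] );

# Per-variable statistics, taken over the pattern dimension
my ( $mean, $stdev ) = $input_data->xchg(0,1)->statsover();

# The per-variable vectors are broadcast across every pattern
my $output_data = ( $input_data - $mean ) / $stdev;

# Undo the normalization with the stored statistics
my $restored = $output_data * $stdev + $mean;

print "mean:       $mean\n";
print "stdev:      $stdev\n";
print "normalized: $output_data\n";
print "restored:   $restored\n";
</CODE>
<p>The same stored statistics can be applied to new patterns, so that they are scaled consistently with the data used to compute them. Note that this sketch omits the zero-stdev guard from <tt>normalize</tt>, so it is only safe when no variable is constant.</p>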