astaines's scratchpad

by astaines (Curate)
on Jun 01, 2004 at 21:43 UTC ( [id://358634] : scratchpad )

package Data::Validator;

=head1 NAME

Data::Validator - Factory class to validate data items

=head1 DESCRIPTION

This is an attempt to create an object which will permit semi-automatic
verification of a data value.

=head1 SYNOPSIS

    use Data::Validator;
    my $item = Data::Validator->new();  # Create a new Data::Validator, called $item.

    # Set values
    $item->name('fred');
    $item->values([1,2,3]);             # or $item->values(\@array);
    $item->missing('*');                # or $item->missing(''); undef is unlikely to be sensible!
    $item->min(0);
    $item->max(100);
    $item->verify($reference_to_subroutine);     # Used in the $item->validate() function
    $item->transform($reference_to_subroutine);  # Used in the $item->put() function

    # Get values
    my $name    = $item->name();
    my $values  = $item->values();      # A hash reference keyed by the approved values
    my $missing = $item->missing();
    # etc...

    # Use it
    $item->validate($datum);  # Returns 1 for success, 2, 4, 8, 16 or 32 for different failures
    $item->put($datum);

=head1 USAGE

Many people work with data organised as records, each containing
(potentially many) variables. It is often necessary to process files of
such records, and to test every variable within every record to ensure
that each one is valid. I do this before putting data from very large flat
files into my databases. For each variable I needed to define specific,
sometimes complex, rules for validity, then implement them and check them.
This is what Data::Validator is for.

Note carefully that Data::Validators handle only one scalar value at a
time. This value could come from a file, a database, an array, a hash or
your granny's parrot. Data::Validators don't care.

I use Data::Validator as follows. I create one for every named variable in
my data file. In many real applications most of this setup can be done by
looping over a list of variable names, creating many Data::Validators, each
named for the corresponding variable. Common features, like missing values
and names, can be set in this loop. Specifics, like values(), min(), max(),
verify() and so on can be set individually.
I then create a hash to hold all of the Data::Validators for a particular
data source. The keys of this hash are the names of the variables, and the
values are the Data::Validators themselves. Y.M.M.V.

=head1 ROLE

A Data::Validator exists (almost) solely to create two functions,
validate() and put(). They make it easy to apply complex tests for
'validity' to data. Typically you will set up many of these, one per
variable, once at the start of a program, and you then use them to
validate() and put() each individual item of data.

Data::Validator neither knows nor cares where the data comes from. You just
feed data items to the correct ->validate() and ->put(), one at a time, and
they get checked.

There is no useful way to check the values of a variable depending on the
values of another variable in the same record. This is a different problem,
one which could be approached using Data::Validator as a building block.

=head2 PROBLEM ADDRESSED

A fairly common problem in my work is the following: I get a data file,
which has been created, often using Excel or Access. It is riddled with
errors, because it wasn't checked at all during data entry. (I'm a *very*
good data entry person, and I make about 1 mistake per 100 data items.)
Before I can use it I need to check the actual values in the data file.

Typically my clients don't know exactly what the legitimate values are for
each variable. For example a variable called 'sex' is supposed to be 0 or 1
(female or male), and there are actually 140 '2's in the data set. On
enquiry, it turns out that 2 is the missing value for that variable. (Of
course for other variables in the data set the missing value might be '3',
or '8' or '-' or '*' or just blank.)

I need to check every individual value in every record in a file, against
the values it is supposed to have, and I also often need to change a
variable, so that I can stuff it into a database.
Clearly these two tasks are closely related, and so I wrote a module which
can do both, if you want.

=cut

# use stuff
use strict;
use Carp;

# Package globals
our $VERSION = '0.6';
my $Debugging = 0;

=head1 PUBLIC FUNCTIONS

=head2 new()

The new() function initialises a blank Data::Validator with all of its
contents set explicitly to undef.

C<< my $item = Data::Validator->new(); >>

=cut

# Initiate the Data::Validator
sub new {
    my $proto = shift;
    my $class = ref($proto) || $proto;
    my $self  = {};

    # Documentation only
    $self->{NAME} = undef;       # Name of the variable or whatever, not currently used

    # Used for validation
    $self->{MIN}    = undef;     # Numerically (or alphabetically) smallest value
    $self->{MAX}    = undef;     # Numerically (or alphabetically) largest value
    $self->{VALUES} = undef;     # Reference to an array of all possible values
    $self->{VERIFY} = undef;     # Reference to a function capable of verifying variable e.g. dates
    $self->{LOOKUP} = undef;     # Reference to a DBI statement handle to do lookup on possible values

    # Used for validation and transformation
    $self->{MISSING} = undef;    # Missing value, accepted as a valid value, and transformed to undef in put()

    # Used for transformation only
    $self->{TRANSFORM} = undef;  # Reference to a function capable of transforming variable for output

    bless($self, $class);
    return $self;
} # new()

=head2 zap()

The zap() function re-initialises an existing Data::Validator with all of
its contents reset explicitly back to undef. This is used in some of the
test scripts, but may not have many other uses.

C<< $item->zap(); >>

=cut

sub zap {
    my $self = shift;
    $self->{NAME}      = undef;
    $self->{MIN}       = undef;
    $self->{MAX}       = undef;
    $self->{VALUES}    = undef;
    $self->{VERIFY}    = undef;
    $self->{LOOKUP}    = undef;
    $self->{MISSING}   = undef;
    $self->{TRANSFORM} = undef;
    return $self;
}

=head1 put() and validate()

These two functions are what Data::Validator is meant to create.
validate() checks a scalar to see if it is acceptable.
put() is used to transform a scalar for output.

=head2 validate()

validate() takes a scalar, and tests it, using all of the tests which you
have chosen to put into the particular Data::Validator. It returns a status
code: 1 for success, or a code identifying the first test that failed. At
present:

=over 4

=item *

1 means the item was either ok (passed all tests) *or* the missing value,
in other words, acceptable...

=item *

2 means the item was undefined, in the Perl sense of undefined. Note that
this is usually a programming error, not a data error!

=item *

4 means the item was too small

=item *

8 means the item was too big

=item *

16 means the item was not in the list of approved values given to values()

=item *

32 means the item failed the verify() subroutine

=back

Do B<not> ignore these return codes when using this module. Also, please
tell me if you think 1 = acceptable or missing and 0 = failure would be
better return values.

=cut

sub validate {
    my $self  = shift;
    my $datum = shift;

    # Tests placed in approximate order of cost!

    # It's undefined - complain! It shouldn't be. Checked first so that the
    # string comparison below never sees undef.
    return 2 unless defined($datum);

    # It's missing - return validated, and move on
    return 1 if defined($self->missing()) && $datum eq $self->missing();

    # Too small or too big
    return 4 if defined($self->min()) && $datum < $self->min();
    return 8 if defined($self->max()) && $datum > $self->max();

    # Not in the approved list of values
    if (defined($self->values())) {
        return 16 unless exists $self->values()->{$datum};
    }

    # Not confirmed by verification subroutine
    if (defined($self->verify())) {
        my $coderef = $self->verify();
        return 32 unless $coderef->($datum);
    }

    return 1;    # All is well
}

=head2 put()

put() returns the data value unchanged,

=over 4

=item *

or the data value as transformed by the transform() function you provided,

=item *

or undef, if the data value was the missing() value.


=back

=cut

sub put {
    my $self  = shift;
    my $datum = shift;

    # It's missing
    if (defined($self->missing()) && ($datum eq $self->missing())) {
        return undef;
    }

    # It needs to be transformed, and it's not missing
    if (defined($self->transform())) {
        my $coderef = $self->transform();
        return $coderef->($datum);
    }

    # Just pass it through
    return $datum;
}

=head1 Get and Set functions

Data::Validator implements a policy to decide on the acceptability or
otherwise of a scalar value, and to transform this value for output.

The B<Set> functions allow you to define the policy. These functions
require an argument, and are most likely to be used when creating a
Data::Validator.

The corresponding B<Get> functions are intended for use B<only> within the
Data::Validator, when creating the put() and validate() functions. These
are the no-argument functions.

=head2 name()

name() sets or gets the name of the Data::Validator. I use this just to
remind me, and I usually set it to the name of the variable. This doesn't
get used anywhere else - it's just icing, but it sure makes debugging
easier.

C<< $item->name("Item"); >>

=cut

sub name {
    my $self = shift;
    if (@_) { $self->{NAME} = shift }
    return $self->{NAME};
}

=head2 missing()

missing() gets or sets the missing value for a Data::Validator. This does
matter, because missing values are acceptable to validate(), and because
put() changes missing values to undef. This is used by *both* put() and
validate(). If you don't understand why missing values are *acceptable* you
need to think harder about the problem we're solving here.

Would you like missing() to accept several alternative missing values? Let
me know...

C<< $item->missing(""); >>

=cut

sub missing {
    my $self = shift;
    if (@_) { $self->{MISSING} = shift }
    return $self->{MISSING};
}

=head2 min()/max()

min() and max() get and set the lower and upper limits for a
Data::Validator. These are used by validate() to check whether a value is
greater than or less than a limit.
These could be used for character data, but really make more sense for
numeric values. Note that I don't really understand how min and max work
for character data yet: validate() compares with the numeric operators
< and >, so non-numeric strings are compared as numbers. Note also that
perl may occasionally require you to tell it that a variable is numeric
(try adding 0 to it if this problem arises).

C<< $item->min(-5) >> or C<< $item->max(42) >>

=cut

sub min {
    my $self = shift;
    if (@_) { $self->{MIN} = shift }
    return $self->{MIN};
}

sub max {
    my $self = shift;
    if (@_) { $self->{MAX} = shift }
    return $self->{MAX};
}

=head2 transform()

transform() sets or gets a reference to a subroutine, a reference of type
CODE. This is used by put() to change the value of a variable. This is very
flexible, and has covered all of my needs so far.

C<< $item->transform(\&test) >>

=cut

sub transform {
    my $self = shift;
    if (@_) {
        my $ref = shift;
        if (_ref_check($ref, 'CODE')) {    # Is it a CODEREF??
            $self->{TRANSFORM} = $ref;
            return $self->{TRANSFORM};
        }
    } # if (@_)
    return $self->{TRANSFORM};
}

=head2 verify()

verify() sets or gets a reference to a subroutine, a reference of type
CODE. This is used by validate() to check if a variable complies with
certain rules. This is the most complicated method of testing a value but
it can be very useful in some circumstances. Remember there isn't any
built-in way to use the value of *another* variable from the same record in
this subroutine.

C<< $item->verify(\&test); >>

=cut

sub verify {
    my $self = shift;
    if (@_) {
        my $ref = shift;
        if (_ref_check($ref, 'CODE')) {    # Is it a CODEREF??
            $self->{VERIFY} = $ref;
            return $self->{VERIFY};
        }
    } # if (@_)
    return $self->{VERIFY};
}

=head2 values()

values() sets an array reference containing all of the possible values of a
variable, or gets the lookup hash built from it. This is used by validate()
to check if a variable has one of a list of values. The array reference
gets turned into a hash internally so that I can use exists() on the value.
(exists() also works on arrays, but there it tests an index, not a value,
so it would not help here.)
I chose to initialise this using array references because the syntax is
easy - C<< $item->values([0,1,2,3,4]); >> or C<< $item->values(\@array); >>

=cut

sub values {
    my $self = shift;
    if (@_) {
        my $ref = shift;
        if (_ref_check($ref, 'ARRAY')) {    # Is it an ARRAY reference??
            # Build a lookup hash keyed by value (after Perl Cookbook Recipe 4.6. Thanks!)
            my %hash;
            $hash{$_}++ for @$ref;
            $self->{VALUES} = \%hash;
            return $self->{VALUES};
        }
    } # if (@_)
    return $self->{VALUES};
} # End of subroutine values

=head1 PRIVATE FUNCTIONS

=head2 _ref_check()

_ref_check() is a private subroutine which looks to see if a reference
refers to what you expect. Don't use it.

=cut

sub _ref_check {
    # Called as a plain function, not as a method, so there is no $self argument.
    my ($test, $should_be) = @_;
    my $ref = ref($test);
    unless ($ref eq $should_be) {
        if (length($ref) > 0) {
            carp("\n>> $test isn't a reference to $should_be, but rather a reference to $ref\n");
        }
        else {
            carp("\n>> $test isn't a $should_be reference at all, but a plain scalar\n");
        }
        return 0;
    } # unless ($ref eq $should_be)
    return 1;
} # End of subroutine _ref_check

1;    # Required for all modules

=head1 KNOWN BUGS

min() and max() don't really work for non-numeric values; arguably they
should.

=head1 AUTHOR

Anthony Staines <>

=head1 TO DO

This is an alpha release. I am actively seeking feedback on the user
interface. Please let me know what you think.

The validate and put functions are called a lot - several hundred thousand
times in my applications. The program spends most of its time executing
these. (Confirmed by profiling.) I will implement an eval based version of
these.

Try with your comments

=head1 SEE ALSO

L<perl>.

=head1 COPYRIGHT AND DISCLAIMER

This program is Copyright 2002,1990 by Anthony Staines.
This program is free software; you can redistribute it and/or modify it
under the terms of the Perl Artistic License or the GNU General Public
License as published by the Free Software Foundation; either version 2 of
the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.

If you do not have a copy of the GNU General Public License write to the
Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

=cut
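
=head1 EXAMPLE

The workflow described under USAGE (one Data::Validator per variable, held
in a hash keyed by variable name) might look roughly like the sketch below.
It is an illustration under stated assumptions, not a tested recipe: the
variable names, the sample record and the helper subroutines
C<is_iso_date> and C<trim> are hypothetical, and C<use Data::Validator>
assumes that this module, rather than another of the same name, is the one
found on C<@INC>.

    use strict;
    use Data::Validator;

    # Hypothetical verify() callback: true for a real date in YYYY-MM-DD form.
    sub is_iso_date {
        my $datum = shift;
        return 0 unless defined $datum
            && $datum =~ /\A(\d{4})-(\d{2})-(\d{2})\z/;
        my ($year, $month, $day) = ($1, $2, $3);
        return 0 if $month < 1 || $month > 12;
        my @days_in_month = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
        my $is_leap = ($year % 4 == 0 && $year % 100 != 0) || $year % 400 == 0;
        $days_in_month[1] = 29 if $is_leap;
        return ($day >= 1 && $day <= $days_in_month[$month - 1]) ? 1 : 0;
    }

    # Hypothetical transform() callback: strip leading and trailing whitespace.
    sub trim {
        my $datum = shift;
        $datum =~ s/\A\s+|\s+\z//g;
        return $datum;
    }

    # One Data::Validator per variable; common settings go in the loop.
    my %validator;
    for my $name (qw(age visit_date)) {
        my $item = Data::Validator->new();
        $item->name($name);
        $item->missing('*');
        $validator{$name} = $item;
    }

    # Specifics are set individually.
    $validator{age}->min(0);
    $validator{age}->max(120);
    $validator{visit_date}->verify(\&is_iso_date);
    $validator{visit_date}->transform(\&trim);

    # Check one (made-up) record, variable by variable.
    my %record = (age => 130, visit_date => '2004-06-01');
    for my $name (sort keys %record) {
        my $status = $validator{$name}->validate($record{$name});
        if ($status == 1) {
            my $clean = $validator{$name}->put($record{$name});
            # ... hand $clean to the database layer here
        }
        else {
            warn "$name: validate() returned $status\n";
        }
    }

With this record, C<age> is above max() and so fails with status 8, while
C<visit_date> passes verify() and is handed on through put(). Keeping the
validators in a hash means the per-record loop never needs to know which
tests apply to which variable.

=cut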