package Data::Validator;
=head1 NAME
Data::Validator - Factory class to validate data items
=head1 DESCRIPTION
This is an attempt to create an object which will permit semi-automatic verification of a data value.
=head1 SYNOPSIS
use Data::Validator;
my $item = Data::Validator->new(); #Create a new Data::Validator, called $item.
#Set values
$item->name('fred');
$item->values([1,2,3]); # or $item->values(\@array); an array reference is required
$item->missing('*'); # or $item->missing(''); undef is unlikely to be sensible!
$item->min(0); $item->max(100);
$item->verify($reference_to_subroutine); #Used in the $item->validate() function
$item->transform($reference_to_subroutine); #Used in the $item->put() function
#Get values
my $name = $item->name();
my $values = $item->values(); #Reference to a hash keyed by the permitted values
my $missing = $item->missing();
etc...
#Use it..
$item->validate($datum); #Returns 1 for success, 2,4,8,16 or 32 for different failures
$item->put($datum); #Returns the value, the transformed value, or undef if it was missing
=head1 USAGE
Many people work with data organised as records, each containing
(potentially many) variables. It is often necessary to process files
of such records, and to test every variable within every record to ensure that
each one is valid. I do this before putting data from very large flat files into my databases.
For each variable I had a need to define specific, sometimes complex rules for validity,
then implement them, and check them. This is what Data::Validator is for.
Note carefully that Data::Validators handle only one scalar value at a time. This
value could come from a file, a database, an array, a hash or your granny's parrot.
Data::Validators don't care.
I use Data::Validator as follows. I create one for every named variable in my
data file. In many real applications most of this setup can be done by looping
over a list of variable names, creating many Data::Validators each named for
the corresponding variable. Common features, like missing values, and names
can be set in this loop. Specifics, like values(), min(), max(), verify() and so on
can be set individually. I then create a hash to hold all of the Data::Validators for
a particular data source. The keys of this hash are the names of the variables,
and the values are the Data::Validators themselves.
Y.M.M.V.
=head1 ROLE
A Data::Validator exists (almost) solely to create two functions - validate() and put().
They make it easy to apply complex tests for 'validity' to data.
Typically you will set up many of these, one per variable, once at the start
of a program, and you then use them to validate() and put() each individual item of data.
Data::Validator neither knows nor cares where the data comes from; you just feed data
items to the correct ->validate() and ->put() one at a time, and they get checked.
There is no useful way to check the values of a variable depending on the values
of another variable in the same record. This is a different problem, one which could
be approached using Data::Validator as a building block.
=head2 PROBLEM ADDRESSED
A fairly common problem in my work is the following:
I get a data file, which has been created, often using Excel or Access. It is
riddled with errors, because it wasn't checked at all during data
entry. (I'm a *very* good data entry person, and I make about
1 mistake per 100 data items.)
Before I can use it I need to check the actual values in the data file.
Typically my clients don't know exactly what the legitimate values are for
each variable. For example a variable called 'sex' is supposed to be 0 or 1,
(female or male) and there are actually 140 '2's in the data set. On enquiry,
it turns out that 2 is the missing value for that variable. (Of course for
other variables in the data set the missing value might be '3', or '8' or
'-' or '*' or just blank).
I need to check every individual value in every record in a file,
against the values it is supposed to have, and I also often need to
change a variable, so that I can stuff it into a database. Clearly these two
tasks are closely related, and so I wrote a module which can do both,
if you want.
=cut
#use stuff
use strict;
use Carp;
#Package globals
our $VERSION = '0.6';
my $Debugging = 0;
=head1 PUBLIC FUNCTIONS
=head2 new()
The new() function initialises a blank Data::Validator with all of its contents set
explicitly to undef.
C<< my $item = Data::Validator->new(); >>
=cut
#Initiate the Data::Validator
sub new {
my $proto = shift;
my $class = ref($proto) || $proto;
my $self = {};
#Documentation only
$self->{NAME} = undef; # Name of the variable or whatever, not currently used
#Used for validation
$self->{MIN} = undef; # Numerically (or alphabetically) smallest value
$self->{MAX} = undef; # Numerically (or alphabetically) largest value
$self->{VALUES} = undef; # Reference to an array of all possible values
$self->{VERIFY} = undef; # Reference to a function capable of verifying variable e.g. dates
$self->{LOOKUP} = undef; # Reference to a DBI Statement handle to do lookup on possible values
#Used for validation and transformation
$self->{MISSING} = undef; # Missing value, accepted as a valid value, and transformed to undef in put()
#Used for transformation only
$self->{TRANSFORM}= undef; # Reference to a function capable of transforming variable for output
bless ($self, $class);
return $self;
}# New()
=head2 zap()
The zap() function re-initialises an existing Data::Validator with all of its contents reset
explicitly back to undef. This is used in some of the test scripts, but may not have many other uses.
C<< $item->zap(); >>
=cut
sub zap {
my $self = shift;
$self->{NAME} = undef;
$self->{MIN} = undef;
$self->{MAX} = undef;
$self->{VALUES} = undef;
$self->{VERIFY} = undef;
$self->{LOOKUP} = undef;
$self->{MISSING} = undef;
$self->{TRANSFORM}= undef;
return $self;
}
=head1 put() and validate()
These two functions are what Data::Validator is meant to create.
validate() checks a scalar to see if it is acceptable.
put() is used to transform a scalar for output.
=head2 validate()
validate() takes a scalar and tests it, using all of the tests which you have
chosen to put into the particular Data::Validator. It returns a status code: 1 for success,
or the code of the first test that failed. The codes are presently
=over 4
=item *
1 means the item was either ok (passed all tests) *or* the missing value, in other words, acceptable...
=item *
2 means the item was undefined, in the Perl sense of undefined.
Note that this is usually a programming error, not a data error!
=item *
4 means the item was too small
=item *
8 means the item was too big
=item *
16 means the item was not in the list of approved values given to values()
=item *
32 means the item failed the verify() subroutine
=back
Do B<not> ignore these return codes when using this module.
Also, please tell me if you think 1 = acceptable or missing and 0 = failure would be better return values.
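One way to act on these codes is to map them to messages. The sketch below is only an illustration: the %status_text table and the describe_status() helper are hypothetical names, not part of Data::Validator. Only the numeric codes come from the list above.

```perl
use strict;
use warnings;

# Hypothetical lookup table; the codes are the ones validate() documents.
my %status_text = (
    1  => 'ok or missing value',
    2  => 'undefined value',
    4  => 'below min()',
    8  => 'above max()',
    16 => 'not in values()',
    32 => 'rejected by verify()',
);

# Hypothetical helper: turn a validate() return code into a message.
sub describe_status {
    my ($code) = @_;
    return exists $status_text{$code}
        ? $status_text{$code}
        : "unknown status $code";
}

# In real use the code would come from something like:
#   my $status = $item->validate($datum);
my $message = describe_status(8);
print "$message\n";
```

Keeping the messages in one table means a new status code only has to be described in one place.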
=cut
sub validate {
my $self = shift;
my $datum = shift;
#Tests placed in approximate order of cost!
unless (defined($datum)) {return 2};
#It's undefined - complain! It shouldn't be. This is checked first so that
#undef is never compared with (and mistaken for) an empty missing() value.
if (defined($self->missing()) && ($datum eq $self->missing())) {return 1;};
#It's missing - return validated, and move on
if (defined($self->min()) && ($datum < $self->min())) {return 4;};
if (defined($self->max()) && ($datum > $self->max())) {return 8;};
#Too big or too small
if (defined($self->values())) {
#Look the value up through the stored hash reference, without copying the hash
unless (exists $self->values()->{$datum}) {return 16;};
};
# Not in the approved list of values
if (defined($self->verify())) {
my $coderef = $self->verify();
unless ($coderef->($datum)) {return 32};
};
#Not confirmed by verification subroutine
return 1;
# All is well
}
=head2 put()
put() returns one of the following:
=over 4
=item *
undef, if the data value was the missing() value,
=item *
the data value as transformed by the transform() function provided by you, if there is one,
=item *
or otherwise the data value, unchanged.
=back
=cut
sub put {
my $self = shift;
my $datum = shift;
if (defined($self->missing()) && ($datum eq $self->missing())) {return undef;};
# It's missing
if (defined($self->transform())) {
# It needs to be transformed, and it's not missing
my $coderef = $self->transform();
$datum =&$coderef($datum);
return $datum;
}
#Just pass it through
return $datum;
}
=head1 Get and Set functions
Data::Validator implements a policy to decide on the acceptability or otherwise
of a scalar value, and to transform this value for output. The B<Set> functions
allow you to define the policy. These functions require an argument. These
functions are most likely to be used when creating a Data::Validator.
The corresponding B<Get> functions are intended for use B<only> within the
Data::Validator, when creating the put() and validate() functions. These are the
no argument functions.
=head2 name()
name() sets or gets the name of the Data::Validator - I use this just to remind me, and
I usually set it to the name of the variable. This doesn't get used anywhere else - it's just
icing, but it sure makes debugging easier.
C<< $item->name("Item"); >>
=cut
sub name {
my $self = shift;
if (@_) { $self->{NAME} = shift }
return $self->{NAME};
}
=head2 missing()
missing() gets or sets the missing value for a Data::Validator. This does matter, because
missing values are acceptable to validate(), and because put() changes missing values to undef.
This is used by *both* put() and validate(). If you don't understand why missing values are
*acceptable* you need to think harder about the problem we're solving here.
Would you like missing() to accept several alternative missing values? Let me know...
C<< $item->missing(""); >>
=cut
sub missing {
my $self = shift;
if (@_) { $self->{MISSING} = shift }
return $self->{MISSING};
}
=head2 min()/max()
min() and max() get and set the lower and upper limits for a Data::Validator. These are
used by validate() to check whether a value falls below the lower limit or above the upper one.
validate() compares with the numeric operators < and >, so these limits really only make sense
for numeric values; a string that does not start with a number is treated as 0. Note also that perl
may occasionally require you to tell it that a variable is numeric (try adding 0 to it if this
problem arises).
C<< $item->min(-5) >>
or
C<< $item->max(42) >>
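The difference between numeric and string comparison is easy to see in a small standalone sketch. Nothing here calls Data::Validator; looks_like_number() comes from the core Scalar::Util module, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Numeric comparison, the kind used for min() and max(): 10 is not less than 9.
my $numeric_less = ('10' < '9') ? 1 : 0;

# String comparison: "10" sorts before "9" because "1" comes before "9".
my $string_less = ('10' lt '9') ? 1 : 0;

# A guard that could be applied before trusting min()/max() on a raw field.
my $raw        = '007';
my $is_numeric = looks_like_number($raw) ? 1 : 0;

# Adding 0 forces the value to a number.
my $forced = $raw + 0;

print "$numeric_less $string_less $is_numeric $forced\n";   # prints 0 1 1 7
```

A check like looks_like_number() could also live in a verify() subroutine, so that non-numeric values are rejected rather than silently compared as 0.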
=cut
sub min {
my $self = shift;
if (@_) { $self->{MIN} = shift }
return $self->{MIN};
}
sub max {
my $self = shift;
if (@_) { $self->{MAX} = shift }
return $self->{MAX};
}
=head2 transform()
transform() sets or gets a reference to a subroutine, a reference of type CODE. This
is used by put() to change the value of a variable. This is very flexible, and has covered
all of my needs so far.
C<< $item->transform(\&test) >>
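As an illustration, a transform subroutine might tidy a free-text code before it goes into a database. This is a sketch only: tidy_code() is a made-up name, and the commented call shows how it could be attached to a Data::Validator.

```perl
use strict;
use warnings;

# Hypothetical transform: strip surrounding whitespace and upper-case the value.
# It takes one scalar and returns the transformed scalar.
sub tidy_code {
    my ($value) = @_;
    $value =~ s/\A\s+//;   # remove leading whitespace
    $value =~ s/\s+\z//;   # remove trailing whitespace
    return uc $value;
}

# Attach it with: $item->transform(\&tidy_code);
my $tidy = tidy_code("  ab12 \n");
print "$tidy\n";   # prints AB12
```

Because put() returns undef for the missing() value before the transform is consulted, a subroutine like this does not need to handle the missing value itself.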
=cut
sub transform {
my $self = shift;
if (@_) {
my $ref = shift;
if (_ref_check($ref,'CODE')) { # Is it a CODEREF??
$self->{TRANSFORM} = $ref;
return $self->{TRANSFORM};
}
} # if(@_)
return $self->{TRANSFORM};
}
=head2 verify()
verify() sets or gets a reference to a subroutine, a reference of type CODE. This is
used by validate() to check if a variable complies with certain rules. This is the most
complicated method of testing a value but it can be very useful in some circumstances.
Remember there isn't any built in way to use the value of *another* variable from the
same record in this subroutine.
C<< $item->verify(\&test); >>
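A verify subroutine receives the value as its only argument and should return true if the value is acceptable. The sketch below checks for a calendar date written as YYYY-MM-DD. The name is_iso_date() and the date format are assumptions chosen for the example, not something Data::Validator provides.

```perl
use strict;
use warnings;

# Hypothetical verifier: returns 1 if $value is a real YYYY-MM-DD date, else 0.
sub is_iso_date {
    my ($value) = @_;
    return 0 unless defined $value;

    # Pull out year, month and day; give up if the shape is wrong.
    my ($y, $m, $d) = $value =~ /\A(\d{4})-(\d{2})-(\d{2})\z/
        or return 0;
    return 0 if $m < 1 || $m > 12;

    # Days per month, with February extended in leap years.
    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    my $leap = ($y % 4 == 0 && $y % 100 != 0) || $y % 400 == 0;
    my $max  = $days[$m - 1] + (($m == 2 && $leap) ? 1 : 0);

    return ($d >= 1 && $d <= $max) ? 1 : 0;
}

# Attach it with: $item->verify(\&is_iso_date);
my @checks = map { is_iso_date($_) } ('2024-02-29', '2023-02-29', 'tomorrow');
print "@checks\n";   # prints 1 0 0
```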
=cut
sub verify {
my $self = shift;
if (@_) {
my $ref = shift;
if (_ref_check($ref,'CODE')) { # Is it a CODEREF??
$self->{VERIFY} = $ref;
return $self->{VERIFY};
}
} # if(@_)
return $self->{VERIFY};
}
=head2 values()
values() sets or gets the list of all possible values of a variable. The list is set from an array reference.
This is used by validate() to check if a variable has one of a list of values. The array reference gets
turned into a hash internally so that I can use exists(), and the getter returns a reference to that hash.
I chose to initialise this using array references because the syntax is easy -
C<< $item->values([0,1,2,3,4]); >>
or
C<< $item->values(\@array); >>
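The array-to-hash conversion can be sketched on its own. The names below (@allowed, %is_allowed and in_list()) are invented for this example and nothing here calls Data::Validator. Note that hash keys are strings, so the string '3.0' does not match an allowed value of 3.

```perl
use strict;
use warnings;

my @allowed = (0, 1, 2, 3, 4);

# Build a lookup hash whose keys are the allowed values.
my %is_allowed = map { $_ => 1 } @allowed;

# True if $value is one of the allowed values (compared as a string key).
sub in_list {
    my ($value) = @_;
    return exists $is_allowed{$value} ? 1 : 0;
}

my @results = (in_list(3), in_list(7), in_list('3.0'));
print "@results\n";   # prints 1 0 0
```

A hash lookup costs about the same however long the list is, which matters when the check runs once per data item.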
=cut
sub values {
my $self = shift;
if (@_) {
my $ref = shift;
if (_ref_check($ref,'ARRAY')) { # Is it an ARRAY reference??
my %hash;
$hash{$_}++ for @$ref; #Only the keys matter. After Perl Cookbook Recipe 4.6 Thanks!
$self->{VALUES} = \%hash;
return $self->{VALUES};
}
} # if(@_)
return $self->{VALUES};
} #End of subroutine values
=head1 PRIVATE FUNCTIONS
=head2 _ref_check()
_ref_check() is a private subroutine which looks to see if a reference refers to what you expect. Don't
use it.
=cut
sub _ref_check {
my ($test,$should_be) = @_;
#This is called as a plain function, not as a method, so there is no $self argument.
my $ref = ref($test);
unless ($ref eq $should_be) {
if (length($ref) > 0) {
carp ("\n>> $test isn't a reference to $should_be, but rather a reference to ".$ref."\n")
}
else
{
carp ("\n>> $test isn't a $should_be reference at all, but a SCALAR\n")
}# if (length($ref) > 0)
return 0;
} # unless ($ref eq $should_be)
return 1;
} #End of subroutine _ref_check
return 1; #Required for all modules
=head1 KNOWN BUGS
min() and max() don't really work for non-numeric values; arguably they should.
=head1 AUTHOR
Anthony Staines <Anthony.Staines@ucd.ie>
=head1 TO DO
This is an alpha release. I am actively seeking feedback on the user interface.
Please let me know what you think.
The validate and put functions are called a lot - several hundred thousand times
in my applications. The program spends most of its time executing these. (Confirmed
by profiling). I will implement an eval based version of these.
Try anthony.staines@ucd.ie with your comments
=head1 SEE ALSO
L<perl>.
=head1 COPYRIGHT AND DISCLAIMER
This program is Copyright 2002,1990 by Anthony Staines. This program is free software;
you can redistribute it and/or modify it under the terms of the Perl Artistic License or the
GNU General Public License as published by the Free Software Foundation; either
version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.
If you do not have a copy of the GNU General Public License write to the Free Software Foundation, Inc.,
675 Mass Ave, Cambridge, MA 02139, USA.
=cut