PerlMonks  

Efficient way to do field validation

by govindkailas (Acolyte)
on Jul 31, 2013 at 12:04 UTC

govindkailas has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a pipe-delimited file whose columns are in different formats. What would be the best way to validate it?

    e.g. file.txt
    -------------
    int|decimal(5,3)|varchar(5)|date|varchar(8)|decimal(14,3)...

    actual values
    12|11.00|BILL|20130131|asd123q|1234.45|..
    ...
    ...

I am splitting the columns and taking it to variables. Now how should I validate each fields ? The maximum size of each column is given in the brackets; e.g. column 3 is defined as varchar(5), so it should not exceed 5 bytes. If a field fails validation, I need to write a failure notification to another file. The expected output would be something like this:

    int|decimal field not in range|varchar(5)|date|varchar(8)|decimal(14,3)...

Replies are listed 'Best First'.
Re: Efficient way to do field validation
by tobyink (Canon) on Jul 31, 2013 at 12:56 UTC

    How about something like this...

        use Type::Params qw(compile);
        use Types::XSD qw(String Decimal Date Integer);
        use Text::CSV_XS;
        use Data::Dumper;

        my $validator = compile(
            Integer,
            Decimal[totalDigits => 8, fractionDigits => 3],
            String[maxLength => 5],
            Date->plus_coercions(
                Integer[totalDigits => 8],
                q{ substr($_, 0, 4)."-".substr($_, 4, 2)."-".substr($_, 6, 2) }
            ),
            String[maxLength => 8],
            Decimal[totalDigits => 17, fractionDigits => 3],
        );

        my $csv = 'Text::CSV_XS'->new({ sep_char => '|' });
        while (my $row = $csv->getline(\*DATA)) {
            my @fields = $validator->(@$row);
            print Dumper \@fields;
        }

        __DATA__
        12|11.00|BILL|20130131|asd123q|1234.45
        14|12.0|MONKEY|20120228|gkhkg|1.2

    Produces the following output:

        $VAR1 = [
                  '12',
                  '11.00',
                  'BILL',
                  '2013-01-31',
                  'asd123q',
                  '1234.45'
                ];
        Value "MONKEY" did not pass type constraint "String[maxLength=>"5"]" (in $_[2]) at validate-csv.pl line 21.

    If you've got big files, then you're unlikely to find a faster solution than pairing Text::CSV_XS and Type::Params.

Re: Efficient way to do field validation
by ww (Archbishop) on Jul 31, 2013 at 12:42 UTC
    What have you tried? We expect you to tell us what you've been doing, so we can help you learn. This is not code-for-free.com; if you're merely looking for someone to provide code, you may want to see if they're up.

    And you definitely have to tell us why the first decimal value, "11.00", fails your notion of validation -- apparently because the length doesn't match the max digit or decimal digit counts in the spec "decimal(5,3)" -- while the name "BILL" passes but clearly isn't the max of "varchar(5)".

    OTOH, it looks to me as though you can solve most of the rest of your problem by reading perldoc -f length and/or perldoc perlre, with specific reference to quantifiers.
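
    As a rough sketch of what those two pointers lead to: length() covers the varchar limit, and a regex with {min,max} quantifiers covers the decimal one. The sub names are made up for illustration, and reading decimal(P,S) as "at most P digits in total, at most S of them after the point" is an assumption, since the OP never defines it.

```perl
use strict;
use warnings;

# Hypothetical helpers, named here for illustration only.

# varchar(N): at most N characters. length() counts characters, which is
# the same as bytes as long as the input has not been decoded.
sub varchar_ok {
    my ($value, $max) = @_;
    return length($value) <= $max ? 1 : 0;
}

# decimal(P,S): assumed to mean at most P digits in total, at most S of
# them after the decimal point. Requires P > S.
sub decimal_ok {
    my ($value, $precision, $scale) = @_;
    my $int_max = $precision - $scale;
    # {1,N} and {0,N} are the quantifiers described in perlre.
    return $value =~ /\A[+-]?\d{1,$int_max}(?:\.\d{0,$scale})?\z/ ? 1 : 0;
}

print 'BILL:   ', (varchar_ok('BILL', 5)     ? 'ok' : 'too long'),     "\n";
print 'MONKEY: ', (varchar_ok('MONKEY', 5)   ? 'ok' : 'too long'),     "\n";
print '11.00:  ', (decimal_ok('11.00', 5, 3) ? 'ok' : 'not in range'), "\n";
```

    Under that reading "11.00" passes decimal(5,3), which is part of why the failure criteria need spelling out.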

    My apologies to all those electrons which were inconvenienced by the creation of this post.
      I am not here to ask how to validate the fields, nor am I expecting code-for-free. As I mentioned in the original post I am selecting and validating each field using appropriate regex. What I am looking for is a better method to do the validation, something similar to a C++ class definition. Can we have a hash of regexes keyed by field type, and check whether each value matches?
        Yes.

        Update (after keeping my peace long enough to reach a slow burn):

        What you said about validating in the OP was "Now how should I validate each fields ?" which I don't read as congruent with "(a)s I mentioned in the original post I am selecting and validating each field using appropriate regex" as you're now asserting.

        Yes, you stated that you were splitting the record into "columns and taking it to variables." -- again, a statement at some remove from your new version.

        So yes, I'm taking offense at your reply, as you did at my reply -- an attempt to point out two obvious ways to do some form of validation (and a request that you provide your criteria for determining if an entry is valid). It's not, IMO, a gracious response to an attempt to help with what appeared to be a noob question.... posed in the manner of someone who hasn't read On asking for help and How do I post a question effectively?. (Without your code, it's hard to guess if one can provide a more efficient way to do field validation.)

Re: Efficient way to do field validation
by Laurent_R (Canon) on Jul 31, 2013 at 22:37 UTC

    You could build a hash of regexes, something like this (the regexes are just given as quick simplistic examples, I haven't thought very carefully about them).

        my %validate = (
            INT => qr/^[+-]?\d+$/,
            DEC => qr/^[+-]?\d*\.?\d*$/,
            #etc.
        );

    and then use it to validate your individual fields.
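
    A minimal sketch of that second step, under the assumption that you keep a list naming the type of each column (the @column_types list and the failed_columns sub are invented for this example):

```perl
use strict;
use warnings;

# The same simplistic regexes as in the hash above.
my %validate = (
    INT => qr/^[+-]?\d+$/,
    DEC => qr/^[+-]?\d*\.?\d*$/,
);

# Assumed layout for this example: column 0 is INT, column 1 is DEC.
my @column_types = qw(INT DEC);

# Return the 0-based indexes of the fields that fail their regex.
sub failed_columns {
    my ($line) = @_;
    my @fields = split /\|/, $line, -1;    # -1 keeps trailing empty fields
    return grep {
        ($fields[$_] // '') !~ $validate{ $column_types[$_] }
    } 0 .. $#column_types;
}

for my $line ('12|11.00', 'x2|11.00', '12|1.2.3') {
    my @bad = failed_columns($line);
    print "$line: ", (@bad ? "failed column(s) @bad" : 'ok'), "\n";
}
```

    Note that the DEC regex as written also accepts an empty string, so a missing field would slip through; tighten it before relying on it.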

    You might actually take it one step further and build a dispatch table, something like this:

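        my %actions = (
            INT          => sub { return 1 if $_[0] =~ /^[+-]?\d+$/ },
            DEC          => sub { return 1 if $_[0] =~ /^[+-]?\d*\.?\d*$/ },
            # The key must be quoted because of the parentheses, and the
            # reference is taken without calling the sub.
            'VARCHAR(5)' => \&validate_varchar_5,
            #etc.
        );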

    This is a very rough untested example, I just want to convey the general idea.
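
    To push the idea a little further, here is a sketch of a dispatch table whose entries build the validator from the column spec, so that "varchar(5)" and "varchar(8)" share one entry. Everything in it (the %factory table, build_validators, report_line, and the validation rules themselves) is an assumption made for illustration; the message format only imitates the expected output in the OP.

```perl
use strict;
use warnings;

# Dispatch table keyed by type name. Each entry takes the arguments from
# the spec (e.g. 5 and 3 for "decimal(5,3)") and returns a validator sub.
# The rules are assumptions, not the OP's definitions.
my %factory = (
    int     => sub { sub { $_[0] =~ /\A[+-]?\d+\z/ } },
    varchar => sub { my ($max) = @_; sub { length($_[0]) <= $max } },
    date    => sub { sub { $_[0] =~ /\A\d{8}\z/ } },    # YYYYMMDD shape only
    decimal => sub {
        my ($precision, $scale) = @_;
        my $int_max = $precision - $scale;
        sub { $_[0] =~ /\A[+-]?\d{1,$int_max}(?:\.\d{0,$scale})?\z/ };
    },
);

# Turn "int|decimal(5,3)|..." into a list of [spec, validator] pairs.
sub build_validators {
    my ($spec_line) = @_;
    return map {
        my ($name, $args) = /\A(\w+)(?:\(([\d,]+)\))?\z/
            or die "Unrecognised column spec '$_'\n";
        my $make = $factory{$name} or die "No validator for type '$name'\n";
        [ $_, $make->(split /,/, $args // '') ];
    } split /\|/, $spec_line;
}

# Echo the spec for fields that pass and a message for fields that fail.
sub report_line {
    my ($validators, $data_line) = @_;
    my @fields = split /\|/, $data_line, -1;
    return join '|', map {
        my ($spec, $check) = @{ $validators->[$_] };
        my ($name) = $spec =~ /\A(\w+)/;
        $check->($fields[$_] // '') ? $spec : "$name field not in range";
    } 0 .. $#$validators;
}

my @validators = build_validators('int|decimal(5,3)|varchar(5)|date');
print report_line(\@validators, '12|11.00|BILL|20130131'),     "\n";
print report_line(\@validators, '12|1234.45|MONKEY|20130131'), "\n";
```

    The validators are built once per file rather than once per line, which matters if the files are large.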

    You haven't said enough about what you have done so far for us to tell whether these techniques will be beneficial.

      Thanks a lot, I was thinking about something similar to this. This made things clear to me.
Re: Efficient way to do field validation
by zork42 (Monk) on Jul 31, 2013 at 12:10 UTC
    You need to define what you mean by "validate" before anyone can help :)
    Also some more example input and expected output would help.


    UPDATE: more detail was added to OP after I wrote this
Re: Efficient way to do field validation
by sundialsvc4 (Abbot) on Aug 01, 2013 at 01:14 UTC

    When faced with tasks like this, I usually write a subroutine which, given a string (or whatever the input may be), is tasked with returning either “falsehood,” or, if any error is encountered, an appropriate error-message string.

    I normally wrap the entire body of such a function in an eval {} block, which will trap any errors that may occur.   If an exception is thrown (via die or otherwise), the content of that exception string is returned; if not, falsehood.   I also often define a $doing_what variable that I set to appropriate strings as I run through the subroutine from top to bottom.   This value can be used to augment the messages.

    And then... what can I say... you just go for it.   split() the string into an array, then check the number of entries in the array:   die() if the count is wrong.   Then, on to the next test.   And you simply run through them, one after another after another.
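
    A minimal sketch of that pattern, assuming a made-up three-column layout (int, varchar(5), YYYYMMDD date); the sub name record_error is hypothetical:

```perl
use strict;
use warnings;

# Returns a false value when the record is valid, otherwise an error string.
# The three-column layout is invented for this example.
sub record_error {
    my ($line) = @_;
    my $doing_what = 'starting';

    my $ok = eval {
        $doing_what = 'splitting the record';
        my @fields = split /\|/, $line, -1;
        die 'expected 3 fields, got ' . scalar(@fields) . "\n"
            unless @fields == 3;

        $doing_what = 'checking column 1 (int)';
        die "'$fields[0]' is not an integer\n"
            unless $fields[0] =~ /\A[+-]?\d+\z/;

        $doing_what = 'checking column 2 (varchar(5))';
        die "'$fields[1]' is longer than 5 characters\n"
            unless length($fields[1]) <= 5;

        $doing_what = 'checking column 3 (date)';
        die "'$fields[2]' is not in YYYYMMDD form\n"
            unless $fields[2] =~ /\A\d{8}\z/;

        1;    # reached only when every check passed
    };

    return '' if $ok;
    my $error = $@ || "unknown error\n";
    chomp $error;
    return "While $doing_what: $error";
}

for my $line ('12|BILL|20130131', '14|MONKEY|20120228', '12|BILL') {
    my $error = record_error($line);
    print "$line => ", ($error || 'valid'), "\n";
}
```

    Ending the eval block with a true value and testing the result of eval, rather than testing $@ alone, is the usual way to avoid mistaking an empty $@ for success.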

    Now, one more thing:   welcome to the world of Test::More and Test::Exception!   You must not assume that your validation routine is, indeed, correct.   You need to write a very comprehensive test-suite that throws everything but Lincoln’s Gettysburg Address(*) at it.   This test suite should verify that the routine traps every error that it is supposed to, and that it validates every good string that it is supposed to.   This is a complex but vitally important routine, and you need to test it rigorously.
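
    For instance, a test file might start out something like the sketch below. It uses only Test::More so that it runs on a stock Perl; Test::Exception would add checks such as throws_ok for routines that die instead of returning a message. The varchar_error sub is a stand-in written for this example, not anyone's real validator.

```perl
use strict;
use warnings;
use Test::More;

# Stand-in validator so the tests have something to run against;
# a real suite would load the actual routine instead.
sub varchar_error {
    my ($value, $max) = @_;
    return 'value is undefined' unless defined $value;
    return length($value) <= $max ? '' : "longer than $max characters";
}

# Values that must be accepted, including the boundary cases.
for my $good ('', 'A', 'BILL', 'ABCDE') {
    is(varchar_error($good, 5), '', "accepts '$good'");
}

# Values that must be rejected, each with a message.
for my $bad ('MONKEY', 'ABCDEF', ' ' x 6) {
    like(varchar_error($bad, 5), qr/longer than 5/, "rejects '$bad'");
}

is(varchar_error(undef, 5), 'value is undefined', 'rejects undef');

done_testing();
```

    The boundary cases (empty string, exactly five characters, six characters) are where a validator is most likely to be off by one, so they are worth listing explicitly.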

    (*) Yes, there’s a story here .. apocryphal or otherwise I don’t know.   Legend has it that an early “error-correcting” COBOL(?) compiler, when given a copy of the aforesaid document, “compiled it” with no errors.
