PerlMonks  

Data Validation Tests

by Flame (Deacon)
on Jan 25, 2003 at 22:51 UTC

Reference: New Module Consideration?

Well, I've decided to go through with my idea, so now I'm wondering what types of tests the community would find most useful. The current list of tests being considered:

  • Credit Card - verify that it LOOKS like one
  • Date - DD/MM/YYYY etc... perhaps check that it's possible
  • E-Mail - basic syntax check, might attempt to use Email::Valid
  • INT - integer vs float
  • IP - could it be an IP?
  • Time - obvious...
  • URI - parseable by URI
  • Year - similar to int probably
  • HCOLOR - a valid HEX color def
  • HTML - has valid *looking* syntax
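
Two of the simpler entries in the list above can be sketched with plain regexes. This is only an illustration: the helper names are made up here, and the exact rules (an optional sign for INT, an optional '#' and 3 or 6 hex digits for HCOLOR) are assumptions, not the module's actual behaviour.

```perl
use strict;
use warnings;

# Hypothetical helpers; names and exact rules are assumptions,
# not part of the module discussed in this thread.

# INT: optional sign followed by digits only, so "1.5" is rejected.
sub looks_like_int {
    my ($value) = @_;
    return defined $value && $value =~ /\A[+-]?\d+\z/ ? 1 : 0;
}

# HCOLOR: optional '#', then exactly 3 or 6 hex digits.
sub looks_like_hex_color {
    my ($value) = @_;
    return defined $value
        && $value =~ /\A#?(?:[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})\z/ ? 1 : 0;
}
```

Anchoring with \A and \z rather than ^ and $ keeps a trailing newline from slipping through.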


Also, as requested:
  • Sum = specified #
  • Sum < specified #
  • Sum > specified #
  • Sum <= specified #
  • Sum >= specified #
  • Sum != specified #


Any requests?



My code doesn't have bugs, it just develops random features.

Flame ~ Lead Programmer: GMS | GMS

Re: Data Validation Tests
by dempa (Friar) on Jan 25, 2003 at 23:29 UTC
Re: Data Validation Tests
by Aristotle (Chancellor) on Jan 26, 2003 at 00:14 UTC
    To parse an IP, simply pass it to Socket::inet_aton().. and the majority of your other tests should go through the very reliable Regexp::Common.
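
The Socket suggestion above might look roughly like the sketch below. The helper name is hypothetical, and the regex pre-check is an addition made here on the assumption that only dotted-quad input should pass: inet_aton() also resolves hostnames and, on many platforms, accepts abbreviated forms. Regexp::Common's $RE{net}{IPv4} pattern is the other route mentioned, but it is left out to keep the example core-only.

```perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Hypothetical helper. The shape is checked first because inet_aton()
# is also willing to look up hostnames; inet_aton() then rejects
# anything it cannot turn into a packed address.
sub looks_like_ipv4 {
    my ($value) = @_;
    return 0 unless defined $value;
    return 0 unless $value =~ /\A\d{1,3}(?:\.\d{1,3}){3}\z/;
    return defined(inet_aton($value)) ? 1 : 0;
}
```

Note that a string of the right shape that is not a literal address may still trigger a name lookup inside inet_aton() before it gives up.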

    Makeshifts last the longest.

Re: Data Validation Tests
by IlyaM (Parson) on Jan 26, 2003 at 09:23 UTC
    Make API open so anybody can add new test types with plugin modules without modifying source code of your module.

    BTW have you looked at Data::Verify? It seems to be a similar project. Have you considered cooperation?

    --
    Ilya Martynov, ilya@iponweb.net
    CTO IPonWEB (UK) Ltd
    Quality Perl Programming and Unix Support UK managed @ offshore prices - http://www.iponweb.net
    Personal website - http://martynov.org

      An open API is already in the design. Just load the module and refer to it as a 'custom' test. I already have a test module (count and compare) that uses that mechanism because it needed to be able to curry the function.

      As for Data::Verify, I have looked at it, but I'm not sure if it's really practical to try to combine them directly. It's still on my "To Be Considered" list.
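
A curried "count and compare" test of the kind described above could be built as a closure. This is a minimal sketch, not the actual test module: make_sum_test and the %COMPARE table are hypothetical names, and the operators are the six Sum comparisons from the original list.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical comparison table, one entry per "Sum <op> #" test.
my %COMPARE = (
    '==' => sub { $_[0] == $_[1] },
    '!=' => sub { $_[0] != $_[1] },
    '<'  => sub { $_[0] <  $_[1] },
    '>'  => sub { $_[0] >  $_[1] },
    '<=' => sub { $_[0] <= $_[1] },
    '>=' => sub { $_[0] >= $_[1] },
);

# Returns a coderef with the operator and target already bound (curried).
# The returned test sums its arguments and compares the total.
sub make_sum_test {
    my ($op, $target) = @_;
    my $compare = $COMPARE{$op} or die "Unknown operator: $op";
    return sub {
        my $total = sum(0, @_);    # leading 0 keeps an empty list defined
        return $compare->($total, $target) ? 1 : 0;
    };
}

my $sum_at_most_100 = make_sum_test('<=', 100);
```

Because the result is an ordinary coderef that returns true or false, it can be handed to anything that accepts a custom test.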



      My code doesn't have bugs, it just develops random features.

      Flame ~ Lead Programmer: GMS | GMS

        Use "Data::Type" instead (was Data::Verify).
        
         - Now "quite" stable api (alpha).
         - 90% of your requested "value types"
         - Documentation extended.
         - Added more tests.
        
        I am always happy about new ideas and contributions via http://www.sf.net/projects/datatype or directly to me. Greetings, Murat (murat.uenalan@cpan.org)
Re: Data Validation Tests
by Aragorn (Curate) on Jan 26, 2003 at 11:42 UTC
    Maybe it's just me, but I like to have validation routines that go with the modules which actually deal with the data type at hand. For example, Business::CreditCard has a validate function. The documentation of URI shows the "official" regex to match a URI. Date::Calc functions will return an error when fed an invalid date.

    The terms strong cohesion and loose coupling spring to mind. Functions for validating data should be part of the module processing that data. Having a separate module with all kinds of unrelated validation routines increases the risk of (possibly) not keeping up with changes in the format, and I don't really need credit card number validation routines in my logfile processing script which extracts IP addresses.

    Arjen

      While your concern is valid, the idea is to have a way to summarize all validation in a declaration. The benefit is that you avoid manually writing the validation logistics, which is incredibly boring work. Something like
          my @_id = $cgi->params('id');
          error_exit("You must select at least one ID.") unless @_id;
          my @id;
          for (@_id) {
              my ($match) = /\A(\d{5,9}[ABC])\z/;
              error_exit("Invalid ID: $_") unless defined $match;
              push @id, $match;
          }
      Even if you replace the regex by ID::validate(), the whole thing remains rather clunky - esp if you imagine that you have to write such a snippet for 25 different parameters: pure drone work. It is much easier to use a Data::FormValidatoresque declaration like
          my $validator = Data::FormValidator->new({
              delete_items_page => {
                  required    => [qw(id .. ..)],
                  constraints => {
                      id => '/\A(\d{5,9}[ABC])\z/',
                      # ..
                  },
              },
          });

      While this doesn't compare favourably so far, adding more parameters to validate would be trivial and quick with the latter code. Adding another two dozen checks is easy and doesn't result in an unmanageable amount of code.

      I do agree that the validator should not contain its own actual validation routines. That's why I don't like Data::FormValidator as is; I would prefer there being plugin modules that integrate other modules' own supplied validation routines - much the way File::Find::Rule works.

      Makeshifts last the longest.

        Well then you should be glad to hear that the current design does not have any 'built in' routines. It has some default checks, held in a separate module that comes with the main verification package, but it is very easy to include checks from other places. Basically anything that returns true on success and false on failure can be used, via the 'custom' test declaration (which accepts coderefs and regexes).
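
The "anything that returns true or false" idea, with checks given as coderefs or regexes, can be sketched as a small dispatcher plus a declaration table. Everything here is hypothetical (run_check, %rules, failed_fields); it illustrates the declarative style under discussion and is not the interface of the module or of Data::FormValidator.

```perl
use strict;
use warnings;

# Hypothetical dispatcher: a check is either a coderef or a qr// regex.
sub run_check {
    my ($check, $value) = @_;
    my $type = ref $check;
    return $check->($value) ? 1 : 0 if $type eq 'CODE';
    return $value =~ $check ? 1 : 0 if $type eq 'Regexp';
    die "Unsupported check type: " . ($type || 'plain scalar');
}

# Hypothetical declaration: field name => list of checks, all must pass.
my %rules = (
    id    => [ qr/\A\d{5,9}[ABC]\z/ ],
    count => [ qr/\A\d+\z/, sub { $_[0] <= 25 } ],
);

# Returns the names of fields that are missing or fail at least one check.
sub failed_fields {
    my ($rules, $input) = @_;
    my @failed;
    for my $field (sort keys %$rules) {
        my $value = $input->{$field};
        my $ok    = defined $value;
        for my $check (@{ $rules->{$field} }) {
            last unless $ok;
            $ok = run_check($check, $value);
        }
        push @failed, $field unless $ok;
    }
    return @failed;
}
```

Adding another parameter is then one more line in %rules, and a validation routine shipped with another module can be plugged in as a coderef.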



        My code doesn't have bugs, it just develops random features.

        Flame ~ Lead Programmer: GMS | GMS

Re: Data Validation Tests
by Abigail-II (Bishop) on Jan 26, 2003 at 23:56 UTC
    I would like to have all of those tests in Regexp::Common. Some of the proposed tests are already part of Regexp::Common, like the numeric tests (integers, reals) and the IP addresses. Some URI classes are in there as well (http, ftp, tel, fax and tv at the moment).

    But doing validation is a lot harder than you think. You need to find authoritative documentation, and regexes are hard to test right. There are many, many URI schemes, but only a few have an RFC that isn't ambiguous, unclear about which of several conflicting RFCs it imports its terms from, or superseded by an RFC that no longer defines the scheme. A lot of schemes are only documented in internet drafts, of which the latest expired years ago. As for testing, you have to consider lots of cases, and combinations of cases, and also a lot of cases where the regex should fail. Two weeks ago, I redid the test suite for http URIs, which is actually one of the better defined URI schemes, and it took me two full nights to get it all working. It did turn up a few bugs as well.

    I've wanted to add dates to Regexp::Common for quite some time as well, but where do you start? There are so many forms to choose from. Perhaps start with dates in ISO format? It sounds simple, until you actually read the 33 page specification.

    Email addresses.... One day, they will be part of Regexp::Common. I've done them using Parse::RecDescent (in RFC::RFC822::Address), and it won't be a pretty regex, as it will be recursive. I haven't had the guts to do this beast yet.

    I don't think valid credit card numbers would be hard - but I lack their specification. If you can provide me with it, I'll add it to Regexp::Common (but the spec should be better than "a 14 digit number").

    Send me regexes and specifications, preferably with an extensive test suite, and I'll add it to Regexp::Common (current version: 2.104, 87 patterns in 11 classes, 156778 tests in 30 files).

    Abigail

      >the spec should be better than "a 14 digit number"

      Actually even that would be wrong.

      The spec for credit cards includes 13- to 16-digit numbers as well.

      The spec would be, roughly, "a 13-to-16-digit Luhn number beginning with one of a list of prefixes."

      There's an article about it here and a Perl implementation for checking here
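
For reference, the Luhn part of that rough spec can be sketched as below. This is not the implementation linked above; the function name is made up, and the issuer prefix list is deliberately left out because no prefixes are given in this thread.

```perl
use strict;
use warnings;

# Luhn checksum over a card-like number; spaces and hyphens are ignored.
# The 13-to-16-digit limit follows the rough spec quoted above.
# Issuer prefixes are NOT checked here.
sub passes_luhn {
    my ($number) = @_;
    return 0 unless defined $number;
    (my $digits = $number) =~ s/[\s-]//g;
    return 0 unless $digits =~ /\A[0-9]{13,16}\z/;

    my $sum    = 0;
    my $double = 0;                      # rightmost digit is not doubled
    for my $d (reverse split //, $digits) {
        my $n = $double ? $d * 2 : $d;
        $n -= 9 if $n > 9;               # same as adding the two digits
        $sum += $n;
        $double = !$double;
    }
    return $sum % 10 == 0 ? 1 : 0;
}
```

Passing this check only means the number looks like a card number; it says nothing about whether the account exists.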
      --
      “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D

Re: Data Validation Tests
by shotgunefx (Parson) on Jan 27, 2003 at 01:30 UTC
    I created a similar work for a project once. One thing I would suggest is optional (min,max) parameters. What they would do would depend on the context.

    For int, float, other numbers, it could be used to make sure they are within a given range.

    For date types that the time is within a given period.

    For text that the length is within the allowed limits.
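
One way to read the (min,max) suggestion is a single bounds helper reused per context. The names below are hypothetical, and treating an undefined bound as "no limit" is an assumption made for the sketch. Dates are omitted; they would need to be converted to a comparable form first.

```perl
use strict;
use warnings;

# Hypothetical helper: undef for $min or $max means "no limit on that side".
sub within {
    my ($n, $min, $max) = @_;
    return 0 if defined $min && $n < $min;
    return 0 if defined $max && $n > $max;
    return 1;
}

# Numbers: the value itself must fall inside the range.
sub int_in_range {
    my ($value, $min, $max) = @_;
    return 0 unless defined $value && $value =~ /\A[+-]?\d+\z/;
    return within($value, $min, $max);
}

# Text: the length must fall inside the range.
sub text_in_length {
    my ($value, $min, $max) = @_;
    return 0 unless defined $value;
    return within(length($value), $min, $max);
}
```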

    Also I would add TEXT and ALPHANUMERIC /^[\w-]+$/ to your type list.

    -Lee

    "To be civilized is to deny one's nature."
Why reinvent the wheel?
by autarch (Hermit) on Jan 27, 2003 at 04:29 UTC

    Please, before you release yet another redundant module (YARM), consider whether what you want wouldn't be better done as patches to existing modules, including, but not limited to ...

    Params::Validate
    Data::Verify
    Data::FormValidator
    Data::Validator::Item
    Class::ParamParser & Class::ParmList
    Getargs::Long

    Plus related modules like Regexp::Common, Email::Valid, Business::CreditCard, and more.

    Yes, your proposed API is somewhat different from the existing offerings. But is it so different that it offers completely unique functionality? I don't think so. In fact, it seems fairly close to Data::Validator::Item.

    One of the big problems with CPAN is that people just go ahead and upload more and more modules that basically do the same thing as everything else, just with a different API. Some categories are overwhelmingly full (DBI wrappers, for example). Parameter validation isn't quite at that point yet, but it's getting close.

    If there's something you really want that doesn't exist, pick the existing module you like best, and offer the author patches. If that doesn't work out, then consider creating your own module.

      You have a valid argument, and I haven't counted out simply taking what I have and converting it to a patch. Then again, I am also looking at becoming 'glue' by attaching the best elements I can find of each, as well as using other modules to perform tests. For example, the date tests I hope to do with Date::Parse, part of the TimeDate package. Though I admit, this does mean that there would be a great deal of 'recommended' modules along with D::V::O.

      Thank you for the advice. Even if this is eventually rejected by the community as a whole, it will still be practice, and another opportunity to expand my skill with perl. (Flame: Remembers 2 "OOh, what does THIS do"'s that occurred during the development up to this point)

      Edit: Oh, I would like to know what similarities you see between this and Data::Validator::Item



      My code doesn't have bugs, it just develops random features.

      Flame ~ Lead Programmer: GMS (DOWN) | GMS (DOWN)

        I just repeat myself. But I invested much effort into "Data::Type" (was Data::Verify) which actually does all what you are talking about. It encapsulates many CPAN "value type" checking modules already.

          Business::ISSN 0.90 by ISSN
          Locale::SubCountry by LOCALE::COUNTRYCODE, LOCALE::COUNTRYNAME, LOCALE::REGIONCODE, LOCALE::REGIONNAME
          Net::IPv6Addr 0.2 by IP
          Locale::Language 2.02 by LOCALE::LANGCODE, LOCALE::LANGNAME
          Business::CINS 1.13 by CINS
          Email::Valid 0.14 by EMAIL
          Date::Parse 2.23 by DATE
          Business::CreditCard 0.27 by CREDITCARD
          Regexp::Common 2.104 by INT, IP, QUOTED, REAL, URI
          Business::UPC 0.02 by UPC

        An update will hit CPAN/SF.net soon. From the pod:

        Data::Type x.x.x supports 44 types:

          BINARY                      - binary code
          BOOL                        - a true or false value
          CINS                 0.1.3  - a CUSIP International Numbering System Number
          BIO::CODON           0.1.3  - a DNA (default) or RNA nucleoside triphosphates triplet
          LOCALE::COUNTRYCODE  0.1.5  - country code
          LOCALE::COUNTRYNAME  0.1.5  - country name
          CREDITCARD                  - is one of a set of creditcard type (DINERS, BANKCARD, VISA, ..
          DATE                 0.1.1  - a date (mysql or Date::Parse conform)
          DATETIME                    - a date and time combination
          DEFINED              0.1.4  - a defined (not undef) value
          DK::YESNO                   - a simple answer (ja, nein)
          BIO::DNA             0.1.3  - a dna sequence
          DOMAIN               0.1.4  - a network domain name
          EMAIL                       - an email address
          ENUM                        - a member of an enumeration
          GENDER                      - a gender male, female
          HEX                         - hexadecimal code
          INT                         - an integer
          IP                   0.1.4  - an IP (V4, V6, MAC) network address
          ISSN                 0.1.3  - an International Standard Serial Number
          LOCALE::LANGCODE     0.1.3  - a Locale::Language language code
          LOCALE::LANGNAME     0.1.3  - a language name
          LONGTEXT                    - text with a max length of 4294967295 (2^32 - 1) characters (..
          MEDIUMTEXT                  - text with a max length of 16777215 (2^24 - 1) characters (al..
          NUM                         - a number
          OS::PATH             0.1.6  - a path string (not really functional)
          PORT                 0.1.4  - a network port number
          QUOTED                      - a quoted string
          REAL                        - a real
          REF                         - a reference to a variable
          LOCALE::REGIONCODE   0.1.5  - region code
          LOCALE::REGIONNAME   0.1.5  - region name
          BIO::RNA             0.1.3  - a rna sequence
          SET                         - a set (can have a maximum of 64 members (mysql))
          TEXT                        - blob with a max length of 65535 (2^16 - 1) characters (alias..
          TIME                        - a time (mysql)
          TIMESTAMP                   - a timestamp (mysql)
          TINYTEXT                    - text with a max length of 255 (2^8 - 1) characters (alias my..
          UPC                  0.1.3  - standard (type-A) Universal Product Code
          URI                         - an http uri
          VARCHAR                     - a string with limited length of choice (default 60)
          WORD                        - a word (without spaces)
          YEAR                        - a year in 2- or 4-digit format
          YESNO                       - a simple answer (yes, no)

        And 4 filters:

          chomp  - chomps
          lc     - lower cases
          strip  - strip
          uc     - upper cases

        TYPES BY GROUP

          Locale
            LOCALE::COUNTRYCODE, LOCALE::COUNTRYNAME, LOCALE::LANGCODE, LOCALE::LANGNAME, LOCALE::REGIONCODE, LOCALE::REGIONNAME

          Logic
            BIO::CODON, BIO::DNA, BIO::RNA, DEFINED, DOMAIN, EMAIL, IP, OS::PATH, PORT, REF, URI

          Database
            Logic
              ENUM, SET
            Time or Date related
              DATE, DATETIME, TIME, TIMESTAMP, YEAR
            String
              LONGTEXT, MEDIUMTEXT, TEXT, TINYTEXT

          Business
            CINS, CREDITCARD, ISSN, UPC

          W3C
            String
              BINARY, HEX

          Numeric
            BOOL, INT, NUM, REAL

          String
            DK::YESNO, GENDER, QUOTED, VARCHAR, WORD, YESNO

        GROUP "Database"

          These are types identical to mysql database builtin types.

        CREDITCARD

          This type isn't tested at all and nobody should rely on it without rigorous testing.

          Supported are: 'Diners Club', 'Australian BankCard', 'VISA', 'Discover/Novus', 'JCB', 'MasterCard', 'Carte Blache', 'American Express'.

          They are parameterised as: DINERS, BANKCARD, VISA, DISCOVER, JCB, MASTERCARD, BLACHE, AMEX.

        CONTRIBUTIONS

          The author is happy to receive more types (formats) and add to this library. If you have a algorithm/regex for validating it, the better. Just email me.

        PREREQUISITES

          Class::Maker (0.05.10), Error (0.15), IO::Extended (0.05), Tie::ListKeyedHash (0.41), Iter (0) and for types

          Business::ISSN 0.90 by ISSN
          Locale::SubCountry by LOCALE::COUNTRYCODE, LOCALE::COUNTRYNAME, LOCALE::REGIONCODE, LOCALE::REGIONNAME
          Net::IPv6Addr 0.2 by IP
          Locale::Language 2.02 by LOCALE::LANGCODE, LOCALE::LANGNAME
          Business::CINS 1.13 by CINS
          Email::Valid 0.14 by EMAIL
          Date::Parse 2.23 by DATE
          Business::CreditCard 0.27 by CREDITCARD
          Regexp::Common 2.104 by INT, IP, QUOTED, REAL, URI
          Business::UPC 0.02 by UPC
Re: Data Validation Tests
by demerphq (Chancellor) on Jan 28, 2003 at 14:58 UTC
    Date - DD/MM/YYYY etc... perhaps check that's its possible

    You should be aware that this date format is particularly nonstandard and difficult to check. Use of ISO date formats YYYY/MM/DD or similar should be standard practice for all professional programmers. In fact I would even go so far as to suggest that a programmer that blindly allows DD/MM/YYYY even at the request of a client is doing the client a disservice. ISO standards and their relatives (DIN for example specifies the same format) are there for a reason.

    Incidentally, to back this claim up, consider that MM/DD/YYYY and DD/MM/YYYY are both popular written date formats. Unfortunately there is no way to determine if 02/03/2003 is the third of February or the second of March. There is no such ambiguity in YYYY/MM/DD. (Or if there is, I would argue it is significantly less likely to occur.)
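
The ambiguity can be made concrete with a small sketch that returns every calendar-valid reading of an NN/NN/YYYY string. The function names are hypothetical and the calendar check assumes Gregorian leap-year rules; it only illustrates the point above and is not how Date::Parse behaves.

```perl
use strict;
use warnings;

# Hypothetical helper: is ($y, $m, $d) a real calendar date (Gregorian rules)?
sub is_real_date {
    my ($y, $m, $d) = @_;
    return 0 if $m < 1 || $m > 12 || $d < 1;
    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    my $leap = ($y % 4 == 0 && $y % 100 != 0) || $y % 400 == 0;
    my $max  = $days[$m - 1] + ($m == 2 && $leap ? 1 : 0);
    return $d <= $max ? 1 : 0;
}

# Returns each valid reading of NN/NN/YYYY as a year-first string:
# first the DD/MM/YYYY reading, then the MM/DD/YYYY reading if it differs.
sub readings_of_slash_date {
    my ($text) = @_;
    my ($first, $second, $y) = $text =~ m{\A(\d{2})/(\d{2})/(\d{4})\z}
        or return;
    my @readings;
    push @readings, sprintf('%04d-%02d-%02d', $y, $second, $first)
        if is_real_date($y, $second, $first);       # read as DD/MM/YYYY
    push @readings, sprintf('%04d-%02d-%02d', $y, $first, $second)
        if $first != $second
        && is_real_date($y, $first, $second);       # read as MM/DD/YYYY
    return @readings;
}
```

For 02/03/2003 both readings survive, so nothing in the string itself can settle which one was meant; only inputs with a day above 12 are unambiguous.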

    --- demerphq
    my friends call me, usually because I'm late....

      That was primarily an example. I'll be employing Date::Parse in the test though, which I do believe supports that...

      Thanks for the correction though.



      My code doesn't have bugs, it just develops random features.

      Flame ~ Lead Programmer: GMS (DOWN) | GMS (DOWN)

Re: Data Validation Tests
by hsmyers (Canon) on Jan 29, 2003 at 19:39 UTC

    Given the amount of time I hang-out in book space, I would have liked to see something along these lines for ISBN numbers. When last I checked CPAN there was at least one solution (more actually) but most required something along the lines of use Kitchen::Sink; so after a bit of looking around I came up with:

        sub checkISBN {
            my @digits = split(//, uc(shift));
            my $n      = scalar(@digits);
            my $sum    = 0;
            my $m      = 10;
            my $cd;
            if ($n != 10) {
                return (0, ($n < 10 ? '-' : '+'));
            }
            else {
                for (0 .. @digits - 2) {
                    $sum += $digits[$_] * $m--;
                }
                $cd = qw(0 X 9 8 7 6 5 4 3 2 1)[$sum % 11];
                return ($cd eq $digits[-1], $cd);
            }
        }
    Don't know if this is what you had in mind, but I found it useful...

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
      Hmm, ISBN. Sounds reasonable. Thanks for the suggestion, and the sample. I'll see if I can work it into the plan.



      My code doesn't have bugs, it just develops random features.

      Flame ~ Lead Programmer: GMS (DOWN) | GMS (DOWN)

Node Type: perlmeditation [id://229891]
Approved by Tanalis
Front-paged by Trimbach