Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

RFC: Name and/or API for module ("Type::FromData")

by Corion (Pope)
on Apr 09, 2014 at 18:00 UTC ( #1081705=perlquestion: print w/ replies, xml ) Need Help??
Corion has asked for the wisdom of the Perl Monks concerning the following question:

Hello all!

I often deal with random data that is to be put into an SQL table. This data usually comes to me already in tabular structure, or as XML. But it almost certainly always lacks the type information needed to create a proper SQL table for it.

Lazy as I am, I wrote me a short and stupid module, which I can hand all values for a column and which then gives me back the "best" SQL type that fits all values in the column.

The usage of the code is something like the following:

use Type::FromData; my $dt= Type::FromData->new(); my @data= ( { fool => 1, when => '20140401', greeting => 'Hello', value => '1,05' }, { fool => 0, when => '20140402', greeting => 'World', value => '99,05' }, { fool => 0, when => '20140402', greeting => 'World', value => '9,005' }, ); $dt->guess( @$data ); print $dt->as_sql( table => 'test' ); # CREATE TABLE test # fool decimal( 1,0 ), # when date, # greeting varchar(5), # value decimal(5,2) # )

Before I release that code onto CPAN, I have three questions:

  1. Certainly, somebody has already written the class to describe an SQL type. Maybe from the Moo* followers, or some other people who write Type modules. But I couldn't find anything like that, and plain strings suit my (limited) use case quite fine, except if a date could be "null" (that is, the empty string). So where are the classes that implement the appropriate attributes (length for varchar resp. precision/decimals for decimal) and the stringification?
  2. As always, I am very bad at naming things. What would be a good name for this module? So far, I've stuck with Type::FromData, but the only kind of type it returns will be an SQL declaration string or a CREATE TABLE statement, and never an abstract type that a different SQL dialect might handle.
  3. Curently, the data type recognition is done by three regular expressions that deal with the three kinds of data I get, "dates", "numbers" and "strings". Maybe it would be interesting to expand
package Type::FromData; use strict; sub new { my( $class, %options )= @_; $options{ column_type } ||= {}; $options{ column_map } ||= { ";date" => 'date', ";decimal" => 'decimal(%2$d,%3$d)', ";varchar" => 'varchar(%1$d)', "date;" => 'date', "decimal;" => 'decimal(%2$d,%3$d)', "varchar;" => 'varchar(%1$d)', "varchar;date" => 'varchar(%1$d)', "varchar;decimal" => 'varchar(%1$d)', "varchar;varchar" => 'varchar(%1$d)', "date;decimal" => 'decimal(%2$d,%3$d)', "date;varchar" => 'varchar(%1$d)', "date;date" => 'date', "decimal;date" => 'decimal(%2$d,%3$d)', "decimal;varchar" => 'varchar(%1$d)', "decimal;decimal" => 'decimal(%2$d,%3$d)', ";" => '', }; bless \%options => $class; } sub column_type { $_[0]->{column_type} }; sub column_map { $_[0]->{column_map} }; sub guess_data_type { my( $self, $type, @values )= @_; my $column_map= $self->column_map; for my $value (@values) { my $old_type= $type; my $this_value_type= ''; my $pre= 0; my $post= 0; my $length= length $value || 0; if( ! defined $value or $value =~ /^$/) { # ... nothing to guess here } elsif( $value =~ /^((?:19|20)\d\d)-?(0\d|1[012])-?([012]\d|3 +[01])$/) { $this_value_type= 'date'; $pre= 8; } elsif( $value =~ /^[+-]?(\d+)$/) { $this_value_type= 'decimal'; $pre= length( $1 ); $post= 0; } elsif( $value =~ /^[+-]?(\d+)\.(\d+)$/) { $this_value_type= 'decimal'; $pre= length( $1 ); $post= length( $2 ); } else { $this_value_type= 'varchar'; }; if( $type ) { if( $type =~ s/\s*\((\d+)\)// ) { $length= $1 > $length ? $1 : $length; } elsif( $type =~ s/\s*\((\d+),(\d+)\)// ) { my( $new_prec, $new_post )= ($1,$2); my $new_pre= $new_prec - $new_post; $pre= $new_pre > $pre ? $new_pre : $pre; $post= $2 > $post ? $2 : $post; }; } else { $type= ''; }; if( $type ne $this_value_type ) { if( not exists $column_map->{ "$type;$this_value_type" }) +{ die "Unknown transition '$type' => '$this_value_type'" +; }; }; $type= sprintf $column_map->{ "$type;$this_value_type" }, $len +gth, $pre+$post, $post; }; $type }; sub guess { my( $self, @records )= @_; my $column_type= $self->column_type; for my $row (@records) { for my $col (keys %$row) { my( $new_type )= $self->guess_data_type($column_type->{$co +l}, $row->{ $col }); if( $new_type ne ($column_type->{ $col } || '')) { #print sprintf "%s: %s => %s ('%s')\n", # $col, ($column_type{ $col } || 'unknown'), ($new_ +type || 'unknown'), $info->{$col}; $column_type->{ $col }= $new_type; }; } } } sub as_sql { my( $self, %options )= @_; my $table= $options{ table }; my $user= defined $options{ user } ? "$options{ user }." : '' ; my $column_type= $self->column_type; $options{ columns }||= [ sort keys %{ $column_type } ]; my $columns= join ",\n ", map { "$_ $column_type->{ $_ }" } + @{ $options{ columns }}; my($sql)= <<SQL; create table $user$table ( $columns ); SQL return $sql; } 1;

Comment on RFC: Name and/or API for module ("Type::FromData")
Select or Download Code
Re: RFC: Name and/or API for module ("Type::FromData")
by InfiniteSilence (Curate) on Apr 09, 2014 at 20:19 UTC

    The first practical problem I run into with this module is that the user really isn't free to choose any name for a field -- in your example the use of 'when' in unacceptable in PostgreSQL. Perhaps some kind of target setting like 'forMYSQL' or 'forPostgres' would generate some kind of error when a reserved word is utilized for a field name (or maybe munge it with something to make it acceptable).

    Second, I would prefer to be able to override the regexes that define what a particular thing is -- a date, a numeric type, or a string -- with my own regexes.

    Third, some RDBMS allow for custom data types to be defined. This module should have some allowance for those.

    Otherwise I think this module is a good idea. The name seems okay. I would slap some pod into it and give it a very low version number.

    BTW, this line is wrong:

    $dt->guess( @$data );
    Should be,
    $dt->guess( @data );

    Celebrate Intellectual Diversity

      Indeed - I have thought about extensibility and the SQL dialects, but I really think those should come after there is some generic type definition instead of strings getting passed around. But I don't want to write the code for the generic data type myself :)

      WHEN is not a good name as it is a keyword in the CASE statement.
Re: RFC: Name and/or API for module ("Type::FromData") ( infer sql best guess cast type match sqltranslator )
by Anonymous Monk on Apr 09, 2014 at 22:30 UTC

    After reading this thread immediately thought of sqlt, looked inside, no premade thing :) so off to search I went

    So my one idea (best of the bunch) is to stick it in SQL::Translator::Parser::GuessBestCastTypeMatchPerl , base it on SQL::Translator::Parser::JSON, make it take either perl ref or perl string to safe undumper :) ... so that illegal field names are SQL::Translator::Producer problem (not yours)

    Another is Data::InferBestGuessSQLSchemaType but pause says Avoid the too-general nouns like Devel, Sys, Text, Data

    ?? Send a patch to stick it inside Type::Utils making sure that all the keywords are represented in the docs for search purposes, but Type doesn't feel better than Data -- but yeah, patches are hassles :)

    So maybe SQL::Abstract::InferSchemaTypesByBestMatchGuessFromPerlData :)

    I definitely feel sql/schema/type/infer ought to be in the name some how, and the all the keywords in the  =head1 NAME  ... NAME - Perl extension ...

      Thank you very much for your research!

      I think illegal names are mostly a problem of the output and/or input, and not of this module, but that is also why I want to move away from producing strings as output myself.

      I'm a bit wary of pulling in the prerequisites of SQL::Translator, for what should be fairly simple functionality. In the long run, maybe the module should just produce a data structure that SQL::Translator can consume. Most likely, that would be something duck-typed to what SQL::Translator::Parser::JSON produces... SQL::Translator::Schema::Field looks like a good target/output interface though!

      From your discussion of names, I think Guess or Infer should be a relevant keyword, and Data and Type are also very relevant. Names are hard :-/

      You can create a sub in Perl called foo*bar. (Yes, including the asterisk.) You can't do it the normal way like this, because that's bad syntax:

      sub foo*bar { return 42; } print foo*bar(), "\n";

      But all is fine and dandy if you quote the name properly (i.e. use symbolic references):

      my $name = "foo*bar"; *$name = sub { return 42; }; print &$name(), "\n";

      Similarly, PostgreSQL is perfectly happy for you to name a column "when". It's just that the syntax of SQL requires you to quote it.

      $ psql tai psql (9.1.13) Type "help" for help. tai=# CREATE TABLE "foo" ("when" varchar, "select" varchar, "update" v +archar); CREATE TABLE tai=# INSERT INTO "foo" VALUES ('a','b','c'); INSERT 0 1 tai=# SELECT * FROM "foo"; when | select | update ------+--------+-------- a | b | c (1 row) tai=# DROP TABLE "foo"; DROP TABLE tai=#

      (For MySQL, the quote character is ` instead of ", though MySQL does have an option to let it speak proper, grown-up SQL.)

      If you're writing code that needs to deal with an unknown database schema, you should always quote SQL identifier names (e.g. table names, column names) because you don't know what kind of crazy stuff is going to get thrown at you.

      use Moops; class Cow :rw { has name => (default => 'Ermintrude') }; say Cow->new->name
        Quoting retains case, meaning it will have to be quoted when used, which can be quite inconvenient and very confusing.
Re: RFC: Name and/or API for module ("Type::FromData")
by tobyink (Abbot) on Apr 10, 2014 at 10:44 UTC

    "SQL::Type::Guess"?

    If it had the ability to integrate with DBI (for example, once it's finished guessing, to actually create the table and pump the data in), or if you plan to add such a feature, then I'd recommend something in the "DBIx::*" namespace, which is generally used for extensions to DBI.

    use Moops; class Cow :rw { has name => (default => 'Ermintrude') }; say Cow->new->name

      Thank you for the name suggestion! I usually come up with three level names, but I am under the impression that they are considered too unwieldly by other people more adept with naming things. But I like the name...

      I try to keep the functionality of generating SQL and executing SQL separate, as usually I prefer to review DDL and manually run it with a user that has the appropriate permissions. DBI might still be needed to import the appropriate SQL_ type constants, but that wouldn't imply the DBIx namespace for me.

        Perhaps you can use two different names. One for the actual name, being of three or more parts, another to group such things together. That is:

        SQL::Lazy (or just, Lazy, first adding Lazy::SQL). Then adding more modules to it be to truly lazy, (and eventually renaming it to Perl.:P )

        I think this is a good case for a 3-part name. I could see other modules ending up in the SQL::Type bucket. And SQL::Type::Guess is quite short and seems pretty precisely "on the nose" at the purpose of this module.

        - tye        

Re: RFC: Name and/or API for module ("Type::FromData") Released as SQL::Type::Guess
by Corion (Pope) on May 02, 2014 at 18:45 UTC

    The module has been released as SQL::Type::Guess.

    And, as usual, just after the release, I realize that I haven't documented the constructor outside of the synopsis. That's what v0.02 is for.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1081705]
Approved by GotToBTru
Front-paged by GotToBTru
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (8)
As of 2014-07-29 07:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (211 votes), past polls