PerlMonks  

pluggable/dynamic data processing/munging/transforming module?

by rwstauner (Acolyte)
on Nov 16, 2010 at 18:44 UTC (#871798, perlquestion)
rwstauner has asked for the wisdom of the Perl Monks concerning the following question:

Oh wise and varied monks, I come seeking the existence of yet unknown modules.

I have to clean up some gross, ancient code at $work, and before I try to make a new module I'd love to use an existing one if anyone knows of something appropriate.

At runtime I am parsing a file to determine what processing I need to do on a set of data.

If I were to write a module I would try to do it more generically (non-DBI-specific), but my exact use case is this:

I read a SQL file to determine the query to run against the database. I parse comments at the top and determine that
  • column A needs to have a s/// applied,
  • column B needs to be transformed to look like a date of given format,
  • column C gets a sort of tr///.
  • Additionally things can be chained so that column D might s///, then say if it isn't 1 or 2, set it to 3.

So when fetching from the db the program applies the various (possibly stacked) transformations before returning the data.
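To make the stacking idea concrete, here is a minimal sketch, assuming each column simply maps to an ordered list of code refs. The `%stack` table, the `apply_stack` name, and the specific regexes are all illustrative assumptions, not the existing code at $work.

```perl
use strict;
use warnings;

# Hypothetical per-column stacks: column name => ordered list of code refs.
my %stack = (
    A => [ sub { (my $v = $_[0]) =~ s/hi/hello/g; $v } ],                        # a s///
    B => [ sub { $_[0] =~ /^(\d{4})(\d{2})(\d{2})$/ ? "$1-$2-$3" : $_[0] } ],    # date-like
    C => [ sub { (my $v = $_[0]) =~ tr/a-z/A-Z/; $v } ],                         # a tr///
    D => [
        sub { (my $v = $_[0]) =~ s/\D//g; $v },    # first a s///
        sub { $_[0] =~ /^[12]$/ ? $_[0] : 3 },     # then: if it isn't 1 or 2, set it to 3
    ],
);

# Apply each column's stack, in order, to a copy of one fetched row (a hash ref).
sub apply_stack {
    my ($row) = @_;
    my %out = %$row;
    for my $col (keys %out) {
        next unless defined $out{$col};    # leave NULLs alone
        $out{$col} = $_->($out{$col}) for @{ $stack{$col} || [] };
    }
    return \%out;
}

my $clean = apply_stack({ A => 'hi there', B => '20101015', C => 'abc', D => 'x7' });
```

With DBI, something like this could be called on each hash ref returned by `fetchrow_hashref` before the row is handed back to the caller.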

Currently the code is a disgustingly large and difficult series of if clauses processing arrays of instructions that are hideously difficult to read or maintain.

So what I'm imagining is perhaps an object that will parse those lines (and additionally expose a functional interface), stack up the list of processors to apply, then be able to execute it on a passed piece of data.

Optionally there could be a name/category option, so that one object could be used dynamically to stack processors only for the given name/category/column.

A traditionally contrived example:

    $obj = $module->new();
    $obj->parse("-- greeting:gsub: /hi/hello");  # don't say "hi"
    $obj->parse('-- numbers:gsub: /\D//');  # digits only
    $obj->parse("-- numbers:exchange: 1,2,3 one,two,three");  # then spell out the numbers
    $obj->parse("-- when:date: %Y-%m-%d 08:00:00");  # format like a date, force to 8am
    $obj->stack(action => 'gsub', name => 'when', format => '/1995/1996/');  # my company does not recognize the year 1995.
    $cleaned = $obj->apply({greeting => "good morning", numbers => "t2", when => "2010116"});

Each processor (gsub, date, exchange) would be a separate subroutine. Plugins could be defined to add more by name.

    $obj->define("chew", \&CookieMonster::chew);
    $obj->parse("column:chew: 3x");  # chew the column 3 times
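One way the named-processor idea might look underneath is a plain dispatch table. This is only a sketch: `%processor`, `define`, and `parse_line` are hypothetical names, the `/pattern/replacement/` argument format is guessed from the examples above, and the `chew` body is a stand-in since the real `CookieMonster::chew` isn't shown.

```perl
use strict;
use warnings;

# Hypothetical registry: processor name => code ref taking ($value, $args).
my %processor = (
    gsub => sub {
        my ($value, $args) = @_;
        # "/pattern/replacement/" => ('', pattern, replacement, ...)
        my (undef, $pattern, $replacement) = split m{/}, $args, -1;
        $replacement = '' unless defined $replacement;
        $value =~ s/$pattern/$replacement/g;
        return $value;
    },
);

# Register an additional processor by name.
sub define {
    my ($name, $code) = @_;
    $processor{$name} = $code;
}

# Split a "-- column:action: args" line into its three parts.
sub parse_line {
    my ($line) = @_;
    my ($column, $action, $args) = $line =~ /^(?:--\s*)?(\w+):(\w+):\s*(.*)$/
        or die "unparseable instruction: $line";
    die "unknown processor: $action" unless $processor{$action};
    return ($column, $action, $args);
}

# Stand-in for CookieMonster::chew: drop the last character N times ("3x" => 3).
define(chew => sub {
    my ($value, $args) = @_;
    my ($times) = $args =~ /^(\d+)x$/ or die "bad chew args: $args";
    chop $value for 1 .. $times;
    return $value;
});

my ($column, $action, $args) = parse_line('column:chew: 3x');
my $chewed = $processor{$action}->('cookies', $args);
```

Keeping the registry as a hash means a plugin is just a name and a code ref, and unknown names can be rejected at parse time rather than when the data is processed.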

So the obvious first question is, does anybody know of a module out there that I could use? About the only thing I was able to find so far is Hash::Transform, but since I would be determining which processing to do dynamically at runtime I would always end up using the "complex" option and I'd still have to build the parser/stacker.

Is anybody aware of any similar modules or even a mildly related module that I might want to utilize/wrap?

If there's nothing generic out there for public consumption (surely mine is not the only one in the darkpan), does anybody have any advice for things to keep in mind or interface suggestions or even other possible uses besides munging the return of data from DBI, Text::CSV, etc?

If I end up writing a new module, does anybody have namespace suggestions? I think something under Data:: is probably appropriate... the word "pluggable" keeps coming to mind because my use case reminds me of PAM, but I really don't have any good ideas...

  • Data::Processor::Pluggable ?
  • Data::Munging::Configurable ?
  • I::Chew::Data ?

Re: pluggable/dynamic data processing/munging/transforming module?
by aquarium (Curate) on Nov 16, 2010 at 23:40 UTC
    Sounds like not just the program is a hack, but also the table design. In my opinion, any transformation applied to a column should be applicable to every value/row of that column, so you should end up with (if any) just functions that apply to certain columns. I say "if" because there's quite a lot you can achieve directly in SQL, which would obviate the need to apply functions procedurally one row at a time. If you could achieve that, it would be much cleaner all round.
    the hardest line to type correctly is: stty erase ^H

      Thanks for the suggestion. I agree that fixing the data at the source would be optimal, but it doesn't apply to my current situation.

      The entire purpose of my application is to pull data from (various) outside sources and bring it inside to save it in our database.

      "Cleaning" the data is a necessary part of the process.

      Some examples:

      • '1969-12-31 23:59:59' is a dummy value and not an actual date (think -1), so I'd prefer to transform it to NULL before filling my own database with garbage. (But this transformation only applies to one external source.)
      • 20101015 is an integer, not a date. I'd prefer '2010-10-15'. (This example is obviously from a different source.)
      • 'D129', '  D129', 'D129  ' all mean the same thing. I'd prefer the trimmed version.
      • The color 'MAROO' probably means 'Maroon', but looks a little silly.
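The four examples above could each be a small standalone sub, roughly like the sketch below. The sub names and the `%expand` lookup table are assumptions made for illustration; which subs apply to which source would still come from configuration.

```perl
use strict;
use warnings;

# The dummy timestamp becomes undef, which DBI stores as NULL.
sub null_dummy_date {
    my ($v) = @_;
    return undef if defined $v && $v eq '1969-12-31 23:59:59';
    return $v;
}

# 20101015 => '2010-10-15'; anything else passes through unchanged.
sub int_to_date {
    my ($v) = @_;
    return $v unless defined $v && $v =~ /^(\d{4})(\d{2})(\d{2})$/;
    return "$1-$2-$3";
}

# Strip leading and trailing whitespace.
sub trim {
    my ($v) = @_;
    return $v unless defined $v;
    $v =~ s/^\s+|\s+$//g;
    return $v;
}

# Expand known truncated values; the lookup table is an assumption.
my %expand = (MAROO => 'Maroon');

sub expand_truncated {
    my ($v) = @_;
    return $expand{$v} if defined $v && exists $expand{$v};
    return $v;
}
```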
Re: pluggable/dynamic data processing/munging/transforming module?
by rwstauner (Acolyte) on Nov 17, 2010 at 23:14 UTC

    I just found Data::Transform on CPAN (somehow I missed that one) and it looks somewhat similar to what I'm looking for.

    I need to investigate to see if it is actually tied to POE, but I might be able to simply implement my own Map functions and use the Data::Transform::Stackable module.

      Just found Rule::Engine which also looks somewhat similar to what I was looking for.

      This one is brand new. I will investigate it as well.

Re: pluggable/dynamic data processing/munging/transforming module?
by rwstauner (Acolyte) on Jan 07, 2011 at 19:36 UTC

    Thanks to everyone for their thoughts.

    The short version: After trying to adapt a few existing modules I ended up abstracting my own: Sub::Chain. It needs some work, but is doing what I need so far.

    The long version: (an excerpt from the POD)

    =head1 RATIONALE

    This module started out as Data::Transform::Named, a named wrapper (like Sub::Chain::Named) around Data::Transform (and specifically Data::Transform::Map).

    As the module was nearly finished I realized I was using very little of Data::Transform (and its documentation suggested that I probably wouldn't want to use the only part that I was using). I also found that the output was not always what I expected. I decided that it seemed reasonable according to the likely purpose of Data::Transform, and this module simply needed to be different.

    So I attempted to think more abstractly and realized that the essence of the module was not tied to data transformation, but merely the succession of simple subroutine calls.

    I then found and considered Sub::Pipeline but needed to be able to use the same named subroutine with different arguments in a single chain, so it seemed easier to me to stick with the code I had written and just rename it and abstract it a bit further.

    I also looked into Rule::Engine which was beginning development at the time I was searching. However, like Data::Transform, it seemed more complex than what I needed. When I saw that Rule::Engine was using (the very excellent) Moose I decided to pass since I was doing work on a number of very old machines with old distros and old perls and constrained resources. Again, it just seemed to be much more than what I was looking for.

    =cut

    As for the "parse" method in my original idea/example, I haven't found that to be necessary, and am currently using syntax like

    $chain->append($sub, \@arguments, \%options)
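For readers who want to see the shape of that idea, here is a minimal sketch of a chain that accepts the same `append($sub, \@arguments, \%options)` call. It is not the Sub::Chain source: the `My::Chain` package, its `call` method, and the `gsub` helper are hypothetical, and the options hash is stored but ignored.

```perl
use strict;
use warnings;

package My::Chain;    # hypothetical name; not the Sub::Chain source

sub new {
    my ($class) = @_;
    return bless { links => [] }, $class;
}

# Store the sub together with its own extra arguments; returns $self for chaining.
sub append {
    my ($self, $sub, $args, $opts) = @_;
    push @{ $self->{links} }, [ $sub, $args || [], $opts || {} ];
    return $self;
}

# Feed the value through each link in order, passing that link's arguments.
sub call {
    my ($self, $value) = @_;
    for my $link (@{ $self->{links} }) {
        my ($sub, $args) = @$link;
        $value = $sub->($value, @$args);
    }
    return $value;
}

package main;

# Illustrative helper: global substitution with a pattern and a literal replacement.
sub gsub {
    my ($value, $pattern, $replacement) = @_;
    $value =~ s/$pattern/$replacement/g;
    return $value;
}

my $chain = My::Chain->new;
$chain->append(\&gsub, [ qr/\D/, '' ])            # digits only
      ->append(\&gsub, [ qr/^1995/, '1996' ]);    # same sub, different arguments
my $result = $chain->call('1995-06-01');
```

Because each link carries its own argument list, the same named subroutine can appear more than once in a chain with different arguments, which is the requirement mentioned in the RATIONALE above.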
