Efficient and modular filtering

by roju (Friar)
on May 27, 2004 at 15:32 UTC ( [id://356941] : perlquestion )

roju has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

I have a large dataset that needs conversion from one format to another. As there are several minor input formats and output formats, I've separated the input filter into input modules, and the output filter into output modules.

So right now my program looks (simplified) something like

    use Input::X;
    use Output::Y;

    my @data = Input::X::read();
    print Output::Y::get_output(@data);

Now, the datasets are large, and the last test run used something like 700 MB of RAM. While that's not so terrible, it's not great either.

Is there a way to preserve modularity and speed things up? It is possible to run like a typical Unix filter and output a line of output for every line of input; however, I want to avoid dependencies in the module code. Is there a 'standard' way to do so?

Again, I'd love to have it run as a traditional filter, but I'd also like to do so with minimal coupling between the input and output sides.

janitored by ybiC: Corrected "effiecient" mis-spelling in node title

Replies are listed 'Best First'.
Re: effiecient and modular filtering
by eric256 (Parson) on May 27, 2004 at 15:47 UTC

    Instead of having your input filter return the whole set of records, have it return a single record each time. So every time it's called it gets the next record, and if it's out of records it returns false. Then you could do something like

    while (my $new = Input::X::read()) { print Output::Y::get_output($new); }

    Your output filter needs to expect a single record in that case as well.
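
A minimal sketch of that record-at-a-time shape, assuming a comma-separated input and a tab-separated output. The names `make_reader`, `format_record` and `convert` are hypothetical stand-ins for whatever `Input::X` and `Output::Y` actually provide. The reader is a closure that yields one record per call, so the two sides share only the record structure:

```perl
use strict;
use warnings;

# Hypothetical reader: returns a closure that yields one record per call,
# and undef once the input is exhausted.
sub make_reader {
    my ($fh) = @_;
    return sub {
        my $line = <$fh>;
        return unless defined $line;      # end of input
        chomp $line;
        return [ split /,/, $line ];      # assumed format: comma-separated fields
    };
}

# Hypothetical writer: formats a single record as one line of output.
sub format_record {
    my ($record) = @_;
    return join("\t", @$record) . "\n";
}

# Driver: holds only one record in memory at a time.
sub convert {
    my ($in, $out) = @_;
    my $next = make_reader($in);
    while (defined(my $record = $next->())) {
        print {$out} format_record($record);
    }
    return;
}

# Demonstration on in-memory filehandles.
my $input  = "a,b,c\n1,2,3\n";
my $result = '';
open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$result or die "cannot open output: $!";
convert($in, $out);
close $out;
```

Testing the return value with `defined` rather than plain truth keeps a record that happens to evaluate as false from ending the loop early.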

    Eric Hodges
      I like it. Something similar occurred to me over lunch. It seems rather obvious now. Sigh. Thanks :)
Re: Efficient and modular filtering
by graff (Chancellor) on May 28, 2004 at 04:07 UTC
    Depending on your situation, your various input and output modules might work just as well if you use them as separate filters on the command line. Not knowing how complicated the filtering really is, something like this might (or might not) be a sensible model for usage:
    perl -MInput::X -e 'print while ($_ = Input::X::read)' input.x | \
      perl -MOutput::Y -pe '$_ = Output::Y::get_output($_)' > output.y
    In other words, if you have several different kinds of inputs and outputs, and your modules already address each one separately, then the rest of the "scripting" can be done with the shell command line.

    Maybe this would mean doing a little extra work to 'stringify' the intermediate data structure that is shared by the input and output modules. Maybe that would be worthwhile (or not).
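
One way the 'stringify' step might look, sketched with the core `JSON::PP` module. The thread does not say which serialization was intended, so the JSON choice, the helper names `record_to_line` and `line_to_record`, and the sample record are all assumptions. Each record becomes exactly one line of text, which is what lets the two halves talk over a pipe:

```perl
use strict;
use warnings;
use JSON::PP;

# canonical(1) sorts hash keys so the same record always serializes identically.
my $json = JSON::PP->new->canonical(1);

# Hypothetical helper for the input side: one record in, one line of text out.
sub record_to_line {
    my ($record) = @_;
    return $json->encode($record) . "\n";
}

# Hypothetical helper for the output side: one line of text in, one record out.
sub line_to_record {
    my ($line) = @_;
    chomp $line;
    return $json->decode($line);
}

# Round trip on a made-up sample record.
my $record = { id => 7, name => 'example', tags => [ 'a', 'b' ] };
my $line   = record_to_line($record);
my $copy   = line_to_record($line);
```

Newlines inside string values are escaped by the encoder, so a record never spans more than one line of the intermediate stream.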