PerlMonks
Re: The Eternal "filter.pl"

by moritz (Cardinal)
on Aug 25, 2011 at 16:40 UTC ( [id://922406] )


in reply to The Eternal "filter.pl"

I've done similar stuff many times over, though from your description it seems that you've done it much more often than me :-). I can certainly relate to that feeling that the repetition is bothersome, but often not quite enough to attack the problem properly.

It seems to me that the only truly common code is "parse this data source into a stream of records, where a 'record' is a list of consistently sequenced fields corresponding to a table definition."

To you that's not much, but for others that's enough to start a new hype around "map/reduce". The parsing step is basically a "map", and the filtering and aggregation is a "reduce".
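To make that framing concrete, here is a minimal sketch in plain Perl, using hypothetical pipe-delimited records with made-up field names (id, region, amount): the parse step is literally a `map`, and the filter-and-aggregate step is a `grep` plus a `reduce`.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Hypothetical raw input: three pipe-delimited lines.
my @lines = (
    "1|east|10.50",
    "2|west|4.25",
    "3|east|7.00",
);

# "map" step: parse each raw line into a record (array ref of fields).
my @records = map { [ split /\|/, $_ ] } @lines;

# "reduce" step: filter, then aggregate (sum of amount for region 'east').
my $total = reduce { $a + $b }
            0,
            map  { $_->[2] }
            grep { $_->[1] eq 'east' } @records;

print "east total: $total\n";   # prints: east total: 17.5
```

Swapping in a different data source only means changing the `map` block; the filtering and aggregation pipeline stays the same shape.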

As for your actual problem:

Or if you have a database to put it in. ... what if all you have is a pair of files about 3gig each

Can't you get a developer machine with a few hundred gig of free disc space, and set up your own private database into which you can import such files? I mean, come on, 2x 3gig ain't that much. The import will take some time, but you said yourself that time isn't the problem.

Or maybe you want something like an SQL engine that works on in-memory objects? If yes, DBI::DBD::SqlEngine looks promising, though I've never used it before.
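For context, a commonly used front end built on that engine is DBD::CSV, which lets you run SQL directly against flat files with no server at all. A hedged sketch, assuming a pipe-delimited file named feed.csv (with a header row) sitting in the current directory; the file name and columns here are invented:

```perl
use strict;
use warnings;
use DBI;

# Treat the current directory as a "database" of delimited files.
my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    f_dir        => '.',      # directory holding the "tables"
    f_ext        => '.csv',   # so table "feed" maps to file "feed.csv"
    csv_sep_char => '|',      # pipe-delimited, not comma
    RaiseError   => 1,
}) or die $DBI::errstr;

# Plain SQL over the flat file.
my $sth = $dbh->prepare(
    'SELECT region, COUNT(*) FROM feed GROUP BY region'
);
$sth->execute;
while (my @row = $sth->fetchrow_array) {
    print join("\t", @row), "\n";
}
```

How much SQL this supports depends on SQL::Statement underneath, so treat it as a sketch rather than a guarantee that any particular query will work.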

Replies are listed 'Best First'.
Re^2: The Eternal "filter.pl"
by Voronich (Hermit) on Aug 25, 2011 at 16:49 UTC

    I hate it. I hate it. I hate it. I hate that answer. "just get a database." Not you! I hate it because it IS the right answer. It's ALWAYS the right answer (for sufficient values of "always".) But it's a near firing offence if I do it as a skunkworks project and they won't allocate it willingly. Even if I were to add "nonclashing table names" into a currently unused dev database I'd get shot. It's the kind of illogic that causes stress fractures to develop in my skull as the steam escapes.

    Corion said the same thing re: map/reduce. I've heard of it, but I know nothing about it, as it always seems to be accompanied by that pie-eyed silver-bullet attitude I've become allergic to. Perhaps it's time to give it a serious look-see.

    Me

      If your desktop box isn't a wimpy laptop or dumb terminal, you should be able to run a temporary database locally, without giving it any access from the network or using external resources. Worst case, would you be allowed to use something like DBM::Deep with a flat file?


      Collect your set of scripts for importing the various types of datafile you commonly see, and run those on demand during the meeting. Once the meeting is over, drop the whole database.

      So when the meeting happens, you fire up the database, double-click your "import_feedfile.pl", and then run the first query. Question 2, you change the query. Question 3, you double-click "import_downstream_feedfile.pl", and then run your new query. Rinse and repeat.

      I.e.: don't keep scripts for one-off filtering; only keep scripts for importing the various files you'll see more than once, and do your actual queries with SQL.
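A sketch of what one such importer might look like with DBD::SQLite, which keeps the whole database in a single local file (no server, nothing on the network). The script name, file name, and column layout are all hypothetical:

```perl
#!/usr/bin/perl
# import_feedfile.pl -- hypothetical importer: load one pipe-delimited
# feed file into a throwaway SQLite database, then query it with SQL.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=meeting.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do('CREATE TABLE IF NOT EXISTS feed (id INTEGER, region TEXT, amount REAL)');

# The custom parsing for this feed's quirks lives here, once.
my $ins = $dbh->prepare('INSERT INTO feed (id, region, amount) VALUES (?, ?, ?)');
open my $fh, '<', 'feedfile.txt' or die "feedfile.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    $ins->execute(split /\|/, $line);
}
close $fh;
$dbh->commit;

# The "filter" is now just a query you edit between questions:
my $rows = $dbh->selectall_arrayref(
    'SELECT region, SUM(amount) FROM feed GROUP BY region');
print "@$_\n" for @$rows;
```

When the meeting is over, dropping the database is just deleting meeting.db.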

        Our desktops are actually little more than X connections with MS Office running. Installing anything that trips Microsoft's "installation" system is locked out, as are writes to most of the local disc and any external devices.

        I like the idea of using something like DBM::Deep. But I've never used any of those, so it's new stuff. If I can inject custom parse routines into them (the data formats are never quite so simple as plain CSV, and certainly not fixed-record files) then it's definitely a candidate.
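For what it's worth, DBM::Deep doesn't impose any parsing at all: you parse however you like and store the resulting Perl structures in a single flat file. A sketch under those assumptions, with an invented line format and a hypothetical parse_line routine standing in for the real custom parser:

```perl
use strict;
use warnings;
use DBM::Deep;

# Hypothetical custom parser for a not-quite-CSV format: "ID:field;field;..."
sub parse_line {
    my ($line) = @_;
    my ($id, $rest) = $line =~ /^(\d+):(.*)$/ or return;
    return { id => $id, fields => [ split /;/, $rest ] };
}

# Everything persists to one ordinary file on local disc.
my $db = DBM::Deep->new('records.db');
$db->{records} = [] unless $db->{records};

while (my $line = <STDIN>) {
    chomp $line;
    my $rec = parse_line($line) or next;
    push @{ $db->{records} }, $rec;
}
```

Since the parse routine is just a Perl sub, nothing stops you swapping in a different one per feed format.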

        Me
      ... it's a near firing offence if I do it as a skunkworks project and they won't allocate it willingly.

      Can you use DBD::SQLite?

        Certainly not legitimately. The allergy to external code here is silly. But frankly I need to solve the problem, so that's not going to stop me if I can help it.

        But that's a question best answered by experimentation. The trick with sqlite is getting it installed in the right place.

        Me
