The Eternal "filter.pl"
by Voronich (Hermit) on Aug 25, 2011 at 16:01 UTC
I mentioned this on CB a couple days ago (along with everything else under the sun, no doubt.) But I figured I should write it up.
For the last 20 years my work programming in various industries has always been accompanied by a responsibility for doing ad hoc data analysis. It doesn't matter what the industry is, there's always some task that sounds something like this:
(note: the examples are all complete hogwash. But their level of semantic complexity is pretty much dead on.)
"Hey Mike, we need the unique customers in the feed file that didn't have any billing activity in yesterday's file (but still existed) and do in today's file."
Fine. That's easily solved with a bit of SQL, or something thereabouts.
This is GREAT. If the data is in a database, or can be PUT in a database. Or if you have a database to put it in. But what if you don't? What if all you have is a pair of files about 3 gigs each, containing millions of records each, in a questionable format, like... "date~customer category~customer ID,foofield,billing amount,transaction id"
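For what it's worth, a parse routine for a layout like that might look roughly like the sketch below. The hash key names, the sample line, and the assumption that neither delimiter ever shows up inside a field are all mine, not anything from a real feed.

```perl
use strict;
use warnings;

# Field order follows the sample layout above; the key names are made up.
my @FIELDS = qw(date category customer_id foofield billing_amount transaction_id);

# Split one line on either delimiter ('~' or ',') and return a hashref,
# or nothing if the line does not have exactly the expected number of fields.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my @values = split /[~,]/, $line, -1;   # -1 keeps trailing empty fields
    return unless @values == @FIELDS;
    my %record;
    @record{@FIELDS} = @values;             # hash slice: names => values
    return \%record;
}

my $rec = parse_record("2011-08-24~retail~C1001,foo,12.50,T42\n");
print "$rec->{customer_id} billed $rec->{billing_amount}\n" if $rec;
```

If a field can legitimately contain a comma or a tilde, a plain split like this falls over and you need a real tokenizer.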
Ok, sure. Write a one-shot script that pulls the lookup list from the "yesterday" file by reading customers into a hash, aggregating billing amounts then junking everything >0 and then passes it against the second file. Again, not a big deal. Now you have the parse routine for that format kicking around someplace and a one-shot script that you called "filter.pl" because you wrote it while listening to the conference call. You send the results over to whomever and they say "hmm... this isn't right. What are the aggregate bill numbers of these?"
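The one-shot in question could be sketched something like this. It assumes the sample layout from above (customer ID in field 3, billing amount in field 5), treats "billing activity" as an amount greater than zero, and uses in-memory strings as stand-ins for the two big files. The sub names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical parser for the sample layout; returns (customer_id, billing_amount).
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my @f = split /[~,]/, $line, -1;
    return @f == 6 ? ($f[2], $f[4]) : ();
}

# Pass 1: total billing per customer, then junk everyone whose total is > 0.
sub idle_customers {
    my ($fh) = @_;
    my %total;
    while (my $line = <$fh>) {
        my ($id, $amount) = parse_line($line) or next;
        $total{$id} += $amount;
    }
    delete @total{ grep { $total{$_} > 0 } keys %total };
    return \%total;
}

# Pass 2: customers from the lookup that show billing activity today, each once.
sub newly_billed {
    my ($fh, $idle) = @_;
    my %seen;
    while (my $line = <$fh>) {
        my ($id, $amount) = parse_line($line) or next;
        $seen{$id} = 1 if exists $idle->{$id} && $amount > 0;
    }
    return sort keys %seen;
}

# Tiny in-memory stand-ins for the two 3-gig files.
my $yesterday = "d1~retail~C1,x,0,T1\nd1~retail~C2,x,5.00,T2\nd1~retail~C3,x,0,T3\n";
my $today     = "d2~retail~C1,x,9.99,T4\nd2~retail~C2,x,1.00,T5\nd2~retail~C4,x,3.00,T6\n";

open my $y_fh, '<', \$yesterday or die "cannot open yesterday: $!";
open my $t_fh, '<', \$today     or die "cannot open today: $!";

my $idle = idle_customers($y_fh);
print "$_\n" for newly_billed($t_fh, $idle);   # prints C1
```

Only the per-customer totals live in memory, so the size of the files matters much less than the number of distinct customers.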
"Hold on, I can get you that in a sec..." you say. After all, a couple levels of management are on the call and you're coming off like teh 1337 script-fu master.
Ok. cp filter.pl filter.old.pl
The original "lookup aggregator" function is almost what you need, so you pull that, remove the "junk everything >0" and feed it the list from the previous script, then point it at the second file again.
Bang. Another pass through the file and you have "customer: aggregate billing" and off it goes.
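That second incarnation is barely different from the first, which is rather the point. A rough sketch, again assuming the sample layout (customer ID in field 3, amount in field 5) and with the customer list hard-coded where the previous script's output would go:

```perl
use strict;
use warnings;

# Sum billing amounts for the customers named in $wanted (a hashref used as a set).
# Same hypothetical layout as before; no "junk everything >0" step this time.
sub aggregate_billing {
    my ($fh, $wanted) = @_;
    my %total;
    while (my $line = <$fh>) {
        chomp $line;
        my @f = split /[~,]/, $line, -1;
        next unless @f == 6 && exists $wanted->{ $f[2] };
        $total{ $f[2] } += $f[4];
    }
    return \%total;
}

# In-memory stand-in for today's file.
my $today = "d2~retail~C1,x,9.99,T4\nd2~retail~C1,x,0.01,T7\nd2~retail~C2,x,1.00,T5\n";
open my $fh, '<', \$today or die "cannot open today: $!";

my %wanted = map { $_ => 1 } qw(C1);   # the list produced by the previous run
my $total  = aggregate_billing($fh, \%wanted);
printf "%s: %.2f\n", $_, $total->{$_} for sort keys %$total;
```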
"erm... this isn't right. How does this compare against the downstream feed? This isn't what we sent."
Now the downstream feed is a different format. Still as easily parseable, but sufficiently different to be annoying. So you copy filter.pl to filter.old2.pl and start again, pulling bits from here and there, writing another filtering function, another parser, etc.
This is my last two weeks (and a good part of my last 20 years.) It includes things like "well, we want to parse the log files as they're generated for error messages on that customer, then look it up in the FooFile and email all that if the Frobnotz is null."
I reach for emacs and start "FilterStuff.pm" about once a month. Then I start planning it out in my head and I come up with... "a driver script that takes a rule file containing the input sources and their respective data formats along with a list of comparison rules comprised of set operators, basic math and aggregate functions that can be run, producing a data set that...."
Maybe I just have to organize my code better and put up a plaque that says "There's No Such Thing As A One Shot Script".
But there are a bunch of really common filters and rules that I come across:
The commonalities in the data itself are a bit disturbing:
But I still find myself duplicating effort repeatedly. For this particular set of files I have no less than 19 separate incarnations of "filter.pl". That's just stupid.
What am I looking for here? For any module to be sufficiently abstract to be useful, it would require almost the same amount of work to set up the "parsing rules and selection criteria" as it does to do so in a straight script.
Now I haven't touched on performance a bit. That frankly is because with these things? I just don't care. If I'm caught in a spot where I'm doing quadratic time inner loops (or worse, as I frequently have 4 or 5 files involved) then I either break it up or I push back and say "that's stupid, it would take months."
It seems to me that the only truly common code is "parse this datasource into a stream of records, where 'record' is a list of consistently sequenced fields corresponding to a table definition." A lot of the pre-canned options I've come across seem (and I may be dead off on this) to want random access primitives.
So is the only thing I'm going to get out of this a standard format for plugging record format parsers into a canned looping construct?
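If that really is all there is to get, a minimal version might be no more than a table of parsers and one loop. The sketch below is only that: the format names ("feed", "pipe"), their layouts, and the each_record name are invented, and records are plain arrayrefs of fields in a fixed order.

```perl
use strict;
use warnings;

# Registry of record-format parsers: format name => sub(line) returning an
# arrayref of fields in a fixed order, or undef for lines to skip.
# Both format names and layouts are hypothetical.
my %PARSER = (
    feed => sub { my @f = split /[~,]/, $_[0], -1; @f == 6 ? \@f : undef },
    pipe => sub { my @f = split /\|/,   $_[0], -1; @f >= 2 ? \@f : undef },
);

# The canned loop: stream records from a filehandle one at a time and hand
# each to a callback. No random access, nothing held in memory.
sub each_record {
    my ($fh, $format, $callback) = @_;
    my $parse = $PARSER{$format} or die "unknown format '$format'";
    while (my $line = <$fh>) {
        chomp $line;
        my $record = $parse->($line) or next;
        $callback->($record);
    }
    return;
}

# In-memory stand-in for a feed file, including one junk line.
my $data = "d1~retail~C1,x,5.00,T1\nnot a record\nd1~retail~C2,x,7.50,T2\n";
open my $fh, '<', \$data or die "cannot open data: $!";

my $sum = 0;
each_record($fh, 'feed', sub { $sum += $_[0][4] });
print "total billed: $sum\n";
```

All the filtering and aggregating still ends up in the callback, so this saves the parse-and-loop boilerplate and not much else.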