Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

The Eternal ""

by Voronich (Hermit)
on Aug 25, 2011 at 16:01 UTC ( #922393=perlmeditation: print w/replies, xml ) Need Help??

I mentioned this on CB a couple days ago (along with everything else under the sun, no doubt.) But I figured I should write it up.

For the last 20 years my work programming in various industries has always been accompanied with a responsibility for doing ad hoc data analysis. It doesn't matter what the industry is, there's always some task that sounds something like this:

(note: the examples are all complete hogwash. But their level of semantic complexity is pretty much dead on.)

"Hey Mike, we need the unique customers in the feed file that didn't have any billing activity in yesterdays file (but still existed) and do in today's file."

Fine. That's easily solved with:

select distinct today.customer from feed as yesterday, feed as today where yesterday.customer = today.customer and = getdate() and = getdate() -1 and yesterday.billing = 0 and today.billing != 0

Or something thereabouts.

This is GREAT. If the data is in a database, or can be PUT in a database. Or if you have a database to put it in. But what if you don't? what if all you have is a pair of files about 3gig each, containing millions of records each, in a questionableformat, like...

"date~customer category~customer ID,foofield,billing amount,transaction id"

Ok, sure. Write a one-shot script that pulls the lookup list from the "yesterday" by reading customers into a hash, aggregating billing amounts then junking everything >0 and then passes it against the second file. Again, not a big deal. Now you have the parse routine for that format kicking around someplace and a one-shot script that you called "" because you wrote it while listening to the conference call. You send the results over to whomever and they say "hmm... this isn't right. What are the aggregate bill numbers of these?"

"Hold on, I can get you that in a sec..." you say. After all, a couple levels of management are on the call and you're coming off like teh 1337 script-fu master.

Ok. cp

The original "lookup aggregator" function is almost what you need, so you pull that, remove the "junk everything >0" and feed it the list from the previous script, then point it at the second file again.

Bang. Another pass through the file and you have "customer: aggregate billing" and off it goes.


"erm... this isn't right. How does this compare against the downstream feed? This isn't what we sent."

Now the downstream feed is a different format. Still as easily parseable, but sufficiently different to be annoying. So you copy to and start again, pulling bits from here and there, writing another filtering function, another parser, etc.

This is my last two weeks (and a good part of my last 20 years.) It includes things like "well, we want to parse the log files as they're generated for error messages on that customer, then look it up in the FooFile and email all that if the Frobnotz is null."

I reach for emacs and start "" about once a month. then I start planning it out in my head and I come up with... "a driver script that takes a rule file containing the input sources and their respective data formats along with a list of comparison rules comprised of set operators, basic math and aggregate functions that can be run, producing a data set that...."


Maybe I just have to organize my code better and put up a plaque that says "There No Such Thing As A One Shot Script".

But there are bunch of really common filters and rules that I come across:

  • Values satisfying criteria A in the left file that (don't)? exist in the right hand file.
  • Filtering based on aggregated criteria
  • "just complex enough that I can't use diff" comparisons.

The commonalities in the data itself are a bit disturbing:

  • Very large data sets (individual files of several gigs in size. Sometimes live streaming data, as in running log files.)
  • Read only data (large data sets coupled with fragile environments so copies are not feasible.)
  • Delimited format (comma, pipe, tab, or horrible amalgams. I have one file format with 3 different "top level" delimiting characters.)
  • Some normalization into multiple files.
  • Nonsequenced data (i.e. must not rely on sorting of any kind.)

But I still find myself duplicating effort repeatedly. For this particular set of files I have no less than 19 separate incarnations of "". That's just stupid.

What am I looking for here? For any module to be sufficiently abstract to be useful, it would require almost the same amount of work to set up the "parsing rules and selection criteria" as it does to do so in a straight script.

Now I haven't touched on performance a bit. That frankly is because with these things? I just don't care. If I'm caught in a spot where I'm doing quadratic time inner loops (or worse, as I frequently have 4 or 5 files involved) then I either break it up or I push back and say "that's stupid, it would take months."

It seems to me that the only truly common code is "parse this datasource into a stream of records, where 'record' is a list of consistently sequenced fields corresponding to a table definition. A lot of the pre-canned options I've come across seem (and I may be dead off on this) to want random access primitives.

So is the only thing I'm going to get out of this a standard format for plugging record format parsers into a canned looping construct?

Replies are listed 'Best First'.
Re: The Eternal ""
by moritz (Cardinal) on Aug 25, 2011 at 16:40 UTC

    I've done similar stuff many times over, though from your description it seems that you've done it much more often than me :-). I can certainly relate to that feeling that the repetition is bothersome, but often not quite enough to attack the problem properly.

    It seems to me that the only truly common code is "parse this datasource into a stream of records, where 'record' is a list of consistently sequenced fields corresponding to a table definition.

    To you that's not much, but for others that's enough to start a new hype around "map/reduce". The parsing step is basically a "map", and the filtering and aggregation is a "reduce".

    As for your actual problem:

    Or if you have a database to put it in. ... what if all you have is a pair of files about 3gig each

    Can't you get a developer machine with a few hundred gig of free disc space, and set up your own private database into which you can import such files? I mean, come on, 2x 3gig ain't that much. The import will take some time, but you said yourself that time isn't the problem.

    Or maybe you want something like an SQL engine that works on in-memory objects? If yes, DBI::DBD::SqlEngine looks promising, though I've never used it before.

      I hate it. I hate it. I hate it. I hate that answer. "just get a database." Not you! I hate it because it IS the right answer. It's ALWAYS the right answer (for sufficient values of "always".) But it's a near firing offence if I do it as a skunkworks project and they won't allocate it willingly. Even if I were to add "nonclashing table names" into a currently unused dev database I'd get shot. It's the kind of illogic that causes stress fractures to develop in my skull as the steam escapes.

      Corion said the same thing re: map reduce. I've heard of it. But I know nothing about it as it seems always to be accompanied with that pie-eyed silver-bullet attitude I've become alergic to.Perhaps it's time to give it a serious look see.


        If your desktop box isn't a wimpy laptop or dumb terminal, you should be able to run a temporary database locally, without giving it any access from the network or using external resources. Worst case, would you be allowed to use something like DBM::Deep with a flat file?

        Collect your set of scripts for importing the various types of datafile you commonly see, and run those on demand during the meeting. Once the meeting is over, drop the whole database.

        So when the meeting happens, you fire up the database, doubleclick your "", and then run the first query. Question 2, you change the query. Question 3, you doubleclick "", and then run your new query. Rinse and repeat.

        IE: Don't have scripts for one-off filtering, only keep scripts for importing the various files you'll see more than once, and do your actual queries with SQL.

        ... it's a near firing offence if I do it as a skunkworks project and they won't allocate it willingly.

        Can you use DBD::SQLite?

Re: The Eternal ""
by armstd (Friar) on Aug 25, 2011 at 20:47 UTC

    Sounds similar to a project I worked on a few years ago. Describing typical use-models of our ClearCase workspaces. Or a bug. Like, how many builds in the workspace, for how many bugs, and how long did they take? We would track every interesting tool transaction in the workspace, trying to optimize productivity for our team of 4000 developers. We had bug records available from a custom web service, backed by Siebel, which was backed by Oracle. So no DBI there. Our usage was tracked in Oracle, behind a different custom webservice, so no DBI there either, and other datasources using a more traditional SQL data store, yay DBI. We kept periodic full dumps of the usage data in compressed text format for I/O efficiency, indexed by time. Millions of rows on that one, but sometimes we needed rows not downloaded yet, so off to the webservice for those.

    The challenge was taking each and every unique datasource across all of our development tools, and providing the capability to make cohesive reports, leveraging one or more of the data sources, depending on the query.

    Abstraction is key. Abstract your datasources, so you can query them all using the same API. Abstract your data model. A user object might consist of different fields from different tables in different data sources, and should have information on how to join a user across those disparate data sources.

    Finally, abstract your filtering. When you have a User object that knows its 'username' or 'userid' attributename in each datasource, it's easy to

    if( $user->getAttr( 'username' ) eq $filter->{'username'} )

    or even more flexibly and generically,

    &{$filter->{$field}->{'filterCallback'}}( $recordObj )

    Your caller can provide the disqualifying logic like

    $queryObj->filter( 'field' => 'username', $objList => \@records, 'callBack' => sub { $_[0]->getAttr( 'username' ) eq $value ) } )

    Being able to use Perl to do so is very powerful indeed. Once you can chop up, correlate, and filter your data, MapReduce can drive the query itself.

    Back in reality though... there are middleware products that do this kind of data source and reporting abstraction. Oracle Reports comes to mind. I'm sure Siebel has one too. Maybe there's a free one too by now. They'll allow you to hook in arbitrary data sources, describe the schema "generically", and tie it all up so you can make your arbitrary reports. I just never worked for a company that wanted to pay for those products when they had a "free" tool developer on staff. Good for me, not necessarily best for the company. Requirements always change, generic tools are typically better for that than custom in-house solutions.


      There was a time when I would have set out to do exactly that. But the problem I find with that approach is that it very quickly races past the point of diminishing returns, eventually becoming like J2EE where you have a two line program and 50k of 'configuration'. But it does force me to think about what I do and don't want to abstract away.

      I definitely want to abstract the simple "parse to records" logic into something that I can iterate across because the semantics are there and clean in perl already.

      But the iteration driver itself can be tricky and I'm not sure it shouldn't just have a few different permutations:

      • Iterate across record set, applying 'function'
      • Iterate across record set A, then applying each record to something in record set B via a provided function. (i.e. if A is a lookup list for B)
      • Iterate across both sets, doing something to both.

      Nah, see that's already messy as hell. Well ok. Maybe there's a reasonable way to make that work; or perhaps it's just not as bad as the description. (Actually I suspect that they're all degenerate cases of the same construct.) I'll have to see once I get everything else out of the way.

      Having thought about this for some time now I'm pretty sure I want the actual "filtering function" to be a plain perl function that takes a pair of records. Perl actually does code well and there's no reason for me to abstract everything SO far that I end up having to write a programming language.

      Because of that, I'm fighting even with the idea of whether or not a record parsing construct should provide column/datatype information. What good would it REALLY do? The filter function is likely to be very specific to the task at hand, so documentation (in the form of well-named variables, etc) can very effectively be contained therein and operations on the data would require that I re-interpret it. That also sounds a lot like writing a programming language (which I'm sure is fun, but I haven't come up with a good reason to do yet.)

      I think I may have thought my way as far as I'm going to think without writing more code.


        Hi. I'm glad to see it's not just me that's been thinking along these lines :)

        Not that I've come to any real conclusions, but I think that it's all just set theory. And the way to describe it may be those terms. i.e. sets, unions, intersections and complements.

        If all the records fit in memory in a hash it's reasonably easy to describe a set that passes some test function

        my @set_1 = grep { func($_) } keys %records_1;

        then you can describe the relationship you're looking for in those terms.

        So you might end up with something like this:-

        set1 = set of all records_1 that pass func() set2 = set of all records_2 that fail func() results = intersection of set1 and set2; set3 = ... etc
        and so you can describe any arbitrary combination of sets.

        We will needs some support functions , but that's 'just a simple matter of programming' ;)

        The main problem, as I see it , is how to deal with data sets that are too big to fit into memory. My only thought is to keep a hash of the (key, file offset) and re-parse each record on demand. Or maybe Tie::File could do the job ?

        Arrgh! - you're making me want to try code this again :)

Re: The Eternal ""
by deMize (Monk) on Aug 27, 2011 at 17:46 UTC
    The biggest concern I would have is taking that input and being able to accurately normalize it. I can't see your files, but if you could normalize the data into proper table structures, you have a hope in using some sort of software available at your fingertips (maybe even Excel).

    Sure you might not be able to do all your querying at once, but you could build your own MapReduce-like function from your tables.


      As strange as this sounds, normalization is actually a non-issue. ANY problems with the data and all bets are off immediately, as it would mean we have a very serious problem.

      hmm... aGAIN with the MapReduce. *sigh* I reckon I'ma gonna haveta go learn me that.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://922393]
Approved by Corion
Front-paged by Corion
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (3)
As of 2017-07-28 03:21 GMT
Find Nodes?
    Voting Booth?
    I came, I saw, I ...

    Results (424 votes). Check out past polls.