Re: The Eternal ""

by armstd (Friar)
on Aug 25, 2011 at 20:47 UTC

in reply to The Eternal ""

Sounds similar to a project I worked on a few years ago: describing typical use-models of our ClearCase workspaces, or of a bug. Like, how many builds in the workspace, for how many bugs, and how long did they take? We tracked every interesting tool transaction in the workspace, trying to optimize productivity for our team of 4000 developers. We had bug records available from a custom web service, backed by Siebel, which was backed by Oracle, so no DBI there. Our usage was tracked in Oracle behind a different custom web service, so no DBI there either, and other data sources used a more traditional SQL data store, so yay DBI. We kept periodic full dumps of the usage data in compressed text format for I/O efficiency, indexed by time. Millions of rows on that one, but sometimes we needed rows not downloaded yet, so off to the web service for those.

The challenge was taking every unique data source across all of our development tools and providing the capability to make cohesive reports, leveraging one or more of the data sources depending on the query.

Abstraction is key. Abstract your datasources, so you can query them all using the same API. Abstract your data model. A user object might consist of different fields from different tables in different data sources, and should have information on how to join a user across those disparate data sources.
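To make that concrete, here's a minimal sketch of what a uniform datasource API might look like. Everything in it is invented for illustration (the `DataSource` package, the `fetch` closure, the field names); the point is only that every backend hides behind the same `query()` call:

```perl
use strict;
use warnings;

# Hypothetical uniform datasource wrapper: every backend (DBI, web
# service, compressed text dump) is hidden behind the same query()
# method, so report code never cares where the rows come from.
package DataSource;
sub new {
    my ( $class, %args ) = @_;
    return bless { 'fetch' => $args{'fetch'} }, $class;
}
sub query {
    my ( $self ) = @_;
    return $self->{'fetch'}->();    # backend-specific fetch closure
}

package main;

# Two stand-in backends with different native field names for a user.
my $oracle = DataSource->new(
    'fetch' => sub { [ { 'userid' => 42 } ] } );
my $websvc = DataSource->new(
    'fetch' => sub { [ { 'login' => 'armstd' } ] } );

my @rows = map { @{ $_->query() } } ( $oracle, $websvc );
print scalar(@rows), " rows through one API\n";   # 2 rows through one API
```

The data-model layer would then sit on top of this, mapping `userid` in one source to `login` in the other so a user can be joined across them.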

Finally, abstract your filtering. When you have a User object that knows its 'username' or 'userid' attributename in each datasource, it's easy to

if( $user->getAttr( 'username' ) eq $filter->{'username'} )

or even more flexibly and generically,

&{$filter->{$field}->{'filterCallback'}}( $recordObj )

Your caller can provide the disqualifying logic like

$queryObj->filter( 'field' => 'username', 'objList' => \@records, 'callBack' => sub { $_[0]->getAttr( 'username' ) eq $value } );

Being able to use Perl to do so is very powerful indeed. Once you can chop up, correlate, and filter your data, MapReduce can drive the query itself.
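A sketch of what such a `filter()` might look like under the hood: just `grep` driving the caller-supplied callback. The `filter`/`objList`/`callBack` names follow the example above; the `Record` class and everything else is invented for illustration:

```perl
use strict;
use warnings;

# Tiny stand-in record class with the getAttr() accessor used above.
package Record;
sub new     { my ( $class, %attrs ) = @_; return bless { %attrs }, $class }
sub getAttr { return $_[0]->{ $_[1] } }

# Hypothetical query object: filter() keeps only the records the
# caller-supplied callback approves of.
package Query;
sub new    { return bless {}, $_[0] }
sub filter {
    my ( $self, %args ) = @_;
    my ( $records, $cb ) = @args{ 'objList', 'callBack' };
    return [ grep { $cb->($_) } @$records ];
}

package main;

my @records = map { Record->new( 'username' => $_ ) } qw( alice bob alice );
my $value   = 'alice';
my $hits    = Query->new->filter(
    'field'    => 'username',
    'objList'  => \@records,
    'callBack' => sub { $_[0]->getAttr( 'username' ) eq $value },
);
print scalar(@$hits), "\n";   # 2
```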

Back in reality though... there are middleware products that do this kind of data source and reporting abstraction. Oracle Reports comes to mind. I'm sure Siebel has one too. Maybe there's a free one by now. They'll allow you to hook in arbitrary data sources, describe the schema "generically", and tie it all up so you can make your arbitrary reports. I just never worked for a company that wanted to pay for those products when they had a "free" tool developer on staff. Good for me, not necessarily best for the company. Requirements always change, and generic tools are typically better for that than custom in-house solutions.


Replies are listed 'Best First'.
Thinking out loud (was: Re^2: The Eternal "")
by Voronich (Hermit) on Aug 29, 2011 at 14:00 UTC

    There was a time when I would have set out to do exactly that. But the problem I find with that approach is that it very quickly races past the point of diminishing returns, eventually becoming like J2EE where you have a two line program and 50k of 'configuration'. But it does force me to think about what I do and don't want to abstract away.

    I definitely want to abstract the simple "parse to records" logic into something that I can iterate across because the semantics are there and clean in perl already.

    But the iteration driver itself can be tricky and I'm not sure it shouldn't just have a few different permutations:

    • Iterate across record set, applying 'function'
    • Iterate across record set A, applying each record to something in record set B via a provided function (i.e. if A is a lookup list for B)
    • Iterate across both sets, doing something to both.

    Nah, see that's already messy as hell. Well ok. Maybe there's a reasonable way to make that work; or perhaps it's just not as bad as the description. (Actually I suspect that they're all degenerate cases of the same construct.) I'll have to see once I get everything else out of the way.
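If they really are degenerate cases of the same construct, a sketch might look like this: one driver that walks set A and optionally consults set B via a lookup key. All the names here (`iterate`, `setA`, `setB`) are invented to test the idea:

```perl
use strict;
use warnings;

# One iteration driver for all the permutations above: walk @$set_a and
# hand each record (plus an optional lookup into %$set_b) to the callback.
sub iterate {
    my ( %args ) = @_;
    my ( $set_a, $set_b, $key, $func ) = @args{ 'setA', 'setB', 'key', 'func' };
    for my $rec ( @$set_a ) {
        my $other = ( $set_b && $key ) ? $set_b->{ $rec->{$key} } : undef;
        $func->( $rec, $other );
    }
}

# Case 1: plain iteration over one set, applying a function.
my @a = ( { 'id' => 1, 'n' => 10 }, { 'id' => 2, 'n' => 20 } );
my $total = 0;
iterate( 'setA' => \@a, 'func' => sub { $total += $_[0]{'n'} } );

# Case 2: A used as a lookup list against B.
my %b = ( 1 => 'one', 2 => 'two' );
my @names;
iterate( 'setA' => \@a, 'setB' => \%b, 'key' => 'id',
         'func' => sub { push @names, $_[1] } );

print "$total @names\n";   # 30 one two
```

Case 3 (doing something to both sets) falls out of the same driver, since the callback receives both records and can mutate either.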

    Having thought about this for some time now I'm pretty sure I want the actual "filtering function" to be a plain perl function that takes a pair of records. Perl actually does code well and there's no reason for me to abstract everything SO far that I end up having to write a programming language.

    Because of that, I'm fighting even with the idea of whether or not a record parsing construct should provide column/datatype information. What good would it REALLY do? The filter function is likely to be very specific to the task at hand, so documentation (in the form of well-named variables, etc) can very effectively be contained therein and operations on the data would require that I re-interpret it. That also sounds a lot like writing a programming language (which I'm sure is fun, but I haven't come up with a good reason to do yet.)

    I think I may have thought my way as far as I'm going to think without writing more code.


      Hi. I'm glad to see it's not just me that's been thinking along these lines :)

      Not that I've come to any real conclusions, but I think that it's all just set theory, and the way to describe it may be in those terms, i.e. sets, unions, intersections, and complements.

      If all the records fit in memory in a hash it's reasonably easy to describe a set that passes some test function

      my @set_1 = grep { func($_) } keys %records_1;

      then you can describe the relationship you're looking for in those terms.

      So you might end up with something like this:-

      set1 = set of all records_1 that pass func()
      set2 = set of all records_2 that fail func()
      results = intersection of set1 and set2
      set3 = ... etc
      and so you can describe any arbitrary combination of sets.

      We will need some support functions, but that's 'just a simple matter of programming' ;)
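Those support functions are indeed simple if the sets fit in hashes. A sketch of union/intersection/complement over key lists (the helper names are invented; any serious version would want CPAN's Set::Scalar or similar):

```perl
use strict;
use warnings;

# Set operations over lists of keys, using hashes as the underlying sets.
sub union {
    my %seen;
    @seen{ @{ $_[0] }, @{ $_[1] } } = ();
    return keys %seen;
}
sub intersection {
    my %seen;
    @seen{ @{ $_[0] } } = ();
    return grep { exists $seen{$_} } @{ $_[1] };
}
sub complement {    # elements of the second set not in the first
    my %seen;
    @seen{ @{ $_[0] } } = ();
    return grep { !exists $seen{$_} } @{ $_[1] };
}

my @set1 = qw( a b c );
my @set2 = qw( b c d );
my @both = intersection( \@set1, \@set2 );
print "@both\n";   # b c
```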

      The main problem, as I see it, is how to deal with data sets that are too big to fit into memory. My only thought is to keep a hash of (key, file offset) pairs and re-parse each record on demand. Or maybe Tie::File could do the job?
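A sketch of the (key, file offset) idea: build the index with tell(), then seek() back and re-parse a record on demand. The one-"key:value"-per-line file format here is invented just to make it runnable:

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Write a throwaway data file: one "key:value" record per line.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} "alice:10\nbob:20\ncarol:30\n";
close $fh;

# Index pass: remember only (key => byte offset), never whole records.
my %offset;
open my $in, '<', $path or die "open: $!";
while ( 1 ) {
    my $pos  = tell $in;
    my $line = <$in>;
    last unless defined $line;
    my ($key) = split /:/, $line;
    $offset{$key} = $pos;
}

# On-demand fetch: seek to the stored offset and re-parse one record.
sub fetch {
    my ( $key ) = @_;
    seek $in, $offset{$key}, 0 or die "seek: $!";
    my $line = <$in>;
    chomp $line;
    my ( undef, $value ) = split /:/, $line;
    return $value;
}

print fetch('bob'), "\n";   # 20
```

Tie::File would give you the same random access by line number rather than by key, so you would still need the key-to-line index on top of it.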

      Arrgh! - you're making me want to try coding this again :)


        Well it won't fit in memory, but that's neither here nor there really. Plus I don't need separate pre-qualifiers for source record sets, which is nice.

        Hmm... closing in on an idea. Gonna go code some tests. I'll transmogrify this into a CuFP yet!
