Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?

OT: Data Warehousing Strategies

by Booger (Pilgrim)
on Aug 27, 2006 at 10:25 UTC ( #569873=perlquestion: print w/replies, xml ) Need Help??
Booger has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks:

The company I work for is considering undergoing a fairly aggressive redesign of one of our product lines and I have come upon a design issue that I would appreciate some input on.

The product is something that we host and provide to multiple clients. Most of the functionality for each client is similar if not identical and up until now our databases have been organized such that each client has its own set of tables to store application data in (the columns in one set of tables are identical to columns in another set of tables - we've just duplicated the tables for each client).

To be honest I'm not a DBA but I've been around a number of large-scale databases in my time and I think this design is broken and it makes administering the database a real pain since you have to use an automated script to modify each and every table (there are a lot of them) when you want to extend or make a change to the way the application works. This design also causes some fairly ugly code because we have to take into account a clients "table" rather than just a client_id in a table shared by all clients.

One of the challenges in our redesign is reconciling all of this data into a unified set of tables (such that we no longer have tables like "client123_contacts" or "client123_products" but just "contacts" and "products"). I think this will go a long way to giving us a more compact and consistent database.

Here's the bit that I'm scratching my head about...

One of the stated goals of the redesign is to create a system whereby clients can "extend" the database to store bits and pieces of data that are specific to their situation (we will provide a specific set of generic extensions). For example, we might provide a table called "produce" that stores various bits of data about the produce you offer at your store. If you are the client you might want to extend our produce offering buy storing off "produce attributes" that we don't provide for in our default offering (such as sale prices specific to different kinds of produce - or whatever). You would do this by assigning the sale price to an available column (in this case a double) in the extensions table.

The trouble I'm having is that each client will want to extend their data set differently. Some will want to use the extension columns we provide for one set of data while another will want to use the extension columns for a different set of data. The relationship between the default "produce" table we provide and the extensions is always a 1-to-1 but the actual data in the extensions table will differ between clients.

There is a business layer above the data layer that allows us to define a how the data is presented within the application, untainted, ordered on the screen and so forth (this same business layer is what allows clients to extend their default data set offering - keep in mind that they do not actually change the schema but can select from a default set of columns to store their client-specific data in).

What I'm not sure about and would like some input on is whether I should create a unified set of tables in one database (i.e., all clients use the same set of tables within each database - this means that each client uses the same set of extension tables but would store different data for different purposes within those extension tables and thereby possibly violating 2NF but we try to design the default tables such that this isn't a usually problem) OR if I should create set of tables in client-specific databases (whereby each client will have their own database, the schemas in each database being a duplicate of the others). The latter option might not get me very far because I wouldn't really be gaining anything in terms of ease-of-administration.

I may not be giving you enough information about our problem domain, please let me know if there is something I'm leaving out.

So, what would you do in this sort of situation? (Besides run in circles, scream and shout).

There is one more thing that may further complicate the issue: some of our clients are related in a hierarchal fashion and share responsibility over their data with collective data filtering "upwards" in a predefined hierarchy (e.g., if you look at a chain of stores you can see an overall list of products and quantity of products vs. those products only available at a single store - it depends on your perspective).

Thanks for your input guys!

Replies are listed 'Best First'.
Re: OT: Data Warehousing Strategies
by roboticus (Chancellor) on Aug 27, 2006 at 13:56 UTC
    I don't know what database you're using, so this suggestion may not be useful to you. As I see it, mixing your clients data together into a unified set of tables might be a big can of worms. How about giving each client their own database with identical table structures? That way, you can partition your data with the login process. You can use the same code without having to put in a customer (pre|suf)fix on your table names.

    If you have some tables with the same data for all customers, you can split that out into another database and access them directly via qualified references or database linking (depending on which database you use).

    Regarding the custom data elements, I'd put those in appendix tables so you can keep the columns used by all customers separate from the columns used by some other customers. Something like:

    create table customers ( customer_id int identity primary key, name varchar(64) ) create table customer_extras ( customer_id int primery key foreign key customers(customer_id), shoe_size int )

    One advantage of the multiple databases strategy is you can easily move databases to multiple servers should the need arise for particular security and/or performance needs.


      We're using MySQL at the moment. I would personally prefer using PostgreSQL but the other developers aren't familiar with it and depend too highly on phpMyAdmin.

      The new version of the application will run on MySQL 5 so I'll have all of its features at my disposal.

      I'm a little hesitant to split into several databases (although you provide good reasons for doing so) but only because I don't know what the ramifications are when it comes to doing a select across multiple databases (something I've very rarely done). We would have to do this because we have to combine customer data quite frequently when presenting it to the end user.

      Thanks for your input!

        I'm not terribly familiar with MySQL. Most of my experience is with Sybase and MS SQL Server (which are very closely related). In these, you simply prefix the table name with the database name and owner to access a table in a different database. (Assuming, of course, that the login has permissions on both databases.)

        Suppose for example that you have two databases, common and cust1. The common table has table states which contains the full name of each state (full_name), keyed by its abbreviation (ST). The cust1 database has table customers which has the customer number (Cust_ID) and state (ST). If you're logged into cust1 and want to count all the customers by state and list the state name, you'd do something like this:

        select common.dbo.states.full_name, count(customers.*) from customers join common.dbo.states on customers.ST = common.dbo.states.ST group by common.dbo.states.full_name order by common.dbo.states.full_name

        So it's a bit verbose. But there's another trick you can use in these two databases: You can hide the table in the remote database behind a view so it looks just like a table in the current database. Something like this:

        create view states as select ST, full_name from common.dbo.states
        Now you can perform the previous operation more naturally:

        select states.full_name, count(customers.*) from customers join states on customers.ST = states.ST group by states.full_name order by states.full_name

        As I said earlier, I'm not very familiar with MySQL (not having used it since about v3.1), but I'd imagine that it offers similar capabilities. Hopefully a knowledgeable MySQL or PostgreSQL user will chime in with similar operations, or improved suggestions.


Re: OT: Data Warehousing Strategies
by aufflick (Deacon) on Aug 28, 2006 at 08:48 UTC

      These blessed table names are the result of a developer who was in waaaaaay over his head.

      I think it ironic that the design bears so much resemblance to an "enterprise" system.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://569873]
Approved by tye
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (5)
As of 2018-02-21 19:50 GMT
Find Nodes?
    Voting Booth?
    When it is dark outside I am happiest to see ...

    Results (287 votes). Check out past polls.