dwm042 has asked for the wisdom of the Perl Monks concerning the following question:
In DBIx::Class, if I have a table called Towns, and a table called Residents, where Residents are the residents of one of the towns in Towns, then DBIx::Class will create 2 Classes to represent Towns and Residents. If, however, I want to set up a few tables of the Resident kind (Resident_00, _01, etc), to accommodate a few huge towns with a separate table, is there a way to reduce this kind of setup to a two class model in DBIx::Class?
The Towns table would now contain pointers to the specific Residents table to use. I'm very new to DBIx::Class, so I wouldn't be surprised if I'm asking an obvious question.
David.
OT?: Partition the tables? (Re: DBIx::Class with two kinds of otherwise identical table types)
by roboticus (Chancellor) on Sep 15, 2010 at 04:49 UTC
dwm042:
I can't answer your DBIx::Class question, as I've never used it. But I'm wondering why you want to split out your residents tables? It seems to me that it would simply make it harder to do useful queries. You could group your residents by putting an index on the town column for residents if you're trying to speed up access to groups of residents of a particular town.
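The single-table-plus-index idea can be sketched roughly as follows. This is only an illustration, assuming Python with the standard-library sqlite3 module so that it runs on its own. The thread does not say which database is in use, and the table, column, and index names here are hypothetical.

```python
import sqlite3


def build_db():
    """Create a towns table and one residents table indexed by town."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE towns (
            town_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL
        );
        CREATE TABLE residents (
            resident_id INTEGER PRIMARY KEY,
            town_id     INTEGER NOT NULL REFERENCES towns(town_id),
            name        TEXT NOT NULL
        );
        -- The index groups residents by town for lookups.
        CREATE INDEX residents_town_idx ON residents (town_id);
    """)
    conn.executemany("INSERT INTO towns VALUES (?, ?)",
                     [(1, "Smallville"), (2, "Metropolis")])
    conn.executemany("INSERT INTO residents (town_id, name) VALUES (?, ?)",
                     [(1, "Ann"), (2, "Bob"), (2, "Cy")])
    return conn


def residents_of(conn, town_id):
    """Return the names of all residents of one town, sorted."""
    rows = conn.execute(
        "SELECT name FROM residents WHERE town_id = ? ORDER BY name",
        (town_id,))
    return [r[0] for r in rows]


def query_plan(conn, town_id):
    """Return SQLite's description of how it would run the lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT name FROM residents WHERE town_id = ?",
        (town_id,))
    return " ".join(str(r[-1]) for r in rows)
```

Calling query_plan() shows whether the lookup goes through the index rather than scanning the whole table, which is the point of indexing the town column.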
If you're using a database that permits the use of partitioned tables, then you might consider partitioning your residents table. That way, typical queries could use the residents table and see all residents for all towns. But you could access the individual partitions of the table, if need be. In MS SQL Server, you have to build a partitioning function (to tell it which table a particular resident goes to) and it has other criteria (key monotonicity, etc.). I think you could do it with a key of (Town,ResidentID) or some such. I've not built a partitioned table in a while, so I'd have to hit the books again to be sure. But seeing as I'm going to bed (it's nearly 1:00AM here), I think I'll just cough up a couple links Google gave me, and you can review 'em to see if it looks interesting:
Partitioned Tables and Indexes in SQL Server 2005 and SQL SERVER – 2005 – Database Table Partitioning Tutorial.
...roboticus
Roboticus,
Right now partitioning is a hypothetical, a way to handle a few huge towns in a mass of small towns. But since you've given me a name for the method, I can do searches, and I found this URL, which, though elusive, gives me a couple of new pointers in Yann's comment.
In other cases, a Dr. Chris Cole was using partitioned tables, because of the sheer size of his data sets (~90M rows/month). Link is here.
Thanks.
David.
dwm042:
The reason I used partitioned tables is because I worked at a financial institution that deals with a *huge* number of transactions each day. We have to keep different levels of transaction information for different amounts of time. So we used partitioning to help manage the volume of data. A brief description follows, to give an illustration of how and why to use partitioned tables:
Requirements:
- We need complete transaction details for 35 days. (The actual requirement was 30 days; we kept five additional days to simplify the monthly summary and leave a bit of elbow room for error recovery.)
- We need transaction summaries for 18 months
- Database needs to be online 24/7
- We had several indexes on the data that take a good amount of time to rebuild
- ...others not mentioned here...
Because of these requirements, we had two tables TxnDtls and TxnSumHist for the details and summaries. We partitioned both tables based on the date. For TxnDtls, we used the day (YYYYMMDD), and for TxnSumHist, we used the month (YYYYMM). (In the remainder of the post, think of YYYYMM and YYYYMMDD as stand-ins for the actual dates.) Our process consisted of roughly:
- Create table TxnDtls_YYYYMMDD
    select top 0 * into TxnDtls_YYYYMMDD from TxnDtls
- Bulk load the transaction details into the table (using BCP)
- Build the indexes.
- Update the partition function to eliminate the 35-day-old TxnDtls_YYYYMMDD table and add the new table.
- If it's the first day of the month, then:
  - Create the new TxnSumHist_YYYYMM table, summarizing the data
      select -- Key fields
             D.Merchant_ID, 'YYYYMM' as Billing_Period, D.TxnType, ...
             -- Statistics fields
             sum(D.Amount) as TxnTotal, count(*) as TxnCount,
             ...
        into TxnSumHist_YYYYMM
        from TxnDtls D
       where D.TxnDate between ... and ...
       group by D.Merchant_ID, D.TxnType, ...
  - Build the indexes
  - Update the partition function to eliminate the 19-month-old TxnSumHist_YYYYMM table and add the new one.
  - Drop the old TxnSumHist_YYYYMM table
- Drop the old TxnDtls_YYYYMMDD table
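The daily part of the process above can be sketched as follows. This is a rough illustration only, assuming Python with the standard-library sqlite3 module. SQLite has no partition functions, so a UNION ALL view named TxnDtls stands in for the partitioned table; that is a swapped-in technique, not how SQL Server does it. The column list and the helper names are hypothetical; only the table naming scheme and the 35-day retention come from the post.

```python
import sqlite3
from datetime import date, timedelta

RETENTION_DAYS = 35  # from the post: keep 35 days of transaction details


def day_table(day):
    """Name of the per-day detail table, e.g. TxnDtls_20100915."""
    return f"TxnDtls_{day:%Y%m%d}"


def current_day_tables(conn):
    """List the per-day tables that currently exist, oldest first."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE 'TxnDtls!_%' ESCAPE '!' "
        "ORDER BY name")
    return [r[0] for r in rows]


def load_day(conn, day, txns):
    """Load one day of details, then retire tables past the retention window."""
    conn.execute("DROP VIEW IF EXISTS TxnDtls")
    table = day_table(day)
    # Hypothetical columns; the real TxnDtls layout is not given in the post.
    conn.execute(
        f"CREATE TABLE {table} (Merchant_ID INTEGER, TxnType TEXT, Amount REAL)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", txns)
    # Build the index after the bulk load, as in the process described.
    conn.execute(f"CREATE INDEX {table}_merchant ON {table} (Merchant_ID)")
    # Fixed-width YYYYMMDD suffixes make string order equal date order.
    cutoff = day_table(day - timedelta(days=RETENTION_DAYS))
    for name in current_day_tables(conn):
        if name <= cutoff:
            conn.execute(f"DROP TABLE {name}")
    # Stand-in for updating the partition function: rebuild the view.
    parts = [f"SELECT '{name[-8:]}' AS TxnDate, * FROM {name}"
             for name in current_day_tables(conn)]
    conn.execute("CREATE VIEW TxnDtls AS " + " UNION ALL ".join(parts))
```

Queries then go against TxnDtls and never need to know which per-day tables exist at the moment.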
By carefully distributing the table and index partitions among your storage systems, you can get surprisingly good performance. (Assuming you have the I/O capacity and enough storage devices to distribute the load to. Towards the end of the project, we used a couple fiber optic cards to connect to a massive storage system that distributed 1.6TB of data over numerous 20GB drives. The performance was stunning!)
If anyone has any specific questions about the system, just ask, and I'll answer as best as I can. But the system was decommissioned about six months ago, and I work at a different company now, so some of the finer details are still leaking away from my memory... ;^)
...roboticus
Re: DBIx::Class with two kinds of otherwise identical table types
by CountZero (Bishop) on Sep 15, 2010 at 06:18 UTC
Re: DBIx::Class with two kinds of otherwise identical table types
by Marshall (Canon) on Sep 15, 2010 at 13:17 UTC
I haven't used DBIx::Class. But it appears that you have a clear set of Towns and a clear set of Residents. This sounds to me like a "has a" relationship, not inheritance ("is a"). A town has a resident; a resident can only belong to one town. A resident is not a town, i.e. resident is not a sub-class of town. The White House and airplanes both have wings, but they are not sub-classes of Wing.
I'm not a relational DB guy by any means, but each resident in the resident table would have a "pointer" to the town that this resident belongs to. There are various DB words to describe this, but that is what it is.
I got lost with this part: "tables of the Resident kind (Resident_00, _01, etc), to accommodate a few huge towns". There will be just 2 tables as you described: one that describes all the towns, and one that describes all the residents. Each resident has a pointer to a town at a minimum. It is also possible for performance reasons to have the DB update a list of pointers to residents for each town.
The DB can generate tables like "give me all residents in town X". It is also possible to "flatten the DB" by putting all the data into a single table. I mean, the Town table could have stuff in it like Latitude, Longitude, # of restaurants, etc. The resident table just has a pointer to that info. You can generate a table with all the info. That is sometimes done for read performance reasons. But if, say, the # of restaurants changes, I've got a problem, as lots of fields have to be updated.
I guess in short, a Class representation may not be what you want? A few huge towns would normally mean to me that the resident table has a lot of duplicate pointers to those huge towns. roboticus knows a lot about such situations and I will defer to him as to partitioning, etc.
Update: I am working on an SQL project now and I recommend Learning SQL by Alan Beaulieu. The concepts map very directly onto the Perl DBI.
Marshall,
My first job as a professional in IT involved writing C++ wrappers for embedded SQL, so handling SQL statements isn't an issue for me. I'm certain I can craft decent enough SQL to do the job.
But like most Perl folks, I'm interested in the new technology and so am designing a Catalyst app in my mind as I work through the Catalyst tutorial (recommended, btw, as a great intro to TT, DBIx::Class, Moose, and Catalyst).
So the question, in abstract terms, is at what point do you need to abandon the two table representation of data and go to a partitioned table implementation? Understand, from my POV, anyone who has ever thought of a file system layout understands where a partitioned table set is headed. You have a table with a column of pointers. The pointers point to the table to use. You would only do this when performance of a single table representation becomes an issue.
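The "column of pointers" layout described here can be sketched as below. It is a minimal illustration, assuming Python with the standard-library sqlite3 module; the residents_table column, the sample rows, and the function names are hypothetical, and it says nothing about how DBIx::Class itself would map such tables.

```python
import sqlite3


def build_partitioned_db():
    """Towns carry the name of the residents table that holds their people."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Towns (
            town_id         INTEGER PRIMARY KEY,
            name            TEXT NOT NULL,
            residents_table TEXT NOT NULL   -- the "pointer" column
        );
        CREATE TABLE Residents    (resident_id INTEGER PRIMARY KEY,
                                   town_id INTEGER, name TEXT);
        CREATE TABLE Residents_00 (resident_id INTEGER PRIMARY KEY,
                                   town_id INTEGER, name TEXT);
        INSERT INTO Towns VALUES (1, 'Smallville', 'Residents');
        INSERT INTO Towns VALUES (2, 'Metropolis', 'Residents_00');
        INSERT INTO Residents    (town_id, name) VALUES (1, 'Ann');
        INSERT INTO Residents_00 (town_id, name) VALUES (2, 'Bob');
        INSERT INTO Residents_00 (town_id, name) VALUES (2, 'Cy');
    """)
    return conn


def residents_table_for(conn, town_id):
    """Follow the pointer from a town to its residents table."""
    row = conn.execute(
        "SELECT residents_table FROM Towns WHERE town_id = ?",
        (town_id,)).fetchone()
    if row is None:
        raise KeyError(f"no such town: {town_id}")
    table = row[0]
    # Table names cannot be bound as parameters, so check the name
    # against the tables that really exist before interpolating it.
    known = {r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown residents table: {table}")
    return table


def residents_in_town(conn, town_id):
    """Return resident names for a town, whichever table they live in."""
    table = residents_table_for(conn, town_id)
    rows = conn.execute(
        f'SELECT name FROM "{table}" WHERE town_id = ? ORDER BY name',
        (town_id,))
    return [r[0] for r in rows]
```

The caller only ever asks for the residents of a town; the extra lookup and the table-name check are the price of routing by hand in the application.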
I've had and maintained databases with a single huge table. It's no fun when your data load takes several hours and the database is flaky. So I'm thinking about ways around that.
So knowing whether DBIx::Class supports partitioning is useful. It is also useful, when considering ORMs, to know that the TypePad people at Six Apart ran into this issue and created Data::ObjectDriver specifically to deal with partitioning.
David.
dwm042:
Just a couple minor clarifications on partitioned tables:
First, as far as the application is concerned, there's a single table. While it *could* access the subordinate tables individually, it normally wouldn't. When you submit your query, the database server has the task of converting your query against the main table into queries against the subordinate tables: so your application doesn't get more complicated--only the database management does.
Secondly, MS SQL Server doesn't keep a table of pointers to the other tables; instead, there's a function that returns the table. I doubt that other database servers use a table of pointers, either, as that would be another table and index to maintain.
Another advantage of partitioned tables is that a single query on your table can break into a query per subordinate table, and those can be run in parallel. So many queries run faster than they would against a non-partitioned table.
...roboticus