PerlMonks
Re^8: RFC: OtoDB and rolling your own scalable datastore
by arbingersys (Pilgrim)
on Jul 23, 2008 at 21:23 UTC ( [id://699703]=note )
Thanks for your thoughts. I'm a little confused about the 800 queries you mention. Here's what I get in terms of finding check-in/check-out history for a user.

1. User logs in

We have to query 50 servers, definitely something you don't want. I'm beginning to think that if I were really building this application, I would have a standard RDBMS for some data, like logins, and use OtoDB only for whatever high-volume, read-intensive data exists. (A library system is actually a poor example in retrospect; I should have used something that made more sense from a scalable Internet site perspective...)

2. User goes to checkout history page (now that we have his unique ID)

Here's how the user_history table might look:

user_id | check_out_date | check_in_date | book_title | book_id

We query 50 servers for user_id. (If check_in_date is empty, the book is still out.)

Total queries: 100 (50 for the login, 50 for the history page)

Redundant data is stored in the user_history table, i.e. full book info is also stored somewhere else, but it's been optimized for reading and spread amongst the 50 servers.

The real problem I see is that a user will likely have checked out fewer than 50 books, so we're sending queries to servers that aren't going to have any data. (Of course, if this system existed, I doubt they'd start out with 50 data servers, or use anything other than an RDBMS for that matter.)

I know that network traffic, both the number of queries and the overhead for data returned, increases with the number of servers present. But once the user has checked out more than 50 books, it's more and more likely that he has data on every server. So we send 50 query requests over the network to the servers and get back on the order of 2 or fewer records per server, each served over a 100MB switch port, as opposed to one server returning 50 or more records over a single 100MB switch port.

I do like your ideas for hashing, but what you describe above seems more on the order of data sharding.
Some sites have used sharding successfully to handle growth, from what I've read. The problem you have then is rebalancing data as high-volume servers get loaded down. In my example above, however, if I did use a hashing (or similar) scheme to connect book_id to a particular server, then when the user clicks to read the full details, the system would know to go to a single server for its next query.
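
To make the two cases above a bit more concrete, here is a rough sketch in Python. It is not OtoDB code: the function names, the MD5-modulo routing, and the assumption that each history record lands on a uniformly random server are all made up for illustration. It shows how a book_id could be pinned to one server, and estimates how many of the 50 servers would actually hold rows for a given user.

```python
import hashlib

NUM_SERVERS = 50  # server count from the example in this post


def server_for(book_id: str, num_servers: int = NUM_SERVERS) -> int:
    """Map a book_id to one server index with a stable hash.

    Hypothetical routing scheme, not something OtoDB provides.
    """
    digest = hashlib.md5(book_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers


def expected_servers_with_data(records: int, num_servers: int = NUM_SERVERS) -> float:
    """Expected number of servers holding at least one of a user's records.

    Assumes each record is written to a uniformly random server,
    independently of the others (an assumption, not measured behavior).
    """
    return num_servers * (1 - (1 - 1 / num_servers) ** records)


if __name__ == "__main__":
    # A detail lookup goes to exactly one server instead of all 50.
    print("book-123 lives on server", server_for("book-123"))
    # A history lookup still fans out; many servers may return nothing.
    for n in (10, 50, 200):
        print(n, "records ->", round(expected_servers_with_data(n), 1), "servers with data")
```

Under that uniform-placement assumption, a user with 50 checkouts has rows on only about 32 of the 50 servers, so a fair share of the fan-out queries come back empty. Note also that a plain modulo like this remaps most keys whenever the server count changes, which is the rebalancing cost mentioned above.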
A blog among millions.