PerlMonks  

OT: Scalable web application architecture

by badaiaqrandista (Pilgrim)
on Dec 07, 2005 at 07:31 UTC ( #514748=perlquestion )
badaiaqrandista has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I am a Perl programmer maintaining the webapp product of a startup company. It is an online reservation system. Currently, our product is going through a transition phase. It started as a bunch of cgi scripts and an ad-hoc database schema. Now it is projected to become a mature web application with a well-designed module library and demanding requirements, like robustness, reliability, and availability.

As the programmer responsible for its architectural design, I have rewritten as much as 90% of its code, resulting in around 300 classes. The code is now fairly maintainable.

Now I have a problem with scale. I didn't think about scale when designing it, and my class structure made the search functionality run really slow. Search functionality is the heart of my application, so it must be quick. To fix this I made a lookup table to cache the intermediate results of the search, so that only the first search runs slow and the following ones run faster. I also broke some of the abstractions to get better performance from the class library. This solution worked.

However, nothing ever runs according to plan. Apparently, when it gets too many requests, some SELECT queries start timing out. Also, UPDATE and INSERT queries lock the SELECT queries out of the cache table, making search performance much worse.

I'd like to hear from other monks how a medium or big web application is (or should be) designed and deployed. I'd also like to hear if anyone can help me solve this scalability/speed problem. I use MySQL 4.1 on Debian sarge.

English is not my native language, so I apologise for any grammatical errors. If you need me to explain more about my problem, just let me know. Thanks.

badaiaqrandista

Update: Thanks for all the replies, guys. I guess I definitely have to modify the database structure to be more suitable for data retrieval and speed. I also have to split the 'search cache' table to hold the different types of data, static and dynamic. Furthermore, I'll have to optimize as much as possible.

Re: OT: Scalable web application architecture
by ask (Pilgrim) on Dec 07, 2005 at 09:22 UTC
    It sounds like you either need a substantial rethink of how your application works and does its work or you need some database optimization foo.

    The former is hard to guess at without more information on the system, but the latter is easy enough to take a few shots at.

    • Are you using MyISAM tables? Consider switching the tables that are getting "congested" to InnoDB.
    • Is it the caching table that's getting congested with reads and writes? Maybe you can do your caching more effectively? (Maybe use memcached?)
    • Maybe you can partition your data? Have "cat reservations" on one database server and "dog reservations" on another.
    • Also be sure to do all the "usual" MySQL optimization tricks. (Is it using the proper amount of memory? Do you have proper indexes on all tables? Enable the slow query log and check with EXPLAIN SELECT ... that all your queries are using indexes when possible. If it's not possible, rethink your table schema.)
    You might find the slides from my ApacheCon/MySQL UC/OSCON talk useful.
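    A small helper can make the EXPLAIN check routine instead of ad-hoc. This is only a sketch: it takes EXPLAIN rows in the hashref shape DBI's fetchall_arrayref({}) would return, and the table and index names in the example are made up.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Given rows from "EXPLAIN SELECT ...", return every table that
    # MySQL scans in full (a 'type' of 'ALL' means no usable index).
    sub full_scans {
        my (@explain_rows) = @_;
        return grep { ($_->{type} || '') eq 'ALL' } @explain_rows;
    }

    # Hypothetical EXPLAIN output for a two-table join:
    my @rows = (
        { table => 'search_cache', type => 'ref', key => 'idx_date' },
        { table => 'room',         type => 'ALL', key => undef      },
    );

    print "full scan on: $_->{table}\n" for full_scans(@rows);
    ```

    Run this against the queries the slow query log flags; any table it reports is a candidate for a new index or a schema rethink.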

    - ask (available for short contracts working on that sort of problems)

      # Are you using MyISAM tables? Consider switching the tables that are getting "congested" to InnoDB.

      Yes, I am. I did switch to InnoDB, but I got more "Can't connect" errors than with MyISAM tables, so I rolled it back to MyISAM. I didn't know much about InnoDB back then, so I probably didn't set its options correctly. The "Can't connect" errors happen when multiple properties try to fill up the cache (see next answer).

      # Is it the caching table that's getting congested with reads and writes? Maybe you can do your caching more effectively? (Maybe use memcached?)

      Yes, it is. The cache is basically just a place to hold intermediate search results for a given date, so that future searches on that date don't have to do all the computation to decide whether a room-package combination is available on that date.

      Following is a simplified list of the cache table fields:

      field name                  type
      ----------                  ----
      property_id                 INT
      date                        DATE
      room_id                     INT
      package_id                  INT
      valid_price                 BOOLEAN
      available_room_count        INT
      room_capacity               INT
      package_min_guests          INT
      package_max_guests          INT
      available_to_travel_agent   BOOLEAN

      The search only needs to execute one SELECT query on this table if its contents reflect the actual data. But when the actual data is modified, updating this table takes a long time, because the UPDATE query competes with a mass of SELECT queries. It gets worse when the update must be done for each date.

      Therefore, a simple cache mechanism like memcached is not suitable for this. I do use it to cache other fairly static data, though.
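      For the fairly static data, a get-or-compute wrapper keeps the cache lookups in one place. A minimal sketch (a plain hash stands in for the memcached client here; Cache::Memcached's get/set calls have the same shape, and the key name is made up):

      ```perl
      use strict;
      use warnings;

      my %cache;    # stand-in for a Cache::Memcached client

      # Look the value up in the cache; on a miss, compute it once
      # and store it for later requests.
      sub get_or_compute {
          my ($key, $compute) = @_;
          return $cache{$key} if exists $cache{$key};
          return $cache{$key} = $compute->();
      }

      # Hypothetical example: room capacity rarely changes, so it is
      # a good candidate for caching outside the database.
      my $capacity = get_or_compute(
          "room_capacity:42",
          sub { 4 },    # in real code: a SELECT against the room table
      );
      ```

      The same shape works whether the backing store is a process-local hash, memcached, or anything else with get/set semantics.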

      # Maybe you can partition your data? Have "cat reservations" on one database server and "dog reservations" on another.

      Because I am the only programmer and the only DBA, I hesitate to do this. Even now, I have difficulty maintaining the large code base I've written over the past 6 months.

      # Also be sure to do all the "usual" MySQL optimization tricks. (Is it using the proper amount of memory? Do you have proper indexes on all tables? Enable the slow query log and check with EXPLAIN SELECT ... that all your queries are using indexes when possible. If it's not possible, rethink your table schema.)

      That's why I am asking on PerlMonks. I want to know what the optimal database schema should look like. Currently I have one database for each property, so we have more than 400 databases with the same schema. I know it could be better. As for optimization, I have done everything I could from the Optimization chapter of the MySQL documentation.

        I have some experience with an online reservation system.

        You say that you have 1 database for each property, 400 in total with the same schema. It sounds like you should have all the properties in a single database. You can then link all your pricing, seasons, availability etc. to each property using your unique property id.

        My guess is that you are getting slowed down by all the database connects required for each query. Plus you will be repeating the same query on each database (400 times, ouch).


        mr greywolf
Re: OT: Scalable web application architecture
by pajout (Curate) on Dec 07, 2005 at 09:22 UTC
    You should describe more details of your application to get reliable answers, imho... Is it CGI or pure mod_perl? Apache 1 or 2? How many rows are in the db tables, and how complex are the queries? How many requests per second? How much memory and CPU? What response time is acceptable to you?
    I can recommend using a profiler to observe what the bottleneck really is. Perhaps it is not the SQL engine...
      You should describe more details of your application to get reliable answers, imho... Is it CGI or pure mod_perl? Apache 1 or 2?

      It is a web application running on mod_perl 1 with Apache 1.3. But the classes that implement the business objects basically don't care whether they run under mod_perl or from the command line. They only need a database with a certain schema.

      How many rows are in the db tables, and how complex are the queries?

      There could be more than a million records in the cache table. The operations on the table basically look like this (in Perl-like pseudocode):

      sub search {
          my ($arrival_date, $departure_date, $number_of_guests) = @_;
          validate_search_cache($arrival_date, $departure_date);
          my @search_results = ... do SELECT query to get a list of available room-package combinations between $arrival_date and $departure_date for $number_of_guests ...;
          return @search_results;
      }

      sub validate_search_cache {
          my ($arrival_date, $departure_date) = @_;
          my @validated_dates = ... do SELECT query that returns a list of dates between $arrival_date and $departure_date having valid combinations of room-package ...;
          my @invalid_dates = ... look for dates between $arrival_date and $departure_date that don't exist in @validated_dates ...;
          foreach (@invalid_dates) {
              initialize_search_cache($_);
          }
      }

      my @field_list = (
          {
              name => 'valid_price',
              op   => sub { ... computation to check if the price is valid on a certain date ... },
          },
          {
              name => 'available_room_count',
              op   => sub { ... calculation to get the available room count on a certain date ... },
          },
          ...
      );

      sub initialize_search_cache {
          my ($date) = @_;
          my @keys;
          my @values;
          foreach (@field_list) {
              push @keys,   $_->{name};
              push @values, $_->{op}->($date);
          }
          my $sth = $dbh->prepare("INSERT INTO search_cache (" . (join ',', @keys) . ") VALUES (" . (join ',', map "?", @values) . ")");
          $sth->execute(@values);
      }

      I hope that explains how the code works. The queries are mostly simple, except when doing the search for availability.
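      One way to soften the INSERT pressure described above is to initialize a whole date range in one multi-row INSERT instead of one statement per date, so the writers contend for the table once rather than once per date. A sketch of building such a statement (the table and column names are made up for illustration):

      ```perl
      use strict;
      use warnings;

      # Build a single multi-row INSERT for a list of rows, each row a
      # hashref with the same keys. Returns the SQL plus its bind values.
      sub build_bulk_insert {
          my ($table, @rows) = @_;
          my @cols  = sort keys %{ $rows[0] };
          my $tuple = '(' . join(',', ('?') x @cols) . ')';
          my $sql   = "INSERT INTO $table (" . join(',', @cols) . ') VALUES '
                    . join(',', ($tuple) x @rows);
          my @binds = map { my $r = $_; map { $r->{$_} } @cols } @rows;
          return ($sql, \@binds);
      }

      # Hypothetical cache rows for two dates:
      my ($bulk_sql, $bulk_binds) = build_bulk_insert(
          'search_cache',
          { date => '2005-12-24', room_id => 1, valid_price => 1 },
          { date => '2005-12-25', room_id => 1, valid_price => 0 },
      );
      # then: $dbh->do($bulk_sql, undef, @$bulk_binds);
      ```

      Multi-row VALUES lists are supported by MySQL 4.1, so no upgrade is needed for this.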

      badaiaqrandista

        Your caching via a database table has a serious problem: it is simultaneously queried via SELECTs, updated via UPDATEs, and refreshed via INSERT INTO. These operations have very different indexing needs to perform well, and they are practically impossible to meet simultaneously. Not to speak of the locking issues.

        A better architectural idea would be to cache some data in a Perl structure, update it there, and push it back into the database after processing is done. This solution requires some amount of bookkeeping, particularly if it is a distributed system.

        If it is heavily distributed (several geographical locations), think about replication.

        Please check the indexing of the underlying database. Some millions of rows are easily processed if the indexing is right.

        How about the data architecture? Have you separated the static data from the dynamic data? The property, package and room data seem to be static, while the reservation data is dynamic; separate these into different tables.

Re: OT: Scalable web application architecture
by tphyahoo (Vicar) on Dec 07, 2005 at 09:59 UTC
    ...my class structure made the search functionality run really slow.

    Seems like this happens a lot. One common speed/maintainability scenario is when programmers use an ORM (Object-Relational Mapper) to make DBI code more maintainable, either by building their own (frowned on) or using Class::DBI or DBIx::Class or friends (less frowned on, but still a performance penalty). Did something like this happen in your case? If so, see OO concepts and relational databases ... Thoughts about Class::DBI and MySQL performance... and super searching will probably turn up more.

    What other factors do you reckon could be causing your "more maintainable" classes to suffer? Do you use complicated inheritance? Perhaps then you could recode to use composition instead, which legend has it is often a win. (I've found this to be so from personal experience.)

    In any case, perhaps you could just expand a bit on what I quoted. You already talk about this quite a bit in your post, and if you are convinced that this is really the full story, fair enough. But perhaps something else occurs to you.

      I did write my own DBI wrapper, which looks like DBIx::Class, simply because I don't like Class::DBI and I didn't know about DBIx::Class.

      What other factors do you reckon could be causing your "more maintainable" classes to suffer? Do you use complicated inheritance? Perhaps then you could recode to use composition instead, which legend has it is often a win. (I've found this to be so from personal experience.)

      I don't use complicated inheritance. I think it is not the class structure that makes it suffer, but the way I structure the classes. What I mean is: I want each class to represent one business entity, which usually maps to one or two related tables. When a class wants to access a table outside its responsibility, it should not use that table in a query directly, but should call a method of the class responsible for that table. Sometimes I break this rule to gain performance, but I don't do it liberally because it degrades maintainability. Is my approach right? Am I just too paranoid about the code getting messy?

      I hope that answers your question. Thanks for your reply.

      badaiaqrandista
Re: OT: Scalable web application architecture
by TedPride (Priest) on Dec 07, 2005 at 09:59 UTC
    Unless searches really have to be real time, you might want to look into creating a second copy of your table just for searching, one that remains optimized and only contains searchable records (records flagged for deletion wouldn't be copied over). This secondary table would only get read requests, so it wouldn't have to wait on locks, and you could update it periodically (daily, perhaps).

    Also, as mentioned above, the most important thing for optimizing searches is indexes. If you aren't using them, start doing so; and if you decide to stick with a single copy of your table(s), use OPTIMIZE TABLE often on tables that get a lot of inserts and updates, especially if they have TEXT or BLOB columns.
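    A common way to refresh such a read-only search copy without blocking readers is to rebuild it under a temporary name and swap it in atomically with RENAME TABLE, so readers never see a half-built table. A sketch of the statements involved (the table names and the deleted-flag column are made up):

    ```perl
    use strict;
    use warnings;

    # Statements to rebuild a read-only search copy offline and then
    # swap it into place in one atomic RENAME, so SELECTs against the
    # live name are never blocked by the rebuild.
    sub refresh_statements {
        my ($live, $source) = @_;
        return (
            "DROP TABLE IF EXISTS ${live}_new",
            "CREATE TABLE ${live}_new LIKE $live",
            "INSERT INTO ${live}_new SELECT * FROM $source WHERE deleted = 0",
            "RENAME TABLE $live TO ${live}_old, ${live}_new TO $live",
            "DROP TABLE ${live}_old",
        );
    }

    # In real code each statement would go through $dbh->do(...):
    my @sql = refresh_statements('search_copy', 'search_cache');
    ```

    CREATE TABLE ... LIKE and the multi-table RENAME are both available in MySQL 4.1.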

      The search has to be real time, so I need to be able to read and write efficiently on the same table. But I don't know how to create a schema that can be efficiently read and written. I know I need a storage engine that supports it, like InnoDB, but what is a good InnoDB configuration, and what should the table schema look like?

Re: OT: Scalable web application architecture
by jhourcle (Prior) on Dec 07, 2005 at 10:01 UTC

    Now I have a problem with scale. I didn't think about scale when designing it, and my class structure made the search functionality run really slow. Search functionality is the heart of my application, so it must be quick. To fix this I made a lookup table to cache the intermediate results of the search, so that only the first search runs slow and the following ones run faster. I also broke some of the abstractions to get better performance from the class library. This solution worked.

    However, nothing ever runs according to plan. Apparently, when it gets too many requests, some SELECT queries start timing out. Also, UPDATE and INSERT queries lock the SELECT queries out of the cache table, making search performance much worse.

    I don't see the point of caching on your own -- the database should be able to do this itself. The UPDATE and INSERT locking issue may be a sign that you don't have the correct indexes for the database to do row-level locking rather than table locking. Also, if you can speed up the individual SELECT queries, you might keep from getting too many backlogged requests.

    From the issues you're mentioning, I'd look into database tuning, and making sure that I had indexes where they're useful, and didn't have them where they weren't. I'd also make sure that the tables were ANALYZEd as needed.

      As I said in previous replies, the cache is used to store the results of multiple long-running computations, so it is not just caching the results of SQL queries.

      Thanks for replying.


      badaiaqrandista
Re: OT: Scalable web application architecture
by Ryszard (Priest) on Dec 07, 2005 at 11:54 UTC
    You've not mentioned what hardware you're running this system on, either.

    If you've got a "large" database on a "small" machine, you'll be wasting your time attempting to optimise it. Get another machine!

    Memory is an order of magnitude faster than IO, so if you can, cache your db in memory! Also indexes: create indexes for the most commonly hit items, or at least revise your indexing strategy.

    You say it's an online reservation system, then say you cache your results. Prima facie this sounds like a bad design, IMO. I've got in mind something like a hotel reservation system, which is an OLTP system. It doesn't make sense to me to cache results, as they have the chance of being immediately out of date.

    That being said, if you're caching your result set, it is probably ok :-).

    Have you thought of pre-running the most expensive queries periodically (perhaps even on an offline copy) and using Cache::Cache? It's super quick and adds only a couple of lines to each module.

    It's a workaround for a performance problem, for sure, but it's something you could get a lot of mileage from.

      It is running on a P3 machine with 1G of RAM (I forgot the clock speed). The box is dedicated to the MySQL server, so it's got all the resources to itself.

      Yes, it is a hotel reservation system. It has fairly complex rules for searching availability. Rooms have maximum capacities. Packages have minimum and maximum numbers of guests, and there are also different special packages for valued guests, travel agents, and online marketing referrals. Prices can have seasons, and each season can have different prices for different room-package combinations. There are more rules.

      Searching for availability must consider all those variables. The results of these computations for a given date are stored in a so-called 'search cache', so the next search on that date doesn't need to do the computation again. The problem with this is that all searches use this table, so it is heavily read; but whenever its data isn't valid, it is updated/inserted, which makes it heavily written as well.

      I am looking for suggestions to improve performance for a real-time system like this, and to hear what other monks think about my approach to the problem. Thanks a lot for the reply.


      badaiaqrandista
Re: OT: Scalable web application architecture
by edoc (Chaplain) on Dec 07, 2005 at 13:36 UTC

    Maybe you need to think about the problem a bit differently. My last big a-ha! moment was when I realised I'd designed my database for data storage, when what I should have been doing was designing it for data retrieval.

    It sounds like you want to be able to do a query something like:

    select room from search_table where max_capacity >= ? and pack_guests >= ? and pack_guests <= ? and valued_guest=? and agent=? and referral=?

    The idea is to turn things around so instead of trying to apply a set of rules to a table of data you put the rules in a table and use it to look up the data.

    In my case I was trying to process orders taking into account available stock, order payment, the dispatch option selected by the customer, etc. So to search for orders ready to be processed, there were lots of lookups and calculations here, there and everywhere to collect all the data needed for each order. I re-organised the order data so each order record had everything I needed, which allowed me to do a single (although not simple) query to get all the orders ready to be processed. Once I had the data laid out correctly, the database could easily and quickly process the query.
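    In Perl, a query like the one above composes naturally from a criteria hash, with placeholders keeping the quoting safe and letting optional criteria drop out cleanly. A sketch (the table and column names follow the example query above, not any real schema):

    ```perl
    use strict;
    use warnings;

    # Turn a hash of search criteria into one parameterized SELECT plus
    # its bind values; undefined criteria are simply omitted.
    sub build_search {
        my (%c) = @_;
        my (@where, @binds);
        for my $clause (
            [ 'max_capacity >= ?' => $c{guests}       ],
            [ 'pack_guests <= ?'  => $c{guests}       ],
            [ 'valued_guest = ?'  => $c{valued_guest} ],
            [ 'agent = ?'         => $c{agent}        ],
        ) {
            next unless defined $clause->[1];
            push @where, $clause->[0];
            push @binds, $clause->[1];
        }
        my $sql = 'SELECT room FROM search_table WHERE '
                . join(' AND ', @where);
        return ($sql, \@binds);
    }

    # Two guests, not booking through an agent:
    my ($search_sql, $search_binds) = build_search(guests => 2, agent => 0);
    # then: $dbh->selectcol_arrayref($search_sql, undef, @$search_binds);
    ```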

    On the database config side of things.. have you modified your mysql config to allow it to use as much memory as you think it has available? Tuning Server Parameters

    cheers,

    J

      Yes, that is exactly what I was planning to do with this table. Instead of looping over all combinations of rooms and packages, I put them all in one table and create a query like you said. The problem is keeping it up to date with the actual data, which resides in its own tables in every property database (e.g.: room table, package table, season table, etc...). However, can you give me an example of how a data structure for retrieval differs from one for storage?

      I'll look into MySQL tuning again. It seems everyone points me to it, so there must be something there.

      badaiaqrandista

        Storage vs. retrieval typically comes down to optimizing for space vs. optimizing for time.

        For instance, there's the concept of data normalization -- you associate data with its main identifier (e.g., a person has an address, a reservation has a person, a reservation has a room, a room has a capacity, and so on down the chain). The problem comes when you need to retrieve the data: it's stored in a very compact manner, but the system has to join multiple tables together to do its basic day-to-day operations. (And what if something changes? When Bob Smith moves, do we want his bill from last year's stay to show his old address, or his new one?)

        Often, when you're optimizing for retrieval, you have more indexes, chosen based on how you're going to look for the data. When optimizing for storage, you only need indexes for maintaining unique constraints (a PK is a type of unique constraint), so you won't do things like multi-field indexes (which let you pull the most commonly used fields out of the table without reading the whole record) and other such tricks.

        Um... I could ramble on about this for hours, but that should give you a quick idea of the issues.

        Oh -- and have you even tried profiling the app? E.g., checking where the program is spending all of its time?

      This is exactly the use case for views in databases, which allow you to do complex joins in a single select call between your app and the db. This lets the db optimize the query and lets you keep your tables normalized.

      cp
      ----
      "Never be afraid to try something new. Remember, amateurs built the ark. Professionals built the Titanic."
Re: OT: Scalable web application architecture
by talexb (Canon) on Dec 07, 2005 at 14:07 UTC

    My web application started getting significantly slower recently, and after trying a few things (I use Postgres), I tried a VACUUM (secure in the knowledge that I'd been doing a daily VACUUM ANALYZE and this would have no effect).

    Heh.

    Actually, the VACUUM ran for about 45 minutes, and performance improved by two orders of magnitude. Yep, I had some web pages that were taking 45 seconds to over four minutes to appear .. I'm now back in the 1-3 second range, much more livable.

    I've been away from MySQL too long to remember what the equivalent command (if any) is, but don't forget to prod your database into performing well.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

    ps 300 classes? Ouch!!!

    pps English is not my native language .. your post is better English than I've read from native speakers .. don't worry about that at all.

Re: OT: Scalable web application architecture
by weierophinney (Pilgrim) on Dec 07, 2005 at 14:43 UTC
    You don't say one way or the other, but are you doing all operations on a single database on a single server?

    If so, you may want to consider using mysql's replication features. Set up a few slave databases, and perform all read-only operations (searches) on those, while writing to one master database (write operations then get pushed to all slaves). This will give you more available database connections, and also spread the load out between several machines. You can use round-robin DNS to determine what DB server to utilize at any given time.
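    A sketch of the read/write split on the client side (the DSNs and the DBRouter name are made up; real code would hand these DSNs to DBI->connect):

    ```perl
    use strict;
    use warnings;

    # Route writes to the master and spread reads across the slaves
    # in round-robin fashion.
    {
        package DBRouter;

        sub new {
            my ($class, %args) = @_;
            return bless {
                master => $args{master},
                slaves => $args{slaves},
                next   => 0,
            }, $class;
        }

        sub write_dsn { $_[0]->{master} }

        sub read_dsn {
            my ($self) = @_;
            my $dsn = $self->{slaves}[ $self->{next} ];
            $self->{next} = ($self->{next} + 1) % @{ $self->{slaves} };
            return $dsn;
        }
    }

    my $router = DBRouter->new(
        master => 'dbi:mysql:resv;host=db-master',
        slaves => [ 'dbi:mysql:resv;host=db-slave1',
                    'dbi:mysql:resv;host=db-slave2' ],
    );
    # DBI->connect($router->read_dsn, ...)  for SELECTs,
    # DBI->connect($router->write_dsn, ...) for UPDATE/INSERT.
    ```

    One caveat with this design: replication is asynchronous, so a read that must see a just-committed write still has to go to the master.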

    I'd look at your indexing, as well. If you are having troubles with searches timing out, most likely you are not indexing the right criteria.

Re: OT: Scalable web application architecture
by ph713 (Pilgrim) on Dec 07, 2005 at 16:55 UTC
    In relation to another comment earlier on, if you're using OO methods to access the database, that can be a serious performance issue. If performance is a real problem for you, you may want to consider replacing Class::DBI or DBIx::Class with Rose::DB::Object. RDBO is considerably faster, as objective benchmarks have shown. I personally use DBIx::Class, but if performance was a more serious issue for me, I'd switch.

    Also, I'd highly recommend investigating PostgreSQL for this application. MySQL has its place, but this may not be a good match for it. PostgreSQL scales better under heavy transactional loads, and is generally a much more rigorous, professional-grade, extensible, and standardized solution than MySQL. Prior to the recent release of MySQL 5, I would have said you'd be crazy to run this on MySQL at all. It's still crazy to run it on 4.x. Either upgrade to 5, or switch to PostgreSQL. My opinion on the matter strongly leans towards PostgreSQL. It's just my opinion, but it's backed by a lot of experience with high-performance transactional RDBMSes from various vendors. (Oh, and if you make the switch, be sure to use PostgreSQL 8.1; no sense going through all the trouble of switching over and then using something outdated.)

Re: OT: Scalable web application architecture
by dave0 (Friar) on Dec 07, 2005 at 17:42 UTC

    You should take a look at Brad Fitzpatrick's presentations about how LiveJournal scaled up their systems. It's not a HOWTO, but it's definitely a good read for anyone working on scalable web apps with Perl and MySQL.

    I believe the most recent version is at http://www.danga.com/words/2005_oscon/

      Thanks for the link. I have read the presentation a couple of times, but it is more about scaling the whole architecture, including hardware and network connections. I am looking for something more code-wise and database-structure-wise. My boss is pretty tight about adding new hardware, so I just have to live with it. And having to code with limited resources stretches my mind and challenges my knowledge, so I don't really mind him being like that.

      badaiaqrandista
Re: OT: Scalable web application architecture
by skx (Parson) on Dec 07, 2005 at 19:21 UTC
Re: OT: Scalable web application architecture
by dragonchild (Archbishop) on Dec 08, 2005 at 15:13 UTC
    Are you preloading all your classes in your startup.pl?

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      Yes, I preload everything. Does it have anything to do with scalability?

      badaiaqrandista
        Yes, quite a lot. The more you can preload in your startup.pl, the more there is in shared memory. The more you have in shared memory, the less memory each Apache child takes. The less each takes, the more children you can have running simultaneously. The more simultaneous children you have running, the more simultaneous requests you can serve.* Instant scalability.

        This doesn't just apply to your modules. One thing I have done in the past is to preload large constant data structures that are stored in the database. You can do this by loading them as a singleton class in your startup.pl. You can then access them by accessing the singleton class. It's kinda like global variables for your mod_perl application, but not sucky. You could also mark them with ReadOnly for even greater safety.
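        A minimal sketch of such a singleton (the class name and data are made up; in real code _load would run the SELECT once):

        ```perl
        use strict;
        use warnings;

        # A singleton holding a large, read-only data structure. Calling
        # RoomTypes->instance in startup.pl, before Apache forks, keeps
        # one shared copy in memory instead of one per child.
        {
            package RoomTypes;

            my $instance;

            sub instance {
                my ($class) = @_;
                $instance ||= $class->_load;
                return $instance;
            }

            sub _load {
                # In real code: SELECT the rows from the database once.
                return bless { 1 => 'single', 2 => 'double' }, shift;
            }

            sub name_for { $_[0]->{ $_[1] } }
        }

        # In startup.pl:  RoomTypes->instance;   # warm the cache pre-fork
        my $name = RoomTypes->instance->name_for(2);
        ```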

        Note that preloading only works if you don't modify the stuff you preloaded. If you do, then it's copied into the child's memory and you lose the benefits.

        *: Stas Bekman (the author of mod_perl) wrote an excellent 8-part series on optimizing your mod_perl application. Parts 3-5 discuss shared memory. They're a little difficult to get into, but you'll learn a ton about how to improve your mod_perl app's performance.


        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Node Type: perlquestion [id://514748]
Approved by valdez
Front-paged by Corion