PerlMonks
Re: Re: Re: "The First Rule of Distributed Objects is..."

by exussum0 (Vicar)
on Oct 21, 2003 at 21:12 UTC


in reply to Re: Re: "The First Rule of Distributed Objects is..."
in thread Multi tiered web applications in Perl

I'm not sure what your point is here. Yes, scaling typically involves adding hardware as well as tuning your code. What does this have to do with physical tiers vs. logical tiers?
Just revving up :)
First, you only separate things if they are so uneven in terms of the resources they require that normal load-balancing (treating all requests for dynamic pages equally) doesn't work. This is very rare. Second, you NEVER change links! There is no reason to change your outside URLs just because the internal ones changed. If you want to change where you route the requests internally, use your load-balancer or mod_rewrite/mod_proxy to do it.
Sometimes you don't know ahead of time what gets used more than not. Usually products evolve. And using mod_proxy or the like is a kludge. What happens when you move something more than once? Regardless, doing the extra work to move stuff around is ugly anyway. It's unnecessary work if you separate your "hard work" away.
How does this help anything? The amount of work is still the same, except you have now added some extra RPC overhead.
Yeah, but you've moved your hard work away to pools dedicated to resources that can handle it. Static stuff never needs to hit db resources. Things that need to do intensive work go to a pool built for that. It breaks down more and more.
Can you explain what you're talking about here? Are you saying that some of your requests will not actually need to use the db layer or application layer?
Ok, for instance, let's say your login page is really fast, and so is your page after auth. Now let's say your preferences page is REALLY slow. It takes up lots of resources since it gets pegged a lot. The logic that is so slow because it gets hit so much can be separated out into its own pool. Now you have one set {login, homepage} and another {preferences}, which can live in two different pools. The pools don't have to be server farms of machines, but as things get more complex (or less), you can allocate or take away resources for them.
Running well is a relative term. They will not run anywhere near as fast as they would without the RPC overhead. I'm not making this stuff up; this is coming from real applications in multiple languages that I have profiled over the years. Socket communication is fast, but it's much slower than most of the other things we do in a program that don't require inter-process communication.
You are 100% right, but the time difference is insignificant. An in-machine IPC call is an order of magnitude faster than a pooled network connection, but in terms of user experience the difference is so small that you can hardly notice.
And that's the other problem: the RPC overhead forces you to replace a clean OO interface of fine-grained methods with "one fell-swoop..." This is one of Fowler's biggest complaints about distributed designs.
No it doesn't. They are called transfer objects. Just a basket where you say "I want NN" and it all comes back in one request. Nothing particularly messy about it. If you do it generically enough, I'm sure you could abstract it out to many, many uses rather than one dedicated object transfer.
I'm not sure where you're getting this from. mod_rewrite (the most common choice for doing things based on URL) is very fast, and the hardware load-balancers are very very fast.
Yup, but then again, so are RPC calls :)
What's so bad about a web farm? Every large site requires multiple web servers. And which parts are costly? I think you are imagining that it would take less hardware to achieve the same level of service if you could restrict what is running on each box so that some only run "business objects" while others run templating and form parsing code. I don't see any evidence to support this idea. If the load is distributed pretty evenly among the machines, I would expect the same amount of throughput, except that with the separate physical tiers you are adding additional load in the form of RPC overhead.
Ah.. that's the thing: evenly. You don't want everything running evenly. If slashdot could separate out, say, its front page logic from its comment logic, then the front page will always be speedy and the comments section will run at its own relative speed. As more people do commenty stuff, the home page stays right quick.
Think of it like this: you have a request that takes 10 resource units to handle -- 2 for the display and 8 for the business logic. You have 2 servers that can each do 500 resource units per second for a total of 1000. If you don't split the display and business logic across multiple servers and you distribute the load evenly across them, you will be able to handle 100 requests per second on these servers. If you split things up so that one server handles display and the other handles business logic, you will max out the business logic one at 62 requests per second (496 units on that one box). So you buy another server to add to your business logic pool and now you can handle 125 requests per second, but your display logic box is only half utilized, and if you had left these all together and split them evenly across three boxes you could have been handling 150 at this point. And this doesn't even take the RPC overhead into account!
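The arithmetic in the paragraph above can be checked with a short sketch (the numbers are the hypothetical ones from the example, not measurements):

```python
# Hypothetical costs from the example above: each request takes 10 resource
# units (2 display + 8 business logic); each server supplies 500 units/sec.
UNITS_PER_SERVER = 500
DISPLAY_COST, BUSINESS_COST = 2, 8

def combined_throughput(n_servers):
    """All servers handle whole requests (display + business together)."""
    return n_servers * UNITS_PER_SERVER // (DISPLAY_COST + BUSINESS_COST)

def split_throughput(display_servers, business_servers):
    """Dedicated pools per tier: throughput is capped by the slower tier."""
    display_cap = display_servers * UNITS_PER_SERVER // DISPLAY_COST
    business_cap = business_servers * UNITS_PER_SERVER // BUSINESS_COST
    return min(display_cap, business_cap)

print(combined_throughput(2))   # 100 req/s with two combined boxes
print(split_throughput(1, 1))   # 62 req/s: the business box is the bottleneck
print(split_throughput(1, 2))   # 125 req/s after adding a business box
print(combined_throughput(3))   # 150 req/s with the same three boxes combined
```

The split layout wastes the display box's spare capacity, which is exactly the imbalance the paragraph describes.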

Distributed objects sound really cool, and there is a place for remote cross-language protocols like SOAP and XML-RPC, but the scalability argument is a red herring kept alive by vendors.

Or not. Say the cost of rendering a page is small, s. You have 1 server that can deal with 10 connections really well. On that same server you have b, a big process that takes a lot of time, and 10 tiny processes, t. b bogs down t to the point of "slow". You add another server. Things get "better", but imagine if you tiered it. You have three machines: one that handles s, one that handles b, and one that handles t. The t-machine will always run fast. And as more people use b, you add more resources for b. But as b continuously gets more and more people, t NEVER slows down. b dragging t down is exactly what you want to avoid.

You don't want to add to the entire pool and have to speed up everything in one fell swoop. It's the same reason you have a 3d video card and a cpu completely separate: totally separate purposes for different things. If your cpu gets pegged for whatever reason, your 3d video doesn't. You can tweak the 3d card or even replace it w/o having to go through hell.

Btw, there's always statistics, but we can always skew them in various ways. I can quote numbers any way I want, even to refute my own argument. But you can't refute that if T stays simple and fast, and B gets more complex, then T would be unaffected. :)

Btw, my GF is ticked from all the typing you are making me do. She was on the phone and kept thinking I was doing something more important, from the clackity-clack I was making :)

Play that funky music white boy..


Re: Re: Re: Re: "The First Rule of Distributed Objects is..."
by perrin (Chancellor) on Oct 21, 2003 at 22:26 UTC
    Sometimes you don't know ahead of time what gets used more than not. Usually products evolve. And using mod_proxy or the like is a kludge. What happens when you move something more than once?

    It's not a kludge. Reverse proxying is widely used, and so is hardware load-balancing. IBM will be glad to sell you a commercial reverse proxy server if you prefer, but it's all the same idea. It's also trivial to change how URLs are being handled. You can send all requests for /foo/ to a special set of servers with a single mod_rewrite line, and even if you change it a million times no one on the outside will ever know or have to update a bookmark.
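    For instance (hostnames hypothetical), routing one URL prefix to a dedicated pool really is a one-line change in the front-end Apache config:

```apache
# httpd.conf on the public-facing server
RewriteEngine On
# Proxy every /foo/ request to an internal pool; [P] hands it to mod_proxy,
# so outside URLs never change no matter where /foo/ lives internally.
RewriteRule ^/foo/(.*)$ http://foo-pool.internal/foo/$1 [P,L]
```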

    Static stuff never needs to hit db resources.

    Of course. That's why I said you should keep your static web content separate from your dynamic requests. But this doesn't have much to do with logical tiers vs. physical tiers.

    Ok, for instance, let's say your login page is really fast, and so is your page after auth. Now let's say your preferences page is REALLY slow. It takes up lots of resources since it gets pegged a lot. The logic that is so slow because it gets hit so much can be separated out into its own pool. Now you have one set {login, homepage} and another {preferences}, which can live in two different pools.

    Okay, what did we gain from that? If these were sharing resources on two machines before and were slow, we will now have one under-used machine and one overloaded machine. The balance is worse than it was.

    An in-machine IPC call is an order of magnitude faster than a pooled network connection, but in terms of user experience the difference is so small that you can hardly notice.

    If every request takes a tenth of a second longer than it did, no single user will have a slow experience but the scalability (in terms of requests that can be handled per second) will suffer in a big way.
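    To make that concrete (the numbers here are illustrative, not from the thread): with synchronous server processes, a fixed per-request delay comes straight out of throughput, even though no single user perceives it:

```python
# Hypothetical numbers: a pool of synchronous workers, each handling one
# request at a time, with a fixed RPC delay added to every request.
WORKERS = 50
BASE_MS = 50                      # ms of real work per request

def max_throughput(extra_delay_ms):
    """Requests/second the worker pool can sustain."""
    return WORKERS * 1000 / (BASE_MS + extra_delay_ms)

print(max_throughput(0))     # 1000.0 req/s with no RPC hop
print(max_throughput(100))   # ~333.3 req/s if each request waits 100 ms on RPC
```

A tenth of a second is invisible to one user, but it triples the hardware needed for the same load.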

    No it doesn't. They are called transfer objects. Just a basket where you say "I want NN" and it all comes back in one request.

    Forcing every communication between objects to be something that can be handled in one call just isn't a good design. Ideal OO design involves many small objects, not a few monolithic ones.

    Ah.. that's the thing: evenly. You don't want everything running evenly. If slashdot could separate out, say, its front page logic from its comment logic, then the front page will always be speedy and the comments section will run at its own relative speed. As more people do commenty stuff, the home page stays right quick.

    You keep talking about putting separate pages on different machines, but this conversation was originally about tiers, i.e. page generation, business logic, database access objects. Most dynamic pages will need all of these for any given request.

    It sounds like you are saying that you want to be able to sacrifice parts of a site and let them have rotten performance as long as you can keep other parts fast. I don't think that's a common goal, and I wouldn't call that scalability (how can you say the site is scaling if part of it is not scaling?), but it can easily be done with mod_rewrite or a load-balancer directing the comments requests to some specific servers. (Incidentally, Slashdot caches their front page and serves it as a static page unless you are logged in.)

    b bogs down t to the point of "slow". You add another server. Things get "better", but imagine if you tiered it. You have three machines: one that handles s, one that handles b, and one that handles t. The t-machine will always run fast. And as more people use b, you add more resources for b. But as b continuously gets more and more people, t NEVER slows down. b dragging t down is exactly what you want to avoid.

    The only way this could actually be an advantage is if you are willing to let b get overloaded and slow, as long as t does not. That is not a common situation at the sites where I've worked.

    You don't want to add to the entire pool and have to speed up everything in one fell swoop. It's the same reason you have a 3d video card and a cpu completely separate.

    The difference is that those are not interchangeable resources, i.e. splitting your rendering across the two of them doesn't work well since one of them is much better at it than the other is. In the case of identical servers with general resources like CPU and RAM, each one is equally capable of handling any request.

    But you can't refute that if T stays simple and fast, and B gets more complex, that T would be unaffected. :)

    I agree, but I think that if you added the necessary resources to keep B running fast (as opposed to just letting it suck more and more), then T would be unaffected in either scenario.

    Better be nice to the GF! That's one area where load-balancing is extremely problematic...

      I'm not saying load balancers are a kludge. You are putting words where there weren't any. Things like mod_proxy are, since they are slow. I've seen them implemented, and they just add an extra load.

      You don't create monolithic objects. You create containers. That's like complaining that ArrayList is monolithic because you put a bunch of Integer objects in it. You create a DTO object which contains everything you will need, and one method returns the contained objects in some organized fashion.
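      The transfer-object idea can be sketched in a few lines (names here are hypothetical; the thread's examples are in Java terms like ArrayList and Integer, but the pattern is language-agnostic):

```python
from dataclasses import dataclass

# Hypothetical transfer object: one coarse-grained call returns everything
# the caller needs, instead of many fine-grained remote getters.
@dataclass
class AddressDTO:
    city: str
    state: str
    zip_code: str

def get_address_details(user_id):
    """A single round trip returns the whole basket of fields.
    In a real system this would be one RPC; stubbed here for illustration."""
    return AddressDTO(city="Springfield", state="IL", zip_code="62701")

dto = get_address_details(42)
print(dto.city, dto.state, dto.zip_code)
```

The container itself stays dumb; only the number of round trips changes.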

      I'm also saying, don't let certain parts go rotten. I'm saying some parts will require resources beyond those of others. You separate them out. But you see, that's the problem with life. Sometimes B will be overloaded and slow w/o extra resources, and there isn't much you can do about it. It's life. But when those things get hit, they could slow down the entire service as a whole. You know the various costs of things beforehand, and by separating them out, you are prepared to allocate new resources to them when needed.

      Play that funky music white boy..
        Thanks for making me think about this some more. Let's summarize things a bit.

        I think that adding one or more RPC calls to each request will add significant overhead. You think it will get lost in the noise. We probably won't agree on that, and I only have anecdotal evidence to prove my point, so let's agree to disagree.

        I think that forcing the communication between the presentation logic, domain objects, and database layer to be done with coarse-grained calls is a problem. You don't think it matters. Fowler talks about this in his recent book, and that chapter is mostly republished in this article (which requires a free registration). Here's a relevant excerpt:

        A local interface is best as a fine-grained interface. Thus, if I have an address class, a good interface will have separate methods for getting the city, getting the state, setting the city, setting the state and so forth. A fine-grained interface is good because it follows the general OO principle of lots of little pieces that can be combined and overridden in various ways to extend the design into the future.

        A fine-grained interface doesn't work well when it's remote. When method calls are slow, you want to obtain or update the city, state and zip in one call rather than three. The resulting interface is coarse-grained, designed not for flexibility and extendibility but for minimizing calls. Here you'll see an interface along the lines of get-address details and update-address details. It's much more awkward to program to, but for performance, you need to have it.

        I would point out that with a fine-grained interface you could throw an error when someone passes in a bad zip code, while a coarse-grained one would necessitate gathering up all the errors from all the input, putting them in some kind of structure to pass back, and then making the client code go through the structure and respond to each issue. It just isn't as fluid. But we will probably not agree on this either. I do recognize that there are situations where everything can be summed up in a single call, but I don't think all communications between layers fit well into that.
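        The contrast can be sketched like this (hypothetical names, continuing the address example): a fine-grained setter can fail fast on one bad field, while a coarse-grained update must gather every problem into a structure the client then has to walk:

```python
def set_zip(zip_code):
    # Fine-grained: reject the single field being set, immediately.
    if not (zip_code.isdigit() and len(zip_code) == 5):
        raise ValueError("bad zip code")
    return zip_code

def update_address_details(details):
    # Coarse-grained: validate everything, collect all errors, return them.
    errors = []
    if not details.get("city"):
        errors.append("city is required")
    zip_code = details.get("zip", "")
    if not (zip_code.isdigit() and len(zip_code) == 5):
        errors.append("bad zip code")
    return errors  # the caller must walk this structure itself

print(update_address_details({"city": "", "zip": "abc"}))
```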

        Finally, you seem to see the primary value of a distributed architecture as the ability to isolate individual sections of the app. You are talking about fairly large sections, like entire pages, so I think this is separate from the original question of whether or not the presentation layer and application layer should be on the same machine. I agree that there are uses for this, but I still think they only apply when you are willing to let a certain section of your application perform badly as long as another section performs well. I don't see how your statement that "some parts will require resources beyond that of others" applies to this. Of course they will, and at that point you can either improve your overall capacity to handle it, or isolate the part that is performing badly and let it continue to perform badly while the rest of the application is fast.

        I'll give an example of a use for this. Say you have an e-commerce site that has a feature to track a customer's packages. This involves a query to an external company's web site. It's slow, and there is nothing you can do about it since that resource is out of your control. Letting all of your servers handle requests for this could result in tying up many server processes while they wait for results and could lead to a general slowdown. You could either add resources to the whole site in order to offer the best performance possible for all requests, or you could isolate these package tracking requests on a few machines, making them incredibly slow but allowing the rest of the site (which is making you money) to stay fast. This could be a good compromise, depending on the economics of the situation.

        Note that if you then go and add more machines to the slow package tracking cluster to fully handle the load, I would consider the isolation pointless. You could have simply left things all together and added those machines to the general cluster, with the exact same result.

        I said this was easy to implement with mod_proxy, and it is, but you correctly pointed out that mod_proxy has significant overhead. There are some other benefits to the mod_proxy approach (caching, serving static files) but for just isolating a particular set of URLs to specific groups of machines you would probably be better off doing it with a hardware load-balancer.

Re: Re: Re: Re: "The First Rule of Distributed Objects is..."
by Anonymous Monk on Oct 21, 2003 at 22:55 UTC
    I'm not sure where you're getting this from. mod_rewrite (the most common choice for doing things based on URL) is very fast, and the hardware load-balancers are very very fast.
    Yup, but then again, so are RPC calls :)
    No they are not. You cannot invent facts.
      Uh, RPC calls are quite fast, otherwise NFS wouldn't work very well. :P
      Play that funky music white boy..
Re: Re: Re: Re: "The First Rule of Distributed Objects is..."
by chromatic (Archbishop) on Oct 21, 2003 at 23:01 UTC
    If slashdot could separate out, say, its front page logic from its comment logic, then the front page will always be speedy and the comments section will run at its own relative speed.

    That's exactly what Slashdot does by serving a static HTML file to anonymous users. No RPC necessary, unless you count the NFS directory shared between all web heads (hidden behind a load balancer) as RPC. It's out of process, so I say it's not.

      Right, but you forget that once a user is logged in, it's still a faster process than comment rendering.
      Play that funky music white boy..
Re: Re: Re: Re: "The First Rule of Distributed Objects is..."
by tilly (Archbishop) on Oct 25, 2003 at 01:42 UTC
    Sorry for asking, but what is your actual experience with high-volume websites? How many have you been intimately involved with which, say, peak at over a million page views per hour?

    I ask this because I happen to know that perrin has direct experience with that level of volume, and has years of experience with a number of high-volume sites (admittedly most not peaking at over a million page views per hour) at companies with a variety of different technology mixes. I also know for a fact that your arguments about application servers are standard advertising copy from the vendors of application servers, and that doesn't necessarily match the experience on the ground. This flavours my reaction to what perrin has to say.

    Since I have raised the question of qualifications, let me be honest about my own. I don't have a lot of high-volume website experience. What I mostly have is enough math and theory to do back-of-the-envelope calculations on scalability and latency. And it is obvious to me that adding extra stages has to increase latency, CPU and internal network traffic, all of which at high volume show up in eventual hardware costs and the user experience. (Enough hardware requires more employees as well...) Plus users often judge you more on latency than throughput. Throughput you can buy hardware to cover, but latency is not something that you can ever get back once you lose it. (That is a lesson that I learned early, which is not generally appreciated nearly as much as I wish it was.)
