Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

A distributed design challenge

by Ovid (Cardinal)
on Jun 29, 2012 at 15:07 UTC ( #979134=perlquestion: print w/ replies, xml ) Need Help??
Ovid has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

My apologies for the vague query (and one that's not Perl-specific, but the core software is written in Dancer), but I've come across a distributed design problem that I thought Monks might find interesting and perhaps be willing to offer suggestions on.

The Setup

I'm writing real time bidding software. It works like this:

  1. My software receives a request to bid on an auction
  2. I return a bid on said request (I must respond in 85 to 100 milliseconds)
  3. Later, I receive a notification on whether or not I won the auction

Point number 3 above means I don't find out if I've won a request until shortly after I've made the request.

Currently, on a server farm of about 36 boxes, we're receiving roughly 400 bid requests a second, or about 11 requests/second per box. This is an extremely light load and will likely increase by one or two orders of magnitude when the system goes live.

We have various campaigns, each with individual daily and total budgets that cannot be overspent (a bit of wiggle room is allowed). The servers run separate databases and the only shared data source we have is a Redis server with two slaves, though we could implement others.

The Problem

(The monetary numbers are fictional)

Several times we've been close to our budget and we've had 16 servers respond simultaneously with winning bids, pushing us further over our budget than we would like. We keep track of our spending in Redis, so if all 36 servers respond to a bid at the same time, they might see that they've spent $99 (at a budget of $100), but not realize that all of them winning will push us to $135, or 35% over our limit.

In other words, it's impossible for the servers to know that other servers are bidding at the same time, or whether or not they'll win.

Solutions

One naive strategy is to have each server register with Redis and receive a unique, sequential number. Each server, when receiving a bid request, will do a modulus operation against a time segment to decide if it should bid or not. This requires that all 36 servers have their time synchronized (not too big of a deal), but effectively reduces our throughput capacity to 1/36th of what we would like.

Further, we have multiple campaigns. We might want to have server/campaign combinations, so if server 1 can bid on campaign A right now, then maybe server 2 is bidding on campaign B, thus ensuring that servers are not running idle (I can't tell you the number of campaigns, but there are many).

But then what happens if we get an important, but low-volume campaign where we get a bid request every five minutes or so? It's very conceivable that it will rarely, if ever, hit the server which is allowed to "handle" that campaign at the time. Further, calculating how often a particular campaign comes in is very hard (perhaps not NP hard, but still hard).

Do any monks have experience here? Or do you have thoughts on how to approach this? Writing everything to a central RDBMS could cause serious contention issues as we'd use transactions to lock critical data. Any solution that increases response times by more than a few milliseconds is likely to be problematic.

Comment on A distributed design challenge
Re: A distributed design challenge
by choroba (Abbot) on Jun 29, 2012 at 15:25 UTC
    I have never programmed anything like that. Therefore, my idea is just a pure speculation: What about adding a random factor - sometimes (randomly), a server could respond with a bit lower result than normally. You might have to tune the factor and amount not to lose too many bids completely.
Re: A distributed design challenge
by jethro (Monsignor) on Jun 29, 2012 at 15:52 UTC

    The budget of a campaign is stored centrally on Redis, right? Otherwise winning a bid and hitting the budget limit on one server would still not hinder other servers bidding for more

    So why isn't the bidding amount subtracted from the budget when a server bids instead of when it wins. Naturally this means it has to be added again when the bid is lost, so there is slightly more overhead.

    No overspending is possible in this case (substracting the budget must be an atomic operation naturally)

    But I might be totally off. I don't understand your calculation that if 36 servers bid $99 with a total budget of $100 and all win, that they are $35 over limit and not 36*$99 -$100.

      I don't understand your calculation that if 36 servers bid $99 with a total budget of $100 and all win, that they are $35 over limit and not 36*$99 -$100.
      I think you misparsed the original statement. He was saying that there is $1 total left in the budget at that time and all 36 servers are trying to spend $1 at the same time.

        Ok, this makes sense. They would bid only $1 because their budget is limited to $1, instead of making the bid they calculated as sensible.

      I agree (except for the minor confusion with some numbers in the last paragraph).

      Bids probably need to be subtracted from a budget in Redis. You might have a separate "budget" that includes bids while keeping the budget that only includes payments for winning bids.

      You might also hedge and say that each bid only has a (say) 60% chance of winning so bidding $50 subtracts only $30 from the "budget less outstanding bids", realizing that this risks going over the budget if you win more than an average of 60% of bids (weighted by bid size).

      There are several complications of this approach:

      1. Each bid must talk to the single Redis master
        • This prevents horizontal scaling of Redis capacity
        • This causes an outage in bidding when the master fails (and it may take a while to promote one of the slaves to be the new master)
      2. You probably need to deal with a bidding host dieing and then failing to remove a bid in a timely manner, leaving the available budget unnecessarily low for too long

      Also, the example Ovid gives seems way too optimistic and thus to greatly understate the risk. If the budget is $100 and $50 of it has been spent and there are no outstanding bids, then 36 systems could each bid $50 and end up spending $1800 and exceeding the budget by 1750%, not a measly 35%. So that makes me suspect that the explanation was incomplete.

      But winning bids already have to access the Redis master, so unless winning bids are a small fraction of regular bids, there isn't much benefit to horizontal scaling with Redis slaves. But I could certainly see the timing being less critical for recording winning bids (and thus an outage on the master would prevent bidding but not be as serious of a problem for the recording of winning bids).

      I can see a lot of approaches for dealing with better scaling and higher availability while still subtracting each bid from a central balance. But I don't think a simple Redis set-up would be the solution. I'd probably want multiple masters so you might want to look at membase or riak, for example (though that might not really address scaling).

      You could also shard your requests so that bids against budget $X only go to (say) 3 of the available servers and then each server can bid upto 1/3 of the available budget (using a server-local central bid amount tracker, perhaps a local Redis instance). Then your 100s of budgets spread load across your dozens of servers but each budget only spreads over a smaller subset. That obviously can get complicated to maintain that mapping in a way where load is sufficiently balanced.

      You can also just roll your own somewhat trivial multi-master infrastructure where each master is responsible for broadcasting updates to other masters. You can even use "broadcast" or "multicast" packets for notifying all masters "at once", if you seriously trust the capacity and reliability of your network. This also brings to mind pubsub.

      - tye        

        I had considered subtracting bid amounts from the budget, but rejected it for several reasons, all of which are problematic. First, if 36 boxes bid simultaneously, we could overshoot our budget unless we have boxes "lock" the bid amount, but that changes us from parallel to serial bidding and increases response time to the point where bids that should have been one would be lost.

        Serial bidding might be appropriate as we get close to the budget, but that raises another issue: the box that receives the notification that we lost is usually not the box that made the corresponding bid. Because we don't receive the bid amount in the loss notification, we'd have to search all 36 boxes to find out who bid what (or tremendously complicate our Redis setup by trying to store huge amounts of bid information in Redis and we're expecting our volume of data to increase by a couple of orders of magnitude).

        As for your suggestion of me being overly optimistic, that might be the case, but the bid amounts are (fortunately) small fractions of the budget, so it won't be catastrophic until we scale up.

        Riak is not an option because the response times on Riak are slow enough that we'll lose bids we should have one (though we're probably going to put Redis in front of Riak). I don't know about membase.

        We're probably going to go with a strategy of allocating separate budgets to separate boxes. It's more complicated, but it seems much safer. We lose a lot of bidding capacity for individual campaigns, but for many campaigns, I think it's doable.

      We considered subtracting the bidding amount from the budget when the server bids, but there are some technical reasons which make it more difficult to restore that amount if the server loses. Overspending is still possible when you have 36 boxes bidding at the exact same moment (unless one of them takes a lock on Redis, but that changes our parallel bidding to serial and kills our response times, meaning we lose bids we would want to win).

Re: A distributed design challenge
by SuicideJunkie (Priest) on Jun 29, 2012 at 16:14 UTC

    How big are the bids relative to the budget and what response options do you have? You could try a throttle back on the bidding rate or discount your bid amounts as you get closer to the limit:

    A) Knowing that you've got N servers and a general rate M of requests, and a typical time of T between bids and win notifications, a quick statistical calculation will tell you approximately how many other bids you can expect the other servers to submit before you know if you've won the current offer.
    Reduce your odds of bidding accordingly. EG: if the expected amount of wins is x2.0 what the current budget will allow for, then decline to bid with 0.5 chance.

    B) Set up a table of "Budget remaining" vs "Discount demanded". For example, if you're down to $40 out of the $100 initial budget, then reduce your bid amount by 5%. If you get down to $20 out of $100, then reduce bid amounts by 10%. $10/$100 => reduce bid by 20%.
    The idea here being that you will reduce the number of bids you win as you get close to the limit, while at the same time getting more profit from the bids you do happen to win.

      Bids are rather small relative to the budget and they are probably going to stay that way. Discounting bid amounts is problematic because the entire reason for bidding is because we want to win the auctions. There is a sharp drop-off in bids won as the bid decreases (the graph is almost vertical at a particular cut-off point. Instead, a more viable strategy is to not bid lower, but to bid less frequently.

Re: A distributed design challenge
by moritz (Cardinal) on Jun 29, 2012 at 17:26 UTC

    I don't have any experience with such systems, just a few naive ideas:

    • give every machine a local budget, and only talk to the central server when the local budget is exhausted. This shoudl reduce load and thus contention on the central server
    • give every machine a local budget, and have it obtain budgets from other peers when its own budget is exhausted. If it can't do that (or can't do it fast enough), don't approve the transaction
    • Make something home-grown, simple and very fast on the server side for querying and changing the available budget. You'll only need to support very few operations (query and decrease budgets), so something tailored specifically to that use case might well be faster than a generic solution like redis
    • keep statistics about how many of your bids were successful, and use these stastistics to allocate local budgets
Re: A distributed design challenge
by BrowserUk (Pope) on Jun 29, 2012 at 19:00 UTC

    Assumptions:

    1. There are many budgets, each for a different type(*) of request.
    2. All the requests arrive at the same place and are distributed to the 36 boxes via some kind of load-sharing mechanism.
    3. There is some mechanism -- I'm using time/amount below; but a "bid id" or similar, would work as well if not better -- for relating successful bids back to a list of outstanding bids.
    4. Update: also assumes that you do not receive explicit notifications for failed bids -- hence the need to track them.

      When the budget manager is discarding unsuccessful bids, following the receipt of a success, it can use the values of the failures to adjust its bidding strategy.

    Solution:

    • Re-gig the load-sharing mechanism to route all requests of a particular type to one box.
    • Each box can service multiple request types, but only one box ever services a particular request type.
    • Each box requests (from the Redis box), the current budget for each of its request types, and stores them locally.
    • As requests arrive, the servicing box can make bids up to but not beyond (or perhaps beyond, but only by some wiggle factor) the current budget for that type, and it decrements its local copy of the appropriate budget accordingly.

      It also maintains a list of time + amount for each bid it makes.

    • When notifications of successful bids arrive they are routed directly to the redis box; or another dedicated box charged with maintaining the actual budget within the redis box, and the redis copy of the budget is decremented by the accepted bids.

      When the Redis manager receives notification of a successful bid, it sends notification of the amended real budget, and the amount and time of the successful bid to the machine charged with make bids for that budget.

      It can then discard all outstanding bids prior to the successful one; and apply those still outstanding to the amended real budget to obtain a new local copy.

    This should allow the mechanism to bid as often as possible, whilst keeping the outstanding bid totals within the specified budget (or budget + wiggle).


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: A distributed design challenge
by cavac (Chaplain) on Jun 29, 2012 at 20:49 UTC

    You should have a central server that works on the account balance. Make it use UDP or persistent TCP connections.

    On the network side, it only has to support 3 actions, which namely are "reserve fund for bidding" (temporarly block that amount of money on the account), "bid has succeded" (remove the blocked amount from spending budget) and "bid has failed" (unblock money and release it for another bid request).

    For speed, you could devise some kind of hashing algorithm on the client side to determine which server manages which accounts.

    >Take a look at memcached. This is basically a similar system (just not specifically trimmed to your purpose). You might even be able to adapt it to your needs.

    "You have reached the Monastery. All our helpdesk monks are busy at the moment. Please press "1" to instantly donate 10 currency units for a good cause or press "2" to hang up. Or you can dial "12" to get connected directly to second level support."
Re: A distributed design challenge
by BrowserUk (Pope) on Jun 29, 2012 at 21:20 UTC

    BTW, 36 boxes to handle even 40,000 requests per second seems like extreme overkill.

    I strongly suspect that a decently spec'd, single box with one thread/listener per budget could handle that load. And by virtue of shared memory, greatly simplify the process along the way.

    In an extremely crude analogy run on my 4-core, single socket machine, I've been able to service 60,000+ request-responses per second via UDP with average response times of the order you require.

    If it were not too late to re-specify the hardware, I'd seriously consider using a single, 4-socket(32 or 40 cores with 64/80 hyperthreads) SMP shared memory server. It should easily handle the scale of loads and times you are describing.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      You are right that 36 servers is overkill, but we have a large setup of servers designed to handle very large loads and it was easier to put our software on these servers than to design a smaller system. Plus, the load I mention is actually very light. We've restricted the requests for the test. In a full-scale production system, we expect at least two orders of magnitude more requests.

        Plus, the load I mention is actually very light. We've restricted the requests for the test. In a full-scale production system, we expect at least two orders of magnitude more requests.

        The 40,000 figure I used took that 2-orders increase into account.

        You are right that 36 servers is overkill, but we have a large setup of servers designed to handle very large loads and it was easier to put our software on these servers than to design a smaller system.

        I can understand the motivation for re-using an existing setup if it is lightly loaded.

        That said, the whole problem becomes significantly easier to program by utilising SMP shared memory to hold the budgets; And much easier to guarantee if a dedicated machine handles it rather than relying upon excess capacity on the other systems where spikes in the other applications might break this applications latency requirements.

        I'd be tempted to try and concentrate the other application(s) onto 34 boxes and filch 2 to dedicate for this application in a primary + hot failover setup.

        Of course, I know nothing of the other applications or the internal politics involved that likely make that impractical.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Re: A distributed design challenge
by flexvault (Parson) on Jun 30, 2012 at 13:31 UTC

    Ovid,

    I asked something similar (but different) in the past, and BrowserUk came up with a client/server set of scripts which emphasizes the extraordinary performance improvement using persistent connections for inter-task or inter-machine communications. His scripts are here: Re^7: How to maintain a persistent connection over sockets?. You may want to download them and play with them. The results are quite interesting!

    Also BrowserUk's suggestion about the design of the server is right on. I would just enhance it to be 2 machines at two different locations (redundancy and disaster recovery). But you could always improve that later.

    Others have already stated some really good ideas, but I'm confused by one thing.
      How can the 36 machines get the same input all at once and return 36 similar responses?

      That seems like a waste of resources and the inter-machine communications could eat up a lot of your maximum response time!

    To scale, you need to have all machines/cores/etc. sharing the work-load not duplicating the process.

    On the other hand that was done by the IBM Watson computer ( Jeopardy fame ), but that was to see how many different processes returned similar or exact responses, so that a master task would tally the results and 'speak' with the 'best' answer.

    Good Luck, it sounds like a fun project...Ed

    "Well done is better than well said." - Benjamin Franklin

      As I understand it, it would not be the same request, but similar requests.

      EG:

      • 2pm : Ad budget has $0.10 left, typical bid is 5 cents
      • 2pm + 5 ms : Box 3 gets a request to bid on an ad slot at foo.com
      • 2pm + 6 ms : Box 8 gets a request to bid on an ad slot at bar.com
      • 2pm + 7 ms : Box 13 gets a request to bid on an ad slot at baz.com
      • ...
      • 2pm + 750 ms : Box 3's bid wins - Budget = $0.05
      • 2pm + 751 ms : Box 8's bid wins - Budget = $0.00
      • 2pm + 752 ms : Box 13's bid wins - Budget = -$0.05
      • ...

Re: A distributed design challenge
by sundialsvc4 (Abbot) on Jul 02, 2012 at 23:46 UTC

    I would go so far as to suggest that you really want to minimize the number of machines, because communication between different machines on a rack has to go through networking hardware.   Communication between processes on a single machine does not.   While you have presented this as “a distributed design challenge,” it fairly screams to me that it is not; that it emphatically should not be.

    It would be reasonable to be able to channel the requests for a particular auction to a particular machine as a way of load-balancing, but you undoubtedly do not want multiple machines bidding on the same auction.   In fact, I would go so far as to suggest that, whereas it is “intuitive” that this must demand a highly parallel solution (both inside and outside each box), it is to me almost certain that precisely the opposite is true.   Massive amounts of microseconds will be burned-up trying to decide if it’s okay to make this bid, instead of making it.   What you really want is for one software agent, who does not have to interlock with anyone anywhere, to be solely responsible for any given auction from its start to its completion and able to act entirely upon its own authority.   Have a way (but only insofar as it is provably necessary ...) to marshal the requests to the appropriate box and thence to the appropriate software-agent within that box.   Balance the system by adjusting either or both of those two controls.   Now, upon receipt of a valid request, the software agent can make its decision instantly, based solely on information which it has in its hand, without consulting anyone and without waiting for anything.   When the gavel falls, it can inform others of the concluded outcome.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://979134]
Approved by marto
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (8)
As of 2014-10-31 07:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (215 votes), past polls