http://www.perlmonks.org?node_id=1011309

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am not allowed to describe the use case and it will sound weird, but we're trying to figure out the best way of handling thousands of small, event-driven state machines in a Web environment.

We have a standard Moose, Catalyst, DBIx::Class, TT, Postgres database stack and the potential for hundreds or thousands of users using the site at the same time. Each user can have one or more "workflows" attached to a given profile. Users choose which workflows they need or don't need; they'll have maybe three to five workflows at a time. Each workflow is a small state machine that should transition when the user takes a given action (which we're calling an "event", so the term "event-driven" may be misleading), such as checking out a document, editing it, visiting a new section of the site, etc.

We're trying to figure out the best way of writing this. Our primary concerns are performance and accuracy. It's better that the Web site crash than to have inaccurate workflows. We've considered having a small state machine module and representing all of the state machines as JSON and having all of them loaded into memory at once. With thousands of workflows, maybe this works, maybe it doesn't?
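For scale, a sketch of how small such an engine can be. Everything here (the `Tiny::FSM` name, the transition-table shape) is invented for illustration; the `transitions` hash is exactly the kind of structure that could round-trip through JSON, and thousands of these per process would be a trivial memory footprint:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical minimal engine: a workflow definition is a hash of
# state => { event => next_state }, trivially serializable as JSON.
package Tiny::FSM;

sub new {
    my ($class, %args) = @_;
    return bless {
        transitions => $args{transitions},  # { state => { event => next_state } }
        state       => $args{initial},
    }, $class;
}

sub state { $_[0]{state} }

# Apply an event; die loudly rather than silently drop a transition,
# matching the "better to crash than be inaccurate" requirement.
sub handle {
    my ($self, $event) = @_;
    my $next = $self->{transitions}{ $self->{state} }{$event}
        or die "Illegal event '$event' in state '$self->{state}'";
    return $self->{state} = $next;
}

package main;

my $doc_flow = Tiny::FSM->new(
    initial     => 'available',
    transitions => {
        available   => { checkout => 'checked_out' },
        checked_out => { edit => 'editing', checkin => 'available' },
        editing     => { save => 'checked_out' },
    },
);

$doc_flow->handle('checkout');
print $doc_flow->state, "\n";   # checked_out
```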

More importantly, what's the best way to trigger an "event"? Are we going to have to hardcode all of the actions everywhere? I suppose we can start applying roles at runtime to everything, but we're unsure of the best approach. We're not asking you to write the code for us, but any speculation on the best approach would be helpful.


Replies are listed 'Best First'.
Re: Design question: handling hundreds of state machines in a Web context
by Corion (Patriarch) on Jan 02, 2013 at 16:36 UTC

    You'll need to specify in more detail what you mean by accuracy. I assume that "accuracy" relates to the workflow(s) in progress, that is, the state of each state machine. Personally, I would at least write a continuous log of each state change of each machine, to be able to easily replay/restore a crashed session. I'm really fond of pushing the problem of keeping (shared) state to a database, so I would at least store the state, and possibly also the log of the transition(s), of each state machine in database tables.
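    One way that state-plus-log idea might look with plain DBI. The table and column names are made up, and DBD::SQLite stands in for a real server only so the example is self-contained; the point is that the current state and the log row are written in one transaction, so they can never disagree:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(<<'SQL');
CREATE TABLE machine_state (
    machine_id INTEGER PRIMARY KEY,
    state      TEXT NOT NULL
)
SQL
$dbh->do(<<'SQL');
CREATE TABLE transition_log (
    machine_id INTEGER NOT NULL,
    from_state TEXT NOT NULL,
    to_state   TEXT NOT NULL,
    logged_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
SQL

# Record a transition and the resulting state atomically.
sub record_transition {
    my ($dbh, $id, $from, $to) = @_;
    $dbh->begin_work;
    $dbh->do('INSERT INTO transition_log (machine_id, from_state, to_state)
              VALUES (?, ?, ?)', undef, $id, $from, $to);
    $dbh->do('INSERT OR REPLACE INTO machine_state (machine_id, state)
              VALUES (?, ?)', undef, $id, $to);
    $dbh->commit;
}

record_transition($dbh, 42, 'available', 'checked_out');
```

    Replaying a crashed session is then a matter of reading `transition_log` back in order.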

    One system that sounds a bit like what you're doing is Deliantra, a MORPG written by Marc Lehmann, the author of AnyEvent (among other things). I think it supports a fairly large number of clients and think its overall architecture is likely worth investigating.

      "Accuracy" means that if a machine should move from state A to state B, that transition takes place.

        Sure, but at what point do you accept the legal responsibility for the transition with regards to your customer, and how will you handle recovery? Is it OK to replay (a series of) transitions if you keep the state in memory and write the state to disk every five minutes, while keeping a transition log? Do you want/need two-phase commit, where you send a confirmation for each processed transition?

        How consistent does the overall state of the system need to be? Is it OK if all transitions for client A were processed but only the first half of the transitions for client B were? What if A and B own multiple machines? What is the processing order of the transitions? Is it OK to process transitions in parallel across different threads? Is it OK to reorder transitions for a single state machine? Can a transition be cancelled, or can it time out?

        Most of these guarantees should be provided by a proper messaging system, like IBM MQSeries or maybe ZeroMQ nowadays, or alternatively by having all clients write directly to a database. I think you will need one, but I don't have much experience with the advantages and disadvantages of such queue systems.

Re: Design question: handling hundreds of state machines in a Web context
by BrowserUk (Patriarch) on Jan 02, 2013 at 19:21 UTC
    Our primary concerns are performance and accuracy.

    The 'state' of any individual user is simply their ID, their current workflow ID, and the step they are currently on.

    A workflow is its ID, and (making a few assumptions), a list of templates representing the appropriate form for each step.

    Each of these can be stored in a table indexed by ID. When a user logs in, requesting two such small pieces of data directly by their primary index should be absolutely no problem for any RDBMS worthy of the name, even with 1000s of concurrent users. (Since they will spend most of their logged-on time staring at the screen or typing, the load on the DB will be minimal.) So performance should be no problem.
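    Those two primary-key fetches might look like this with plain DBI. The `user_state` and `workflow` tables and their columns are invented for illustration, and SQLite stands in for a real RDBMS so the snippet runs anywhere:

```perl
use strict;
use warnings;
use DBI;

# Invented schema; SQLite stands in for Postgres to keep this self-contained.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE user_state (user_id     INTEGER PRIMARY KEY,
                                   workflow_id INTEGER NOT NULL,
                                   step        TEXT NOT NULL)');
$dbh->do('CREATE TABLE workflow   (workflow_id INTEGER PRIMARY KEY,
                                   definition  TEXT NOT NULL)');

$dbh->do('INSERT INTO user_state VALUES (1, 7, ?)', undef, 'editing');
$dbh->do('INSERT INTO workflow   VALUES (7, ?)', undef,
         '{"states":["editing","done"]}');

# The two keyed lookups done at login:
my $user = $dbh->selectrow_hashref(
    'SELECT * FROM user_state WHERE user_id = ?', undef, 1);
my $flow = $dbh->selectrow_hashref(
    'SELECT * FROM workflow WHERE workflow_id = ?', undef,
    $user->{workflow_id});
```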

    As for accuracy, once the user clicks submit on any given step, there are a few possibilities:

    1. The transmission gets lost in transit.

      Assuming you aren't using too much client-side trickery, the browser will time out and the user should be able to click the back button and attempt to re-submit.

      The user's expectations of reliability should not be high if they fail to successfully submit, as the intervening internet is not your responsibility.

    2. The submit completes, but your webserver crashes before the state is saved.

      Use a good webserver and test your CGIs thoroughly.

    3. The webserver extracts the formdata and issues the DB update, but the DB server crashes.

      Use a reliable DB server.

    4. The update happens but is subsequently lost due to a DB server crash or disk failure.

      This is bread and butter DB stuff. Use a reliable RDBMS.

    In essence, on the basis of your description, this sounds very similar -- in terms of data flows -- to any multi-step ordering/shopping-basket/booking process already actioned on a million websites. I.e. nothing extraordinary.

    Use reliable tools and don't get too complicated with either your client-side or server side processing.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Design question: handling hundreds of state machines in a Web context
by RichardK (Parson) on Jan 02, 2013 at 18:34 UTC

    I agree with Corion: store your state in an ACID database, then state changes can happen in transactions. I'd choose PostgreSQL, but YMMV.

    From what you've described, I don't think you'd get much advantage from a queueing system but without seeing a full & detailed spec it's difficult to tell :)

Re: Design question: handling hundreds of state machines in a Web context
by sundialsvc4 (Abbot) on Jan 02, 2013 at 21:07 UTC

    It appears to me, technically speaking, that you actually have a pretty well-defined situation: each Web session can be identified (through login information maintained by cookies) as belonging to a particular “customer,” hence to a particular (somehow selected) “state machine” (“workflow”), and hence, per-session, to a particular flow point (“present state”) within that workflow, by which the current set of POST or GET inputs can be responded to.

    There are already numerous workflow-driven architectures on CPAN, e.g. POE, which, even if they are not entirely applicable, can certainly be used as architectural examples. Yes, you probably will need to “hard-code the actions,” and there are many such examples of potential infrastructure. (Edit: AnyEvent? Others? Definitely check ’em out. CPAN’s your oyster and your cornucopia...)

    Thinking off the shelf about this, I think that the individual request-processing sequence (as implemented, e.g., by Catalyst, and supported by Postgres or whatever other state backing store) would be:

    1. Identify the user session in the customary way.   (Catalyst handles this...)
    2. From the session information, determine what state-machine definition is being used and the present state.   Instantiate that state-machine and set it to the present state.
    3. Submit the inputs to the state machine and gather its response.
    4. Update the new-state information into the session data store.
    5. Return the information provided by the state-machine to the user.
    6. Clean-up in anticipation of the forthcoming next request.
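    As a rough pure-Perl sketch of those six steps: plain hashes stand in for the session store and the workflow definitions, and nothing here is actual Catalyst API; it only shows the load-dispatch-persist shape of each request.

```perl
use strict;
use warnings;

# Invented workflow definitions, keyed by name; each is
# state => { event => next_state }, loadable from JSON/DB in practice.
my %workflow_defs = (
    doc_review => {
        transitions => {
            draft     => { submit  => 'in_review' },
            in_review => { approve => 'published', reject => 'draft' },
        },
    },
);

# Stand-in for the per-session state backing store.
my %session_store = (
    'sess-1' => { workflow => 'doc_review', state => 'draft' },
);

sub handle_request {
    my ($session_id, $event) = @_;
    # 1-2. Identify the session; load its workflow definition and state.
    my $sess = $session_store{$session_id} or die "Unknown session";
    my $def  = $workflow_defs{ $sess->{workflow} };
    # 3. Submit the input to the state machine; refuse illegal events.
    my $next = $def->{transitions}{ $sess->{state} }{$event}
        or die "Illegal event '$event' in state '$sess->{state}'";
    # 4. Persist the new state (here: just the in-memory store).
    $sess->{state} = $next;
    # 5. Return the machine's response to the caller.
    return $next;
}

print handle_request('sess-1', 'submit'), "\n";   # in_review
```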

    In this model, it doesn't matter how many users there are, nor how many finite state machines (FSMs) there are, as long as the state machines follow some predictable taxonomy constructed from a reasonably flexible set of underlying, Perl-implemented primitive actions. You instantiate the FSM, feed it inputs, save its new state, return its outputs to the client, and you're done. That is certainly a well-trod footpath... and infinitely scalable.