http://www.perlmonks.org?node_id=971615

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks,

I have a script which is fairly simple but requires loading a huge amount of data from disk. I want to be able to run it from a web script, but the load time for the data is enormous (~30 seconds).

I see two solutions:

1. Use something like mmap to persist the data between calls to Perl; see the sketch just after this list. (I am not sure, but I think this may happen automatically thanks to the page cache. I am running Linux, btw.) I thought I *might* need a super-simple holder process that keeps the data in memory and does nothing more.

2. Use a client-server scheme. I like this less because of possible issues like memory leaks. Ideally, it would be set up so that the "user" enters a line via telnet and reads back a one-line response. (Yes, I'll firewall the ports for safety and validate inputs.) I saw the Net::Server and Daemon modules on CPAN. Is either preferred?
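
Here is a minimal sketch of what I mean by option 1, using the CPAN module File::Map (the file path and the lookup are placeholders, not my actual data layout):

    #!/usr/bin/perl
    # Sketch only: map the data file instead of slurping it, so repeat
    # runs are served from the kernel page cache rather than re-read.
    use strict;
    use warnings;
    use File::Map 'map_file';

    map_file my $data, '/path/to/big.dat';   # mmap(2); no copy into the heap

    # Pages fault in lazily from the page cache, so after the first run
    # a fresh perl process starts answering almost immediately.
    my $query = shift // die "usage: $0 QUERY\n";
    my $pos   = index $data, $query;
    print $pos >= 0 ? "found at byte offset $pos\n" : "not found\n";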

Ideally, I would like to be able to run the process afresh on each request (less likely to accumulate memory leaks, etc.).

Any wisdom on which way to go would be appreciated,

Padewan

Replies are listed 'Best First'.
Re: Big data but need fast response time
by BrowserUk (Patriarch) on May 21, 2012 at 15:09 UTC

    Maybe something as simple as this is all you need? It loads a dictionary into a hash as the dataset, and can service 100,000 requests from each of 4 concurrent clients at a rate of ~40,000 requests per second.

    Note: This is a cut-down version of a server that expects each client to make many requests over a persistent connection. If your clients would only make a single request per connection, I'd use a thread-pool architecture instead, but I'd expect the throughput to be at least as good.

    #! perl -slw
    use strict;
    use threads ( stack_size => 4096 );
    use threads::shared;    # needed for the :shared attribute below
    use IO::Socket;

    use constant {
        SERVERIP   => '127.0.0.1',
        SERVERPORT => 3000,
        MAXBUF     => 4096,
    };

    # Render a packed sockaddr as "host:port" for logging.
    sub s2S {
        my( $p, $h ) = sockaddr_in( $_[0] );
        $h = inet_ntoa( $h );
        "$h:$p";
    }

    # Load the dataset once: one line per key, line number as the value.
    my %DB :shared;
    chomp, $DB{ $_ } = $. while <>;
    close *ARGV;

    my $lsn = IO::Socket::INET->new(
        LocalHost => SERVERIP,
        LocalPort => SERVERPORT,
        Reuse     => 1,
        Listen    => SOMAXCONN,
    ) or die $!;

    print "Listening...";

    # One cheap (4KB-stack) thread per connected client.
    while( my $client = $lsn->accept ) {
        async {
            while( 1 ) {
                $client->recv( my $in, MAXBUF );
                unless( length $in ) {    # zero-length read: client gone
                    print "Disconnected from ", s2S $client->peername;
                    shutdown $client, 2;
                    close $client;
                    last;
                }
                print "Received $in from ", s2S $client->peername;
                my( $cmd, @args ) = split ' ', $in;
                if( $cmd eq 'FETCH' ) {
                    $client->send( $DB{ $args[ 0 ] } );
                }
                else {
                    $client->send( 'Bad command' );
                }
            }
        }->detach;
    }
    sleep 1e9;    # park the main thread (effectively forever)
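
    A throwaway client for poking at it might look like this (a hypothetical test harness; the host and port match the constants above):

        use strict;
        use warnings;
        use IO::Socket::INET;

        my $sock = IO::Socket::INET->new(
            PeerAddr => '127.0.0.1',
            PeerPort => 3000,
            Proto    => 'tcp',
        ) or die $!;

        $sock->send( "FETCH $ARGV[0]" );    # e.g. perl client.pl someword
        $sock->recv( my $reply, 4096 );
        print "$reply\n";
        close $sock;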


      Very nice. I suppose I could have my web script check that the process is running and start it if it isn't running yet. (I might even keep your "sleep" so that my server processes expire every so often as a memory-leak counter-measure.)

      Quick question -- what is the ":shared" in:

      my %DB :shared;

      I expect somewhere between 1 and 100 simultaneous clients. Each connection should last no more than 1 second and carry 10k+ queries.

        I might even keep your "sleep" so that my server processes expire every so often as a memory-leak counter-measure.

        You'll probably want to set it to something less than 31 years then :)

        I expect somewhere between 1 and 100 simultaneous clients. Each connection should last no more than 1 second and carry 10k+ queries.

        Then I'd go for a thread-pool server, something like this (again, a cut-down version of pre-existing test code, not production-ready code):

        #! perl -slw
        use strict;
        use threads;
        use threads::shared;    # needed for the :shared attribute below
        use Thread::Queue;
        use IO::Socket;

        use constant {
            SERVERIP   => '127.0.0.1',
            SERVERPORT => 3000,
            MAXBUF     => 4096,
        };

        our $R //= 8;    # pool size; override with -R=n on the command line

        # Render a packed sockaddr as "host:port" for logging.
        sub s2S {
            my( $p, $h ) = sockaddr_in( $_[0] );
            $h = inet_ntoa( $h );
            "$h:$p";
        }

        # Load the dataset once: one line per key, line number as the value.
        my %DB :shared;
        chomp, $DB{ $_ } = $. while <>;
        close *ARGV;

        my $Q        = Thread::Queue->new;    # filenos of accepted connections
        my $Qcleanup = Thread::Queue->new;    # filenos ready to be closed

        # Socket objects can't travel through a Thread::Queue, so the listener
        # queues the file descriptor number; each responder dups it back into
        # a handle and re-blesses it as an IO::Socket::INET.
        sub responder {
            my $tid = threads->tid;
            while( my $fileno = $Q->dequeue() ) {
                print "[$tid] Servicing fileno: $fileno";
                open my $client, '+<&=', $fileno or die $!;
                bless $client, 'IO::Socket::INET';
                while( 1 ) {
                    $client->recv( my $in, MAXBUF );
                    unless( length $in ) {    # zero-length read: client gone
                        print "Disconnected from ", s2S $client->peername;
                        shutdown $client, 2;
                        close $client;
                        $Qcleanup->enqueue( $fileno );
                        last;
                    }
                    print "Received $in from ", s2S $client->peername;
                    my( $cmd, @args ) = split ' ', $in;
                    if( $cmd eq 'FETCH' ) {
                        $client->send( $DB{ $args[ 0 ] } );
                    }
                    else {
                        $client->send( 'Bad command' );
                    }
                }
            }
        }

        threads->create( \&responder )->detach for 1 .. $R;

        my $lsn = IO::Socket::INET->new(
            LocalHost => SERVERIP,
            LocalPort => SERVERPORT,
            Reuse     => 1,
            Listen    => SOMAXCONN,
        ) or die $!;

        my @clients;
        print "Listening...";

        while( my $client = $lsn->accept ) {
            my $fileno = fileno( $client );
            $clients[ $fileno ] = $client;    # hold a ref so the fd stays open
            print "[0] queueing ", $fileno;
            $Q->enqueue( $fileno );
            close $clients[ $Qcleanup->dequeue ] while $Qcleanup->pending;
        }

        In a test running 8 responders, it served 1000 responses to each of 100 concurrent clients and achieved an average response time at the clients of 0.002 seconds.

        That's with clients and server running on the same box, so no network latency. On the other hand, it's also 8 server threads and 100 client threads all running on the same box, which will obviously adversely affect server responsiveness.

        Here's a graph of its response times using 8 threads to respond to 16, 32, 64, 100, & 128 concurrent clients.



        I see - it is an attribute. I shoulda RTFM.
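
        For the record, :shared is the attribute form of threads::shared: it makes the one hash visible to every thread, where an ordinary variable is cloned separately into each new thread. A minimal illustration:

            use strict;
            use warnings;
            use threads;
            use threads::shared;

            my %db :shared;    # one copy, seen by all threads
            $db{answer} = 42;

            threads->create( sub {
                print "thread sees $db{answer}\n";    # prints 42
            } )->join;
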
Re: Big data but need fast response time
by RichardK (Parson) on May 21, 2012 at 14:01 UTC

    Isn't a process that manages a large amount of data called a database?

      LOL!

      My data structure is more like a Bloom filter. (Think protein sequences.)

      Well, SQLite is a database, but it suffers from the issues above. And neither SQLite nor MySQL has the space-efficient storage that I need. I have tried a few DBMs (gdbm, cdb) -- they seem to be too slow because of their disk-based B-trees. Is there a client-server DBM that can cache/persist MRU data in memory? That might work, although it would probably be less space-efficient than my in-memory approach. (Maybe Tokyo Cabinet or LevelDB?)
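
      For what it's worth, a Bloom filter is cheap to hand-roll in Perl with vec(); a sketch of the idea, with the filter size, hash count, and sample key all invented:

          use strict;
          use warnings;
          use Digest::MD5 'md5';

          my $BITS   = 8 * 2**20;              # 8 Mbit filter -- made-up size
          my $K      = 4;                      # 4 hash functions
          my $filter = "\0" x ( $BITS / 8 );

          # Derive $K bit positions from one MD5 digest of the key.
          sub positions {
              my @words = unpack 'N4', md5( $_[0] );
              return map { $words[$_] % $BITS } 0 .. $K - 1;
          }

          sub bloom_add {
              vec( $filter, $_, 1 ) = 1 for positions( $_[0] );
          }

          sub bloom_check {
              for ( positions( $_[0] ) ) {
                  return 0 unless vec( $filter, $_, 1 );    # definitely absent
              }
              return 1;                                     # probably present
          }

          bloom_add( 'MKTAYIAKQR' );    # a made-up peptide
          print bloom_check( 'MKTAYIAKQR' ) ? "maybe\n" : "no\n";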

        All databases should cache MRU data -- MySQL has a really good read-only cache.

        But performance will heavily depend on how you map your data structure to the schema.

        For a purely Perl approach I think I might try Tie::File and split the data into fixed-length lines, but without knowing more about your problem it's difficult to tell ;)
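
        A minimal sketch of that, with the filename, cache size, and record number invented:

            use strict;
            use warnings;
            use Tie::File;

            # Records are read lazily; only a bounded cache is held in RAM.
            tie my @records, 'Tie::File', 'records.dat', memory => 20_000_000
                or die "tie failed: $!";

            print $records[ 123_456 ], "\n";    # touches only what it needs
            untie @records;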

Re: Big data but need fast response time
by Anonymous Monk on May 21, 2012 at 14:38 UTC
    Easy -- Tie::Hash
      Try -- Cache::Mmap -- benchmark your results.
        Cache::Memcached
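
        If running an external cache daemon is acceptable, the Cache::Memcached client side is roughly this (the server address, key, and load_record are placeholders):

            use strict;
            use warnings;
            use Cache::Memcached;

            my $memd = Cache::Memcached->new( {
                servers => [ '127.0.0.1:11211' ],    # assumed local memcached
            } );

            my $key = 'protein:P12345';
            my $val = $memd->get( $key );
            unless ( defined $val ) {
                $val = load_record( $key );    # hypothetical slow loader
                $memd->set( $key, $val );      # later hits skip the disk
            }
            print "$val\n";
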
Re: Big data but need fast response time
by mrguy123 (Hermit) on May 22, 2012 at 13:40 UTC
    Hi,
    An option I like to use in situations like this is to send a mail when the work is done. I know it's not an ideal solution, but it is used fairly often in bioinformatics web programs (which I assume is your field).
    The trick in this case is to fork the process so that the child process does all the work while the parent process prints the output message (e.g. "a mail will be sent...").
    If this is an option, I can give you some more code for the forking process.
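
    The usual shape of that pattern is something like this (a sketch: do_long_work and send_mail stand in for the real job and notification):

        use strict;
        use warnings;

        my $pid = fork;
        die "fork failed: $!" unless defined $pid;

        if ( $pid ) {
            # Parent: answer the web request straight away.
            print "Content-type: text/plain\n\n";
            print "Job accepted; a mail will be sent when it finishes.\n";
        }
        else {
            # Child: detach from the web server, then do the slow work.
            close STDIN;  close STDOUT;  close STDERR;
            do_long_work();                     # placeholder
            send_mail( 'user@example.org' );    # placeholder
            exit 0;
        }
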
    Good luck with your research
    Mister Guy
