Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Working with large number of small constantly updated records

by techcode (Hermit)
on Apr 27, 2009 at 20:12 UTC ( #760419=perlquestion: print w/ replies, xml ) Need Help??
techcode has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm working on a project that involves managing data and TCP communication (various protocols).

Data itself is small in size (per each record), but there are many records - 10 per second is monthly average with peaks of ~50/sec (and I'm hoping to be able to work with at least 200/sec). Data is received, stored, processed (involves one or several things communicated with remote servers), updated (perhaps updated several times), and moved/deleted.

Data is stored in PostgreSQL DB - and during testing I noticed that with small amounts of data system that I'm implementing works rather fast. But performance degrades very fast with new data added and old updated/deleted. So the "blame" falls on PostgreSQL it seems.

I already cached all the data that I can in RAM (hashes and such), bundled many (up to 1000) updates/deletes in one transaction (eval { foreach(..){ ...} }; if($@){..rollback..}) where I could. And am now thinking about calling 'vacuum full' every now and then from the processes them-self :) Which I did just for the test, and from the simple tests it seems to keep the processing time of that state/process constant, instead of increasing as data passes through it. But I did it after every transaction (1000 records).

Anyway from the time data is received - until it's processed completely, it goes through several "states". Each state has it's own process that connects to DB and then in endless loop reads records in it's queue, processes them and updates/deletes data. Each state has it's queue table in DB containing usually just serial/auto_increment value, and another foreign key field to reference the data in it's central table. Though some queues contain all the data that particular state needs for processing.

I'm considering testing out if it would be better to have only the central table by adding state field to it - where each process would get data for itself by saying SELECT * FROM data WHERE state = 'my state'; and all of them would be updating that one table.

That way would mean far less new and deleted records - my estimate is that for each "real data" record we have ~10 supporting records created/updated/deleted - that's how I inherited the system. Though on the other end it would mean many processes accessing just that one table ...

Another idea I'm playing with in my head, is to streamline the process and only have one (or perhaps two) process (though forking would be required) that would do everything that needs to be done with the data from start to finish. But it's tricky because data is received (and sometimes sent back via same connection) via TCP based server (long lasting TCP connection protocol). Processing then can mean just sending data to another server or also doing a DNS query, HTTP request ...etc. And several seconds/minutes/hours (though less than a day or two) can pass from start to finish.

The main data table is partitioned, table per day, with old data (max 3 days) moved into separate archive DB. Most of the code reads data directly from child table for current date (backup is to search in parent table). And it's usually having only a few fields updated to it. Other tables that are pretending to be FIFO queues contain small amount of records at any given time - but are constantly updated and deleted from - and are vacuumed by admins every couple of hours by hand (since full vacuum knows to fail and cripple the system - so it's not automated).

Any thoughts about what to try out, whether on Perl side, DB or something else? Perhaps some reference, document, whatever ... Thanks.


Have you tried freelancing/outsourcing? Check out Scriptlance - I work there since 2003. For more info about Scriptlance and freelancing in general check out my home node.

Comment on Working with large number of small constantly updated records
Re: Working with large number of small constantly updated records
by jettero (Monsignor) on Apr 27, 2009 at 20:16 UTC

    Cache::Memcached?

    I've seen people do some really clever things by calling sql functions that try the cache first, if not there, then fall back on DBI. Then you need a protocol for dropping cache entries (or deleting them from some other node) when things go stale.

    -Paul

      Thanks for suggestion but I'm not sure how would Memcache fit in since:

      If you have a high-traffic site that is dynamically generated with a high database load that contains mostly read threads then memcached can help lighten the load on your database.
      Though there is one part of the system where caching could help - in cases where we recently did some of those remote stuff (DNS/HTTP) so we wouldn't need to to it again if it's in cache.

      The thing here is that there are a lot of updates - and it is very important to maintain in what "state" the processing of a record is. Everything is payed per record by both our clients to us and us to business associates :)


      Have you tried freelancing/outsourcing? Check out Scriptlance - I work there since 2003. For more info about Scriptlance and freelancing in general check out my home node.
Re: Working with large number of small constantly updated records
by ELISHEVA (Prior) on Apr 27, 2009 at 20:45 UTC
      I still haven't done any real profiling - other than running the Perl code without database, and then with - and timing both.

      One part of the system observed works as: get record(s) from DB - find longest possible match in Trie (HoHoH... ref). Though I also played around with ordinary hashrefs and adding a char by char in a loop and trying to find it in one large (one "dimension") hashref. Trie's performance degrades with depth, and this is faster for deeper searches - but they meet around length where most of records that we are matching against are. After it finds it, it updates data back to DB. Old version of that part of system is implemented as Pg stored procedure - that loops by adding char by char and for each loop, trying to find it with SELECT * FROM table WHERE whatever = ?. Similar as Perl code I tested beside Trie - but runs through whole table (ok it's indexed and all but still ...) for each loop, and we are either way trying to free up DB from all but necessary work.

      And without DB (just a for(1..1000000){ looking it up }) we are talking about a range of 50K (or was it 500K) matches per second on ordinary desktop hardware (AMD Dual Core, 1 GB Ram - single channel DDR2, sata2 HD). With DB involved it's around 500 per second, and I tried 500, 1000 and 2000 records in one transaction: fetching that many with ... LIMIT X, starting transaction, then doing work in loop with updates/deletes, and committing at the end. I'm considering trying to do less - maybe 200, but actually building one large SQL string ("UPDATE ... rec1; DELETE ... rec1_queue; UPDATE ...rec2; DELETE rec2_queue; ...") and running it under one execute/transaction.

      Both PostgreSQL and Perl code running on same machine for those tests. Real DB is of course on separate DB server.


      Have you tried freelancing/outsourcing? Check out Scriptlance - I work there since 2003. For more info about Scriptlance and freelancing in general check out my home node.
Re: Working with large number of small constantly updated records
by erix (Vicar) on Apr 27, 2009 at 21:51 UTC

    I'm wondering, what version of postgres? What is the vacuum regime like? Is autovacuum running? Is it keeping up with updates? How large the table(s) /partitions?

    What is the hardware like? how much memory is there? How much is available to postgres?

    If it turns out that the lack of performance is indeed due to the database, you may be better off posting on pgsql-performance@postgresql.org (make sure you include pertinent detail, mainly postgresql.conf settings, basic hardware data, load characteristics etc).

    Btw, vacuuming is necessary, but be careful with VACUUM FULL: it's slow and needs a table lock (normal VACUUM doesn't).

      Version that I'm working on is 8.3.5 - no autocacuum, though I run it manually basically before (batch of) tests and now playing with vacuuming them from within Perl code. My test data is small - I test with 10.000 to 100.000 records so sizes are small.

      Real system has ~5 new records per second + all the queue tables for each of those (and it's the queue tables that are being updated/deleted from) and the idea is that they are "small" or preferably as close to 0 records as possible. It's relatively recent 8.x version running on a dedicated server double Xeon something with 2 (or perhaps 4) GB of RAM. At most data that is kept there is for last 3 days - those queue tables are vacuumed every couple of hours (manually) since each of them gets ~ 500K records added and deleted each day. Main data table being partitioned per day (so ~ 500K records that are added and from time to time updated - per partition - 3 partitions) each partition deleted after 3 days - so no vacuum is running on them.

      I know that full vacuum is locking the whole table - the fact that there are only two processes working with any given queue table (previous state process, and the one processing that queue/state) make it acceptable.

      IMHO the problem is that DB's aren't primary meant to be queues where in one day you get 5 million records added/deleted (estimate for 10 queue tables). But at the same time you need transactions to be sure where each and every record is in the processing. OTOH all servers are having up-time of up to 3 years - so it might be OK to do everything in RAM and occasionally just update the DB. Another alternative I'm looking at is the DBM::Deep since it sales pitch is quite impressive:

      A unique flat-file database module, written in pure perl. True multi-level hash/array support (unlike MLDBM, which is faked), hybrid OO / tie() interface, cross-platform FTPable files, ACID transactions, and is quite fast. Can handle millions of keys and unlimited levels without significant slow-down.
      In this system everything is called by it's Key/ID (anyone mentioned hashes?), or in FIFO sequence (arrays?). With only some reports needing real SQL. Of course using DBM::Deep would give the PHP guys that are working on administration interface (web) quite a headache :) So I'm guessing I'll try to just shift those queues to DBD::Deep and see what happens.

      Have you tried freelancing/outsourcing? Check out Scriptlance - I work there since 2003. For more info about Scriptlance and freelancing in general check out my home node.

        If autovacuum is off on an 8.3 system, someone disabled it. Why was that done? It may be causing problems; especially as you say that running vacuum 'manually' fixes the problem for a particular process. What happens if you put it back on?

        Also, isn't there an index missing from some query, or something similarly simple? Has ANALYZE been run regularly?

        (Of course, this only applies when you are actually sure the db is the bottleneck.. )

Re: Working with large number of small constantly updated records
by BrowserUk (Pope) on Apr 27, 2009 at 23:38 UTC
Re: Working with large number of small constantly updated records
by NetWallah (Abbot) on Apr 28, 2009 at 05:30 UTC
    What you are describing appears pretty close to what the free RRDtool was designed for (Platform independent AND free!).

    If you use that, you get amazing performance, as well as automatic database pruning/aging, and zero maintainance, because the database size never changes. You also get graphing tools built-in!

    If you are analyzing for trends, rather than absolute data points, then this is the tool to use.

         ..to maintain is to slowly feel your soul, sanity and sentience ebb away as you become one with the Evil.

Re: Working with large number of small constantly updated records
by BrowserUk (Pope) on Apr 28, 2009 at 06:52 UTC

    See PgQ (pdf) for an interesting look at the various ways of implementing message queues on top of an RDBMS (handily Pg), along with their limitations and finally their solution--which they naturally claim addresses all the limitations.

    Their solution is OSS (BSD) but in Python. If that's a barrier, then at least their design and analysis may be of interest.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      For a wider overview, incl. pgq, see also this talk about the very scalable Skype architecture.

      update: fixed that link. thanks :) (much more interesting stuff on that wiki, btw)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://760419]
Approved by jettero
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (7)
As of 2014-12-29 01:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (183 votes), past polls