techcode has asked for the wisdom of the Perl Monks concerning the following question:
Each record is small, but there are many of them: 10 per second on average over a month, with peaks of ~50/sec (and I'm hoping to handle at least 200/sec). Data is received, stored, processed (which involves one or more exchanges with remote servers), updated (perhaps several times), and finally moved or deleted.
Data is stored in a PostgreSQL DB, and during testing I noticed that the system I'm implementing works rather fast with small amounts of data. But performance degrades quickly as new data is added and old data is updated/deleted, so the blame seems to fall on PostgreSQL.
I already cache all the data I can in RAM (hashes and such) and bundle many (up to 1000) updates/deletes into one transaction (eval { foreach (..) { ... } }; if ($@) { ..rollback.. }) where I can. I'm now also thinking about calling 'vacuum full' every now and then from the processes themselves :) I tried that just as a test, and in simple tests it seems to keep the processing time of that state/process constant instead of increasing as data passes through it. But I ran it after every transaction (1000 records).
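The batching looks roughly like this (a minimal sketch; the DSN, table, and column names are placeholders, not my real schema):

    use strict;
    use warnings;
    use DBI;

    # Placeholder DSN/credentials -- AutoCommit off so the whole batch
    # is one transaction, RaiseError so failures land in the eval below.
    my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                           { AutoCommit => 0, RaiseError => 1 });

    # Example batch of up to 1000 records to update in one transaction.
    my @batch = map { { id => $_, status => 'done' } } 1 .. 1000;

    my $sth = $dbh->prepare('UPDATE data SET status = ? WHERE id = ?');
    eval {
        $sth->execute($_->{status}, $_->{id}) for @batch;
        $dbh->commit;                      # one commit for the whole batch
    };
    if ($@) {
        warn "batch failed, rolling back: $@";
        $dbh->rollback;
    }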
Anyway, from the time data is received until it's processed completely, it goes through several "states". Each state has its own process that connects to the DB and then, in an endless loop, reads records from its queue, processes them, and updates/deletes data. Each state has its own queue table in the DB, usually containing just a serial/auto_increment value and a foreign key referencing the data in the central table, though some queues carry all the data that particular state needs for processing.
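Each state's worker looks roughly like this (a sketch; queue_foo, data_id, and process() are placeholder names):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                           { AutoCommit => 0, RaiseError => 1 });

    while (1) {
        my $rows = $dbh->selectall_arrayref(
            'SELECT id, data_id FROM queue_foo ORDER BY id LIMIT 1000',
            { Slice => {} },
        );
        unless (@$rows) {
            $dbh->commit;                  # don't idle in an open transaction
            sleep 1;
            next;
        }
        for my $row (@$rows) {
            process($row->{data_id});      # the remote-server work happens here
            $dbh->do('DELETE FROM queue_foo WHERE id = ?', undef, $row->{id});
        }
        $dbh->commit;                      # one transaction per batch
    }

    sub process { my ($data_id) = @_; }    # placeholder for the real work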
I'm considering testing whether it would be better to have only the central table, with a state field added to it, where each process would fetch its own work with SELECT * FROM data WHERE state = 'my state'; and all of them would update that one table.
That would mean far fewer new and deleted records: my estimate is that for each "real data" record we create/update/delete ~10 supporting records (that's how I inherited the system). On the other hand, it would mean many processes all hitting that one table ...
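If I go that way, the claim step in each worker might look roughly like this (a sketch; the table, column, and state names are placeholders). FOR UPDATE would keep two workers of the same state from grabbing the same rows, at the cost of them blocking each other:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                           { AutoCommit => 0, RaiseError => 1 });

    # Claim a slice of work from the single central table.
    my $rows = $dbh->selectall_arrayref(
        'SELECT id FROM data WHERE state = ? ORDER BY id LIMIT 100 FOR UPDATE',
        { Slice => {} }, 'my_state',
    );
    for my $row (@$rows) {
        # ... this state's processing for $row->{id} goes here ...
        $dbh->do('UPDATE data SET state = ? WHERE id = ?',
                 undef, 'next_state', $row->{id});
    }
    $dbh->commit;    # releases the row locks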
Another idea I'm playing with is to streamline the pipeline and have only one (or perhaps two) processes (though forking would be required) that do everything that needs to happen to the data, start to finish. It's tricky, though, because data is received (and sometimes sent back over the same connection) via a TCP-based server (a protocol with long-lasting TCP connections). Processing can mean just forwarding the data to another server, or also doing a DNS query, an HTTP request, etc. And anywhere from seconds to minutes to hours (though less than a day or two) can pass from start to finish.
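The shape I have in mind is roughly this (a sketch; the port number and handle_connection() are placeholders for the real protocol code):

    use strict;
    use warnings;
    use IO::Socket::INET;
    use POSIX ':sys_wait_h';

    my $server = IO::Socket::INET->new(LocalPort => 7000, Listen => 64,
                                       ReuseAddr => 1) or die "listen: $!";

    while (my $client = $server->accept) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                        # child owns the connection
            $server->close;
            handle_connection($client);         # receive, process (DNS, HTTP,
            exit 0;                             # ...), reply - start to finish
        }
        $client->close;                         # parent just keeps accepting
        1 while waitpid(-1, WNOHANG) > 0;       # reap finished children
    }

    sub handle_connection { my ($sock) = @_; }  # placeholder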
The main data table is partitioned, one child table per day, with old data (at most 3 days) moved into a separate archive DB. Most of the code reads directly from the child table for the current date (the fallback is to search the parent table), and usually only a few fields per row get updated. The other tables, which pretend to be FIFO queues, hold a small number of records at any given time but are constantly updated and deleted from; admins vacuum them by hand every couple of hours (full vacuum has been known to fail and cripple the system, so it isn't automated).
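The per-table vacuum the admins run by hand could be scripted along these lines (a sketch; the queue table names are placeholders). Unlike VACUUM FULL, plain VACUUM doesn't take an exclusive lock, so the workers can keep running; note that VACUUM can't run inside a transaction block, hence the handle with AutoCommit on:

    use strict;
    use warnings;
    use DBI;

    # Separate maintenance handle: VACUUM refuses to run inside a
    # transaction block, so AutoCommit must be on here.
    my $maint = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'pass',
                             { AutoCommit => 1, RaiseError => 1 });

    for my $table (qw(queue_foo queue_bar)) {   # placeholder queue tables
        $maint->do("VACUUM ANALYZE $table");
    }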
Any thoughts on what to try, whether on the Perl side, the DB side, or elsewhere? A pointer to a reference or document would also be welcome. Thanks.