To answer your first few comments:
- get more user time from cpu
Not an option. We have an installed base of servers--this is a fixed resource. And this task is sharing with a CPU hungry daemon that always gets priority (Squid).
- get a real db and maybe offload to another machine
Another machine is not an option. Again with the fixed resources conundrum. A real db is an option, but I believe if I fix my "pathological" use of BerkeleyDB I won't need to, and would actually lose performance in the bargain. A real relational database can never be as fast as an equally tuned key:value database when used for storing key:value information--the relational layer, the SQL layer, and a bunch of other stuff ensures that.
- concentrate on raw writes within a guaranteed time rather than doing processing while writing, etc.
This program will already run at an extreme niceness level (the Squid gets what she wants, everybody else waits his turn). The File::Tail module is handling the 'when you've got the time, I'd like to do some work' catching up stuff. This program is doing an important job, but one that has to take a back seat--which is why efficiency is important: it can't fall too far behind and still provide reasonable functionality.
But it seemed like you must have something pathological happening if a 200/sec routine drops below 10/sec.
I agree. There is something pathological--I believe in the parent insertion routines. It's just a matter of how to fix the pathology without killing the patient.
Next interesting thought:
- Throw away all those joins and splits. No more strings, throw away all that "http" and "Inserting parent" etc. Too much Perl going on.
And replace them with what? The functionality is complete at this point--but is too slow. We can't lose functionality to fix the speed, as a useless program that runs really fast is still a useless program. ;-) I am beginning to suspect that using a more complex database that can handle some of the 'think' for me, rather than doing it all myself in Perl, might be worth the tradeoff. If I were using a relational DB, the parent->number_of_children could be incremented by accessing that particular entry rather than pulling in and parsing the whole parent structure--behind the scenes it still has complexity, but maybe the MySQL folks know how to handle that complexity much better than me and Perl do.
As I've said in another post this morning, the problem, I believe, boils down to the parent->child relationship. How do I keep it current, accurate, and make it go fast? The most expensive thing in most add_entry calls is probably the second add_entry call that is made to handle the parent relationship. So if I can make the parent relationship cheap, I can fix the slowdown over time. I think.
Anyway, thanks for the links and the pointers. I'm going to stick with BerkeleyDB for the time being, and attempt to make my code a little less pathological.
The first step for making things less pathological is to:
Add an in-memory hash to speed exists? checks for the parent.
Reduce the size of the parent entries, probably converting to a multiple entry system, to prevent changing the size of the parent (a number_of_children field with the kids referenced by $parentmd5:number). Hopefully this is cheaper than the current split/join fun that happens when inserting a child into a parent's child list.
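For what it's worth, here is a rough sketch of how those two steps might fit together. It is not my actual code: the sub names (parent_exists, add_child, children) and the "$parentmd5:count" key are made up for illustration, and %db is a plain hash standing in for the tied BerkeleyDB hash so the snippet runs on its own. The count is packed to a fixed four bytes so the parent record never changes size, and each child lives under its own "$parentmd5:number" key, so no split/join is needed on insert. It does not check for duplicate children.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Stand-in for the tied BerkeleyDB hash (assumption: the real program
# would tie %db to the database instead of using a plain hash).
my %db;

# In-memory cache of parents already known to exist, so repeated
# existence checks never touch the database.
my %seen_parent;

# Hypothetical helper: true if the parent has a count record.
sub parent_exists {
    my ($parent_md5) = @_;
    return 1 if $seen_parent{$parent_md5};
    return 0 unless exists $db{"$parent_md5:count"};
    $seen_parent{$parent_md5} = 1;
    return 1;
}

# Hypothetical helper: append one child without rewriting a child list.
# Returns the slot number the child was stored under.
sub add_child {
    my ($parent_md5, $child_md5) = @_;
    unless (parent_exists($parent_md5)) {
        # Fixed-width counter: 4 bytes, big-endian, never grows.
        $db{"$parent_md5:count"} = pack('N', 0);
        $seen_parent{$parent_md5} = 1;
    }
    my $n = unpack('N', $db{"$parent_md5:count"});
    $db{"$parent_md5:$n"}     = $child_md5;        # one small record per child
    $db{"$parent_md5:count"}  = pack('N', $n + 1); # same-size overwrite
    return $n;
}

# Hypothetical helper: read the children back in insertion order.
sub children {
    my ($parent_md5) = @_;
    return () unless parent_exists($parent_md5);
    my $n = unpack('N', $db{"$parent_md5:count"});
    return map { $db{"$parent_md5:$_"} } 0 .. $n - 1;
}
```

The obvious cost is that %seen_parent grows with the number of distinct parents, which matters on a box with fixed resources, so it may need a size cap or periodic flush. Reading the full child list also becomes one fetch per child instead of one fetch plus a split, so this only wins if inserts are far more common than full listings.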