(tye)Re: Perl Performance Question
by tye (Sage) on Jun 13, 2001 at 19:49 UTC
If there is one thing you can say today, it is "disk is cheap".
I'd just have the collector write the data to disk files with sequence numbers or dates in their names. So either set a maximum file size of 400MB (for example) and just go to the next sequence number when you hit that, or go to a new file every hour or N minutes.
Then have a separate process that extracts data from these files, summarizes it, stores it in a more permanent place, and finally deletes the file when it is sure that both it and the file writer are done with it (or have a separate step that deletes files so you can recover if you find a bug in the analysis or can purge unanalyzed files if the backlog gets really, really huge).
I'd think that any other scheme is going to be pretty vulnerable to loss of data.
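The rotating-file queue described above can be sketched in a few lines. This is a minimal illustration, not tye's actual code: the `queue.NNNN` naming and the tiny size limit are assumptions chosen so the demo rolls over quickly; a real collector would use something like 400 * 1024 * 1024.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a sequence-numbered disk queue: append records to
# queue.NNNN files and roll to the next file once the current one
# passes a size limit.  MAX_BYTES is shrunk for demonstration.
my $MAX_BYTES = 64;
my $seq       = 0;
my $bytes     = 0;
my $fh;

sub open_next_file {
    close $fh if $fh;
    $seq++;
    $bytes = 0;
    my $name = sprintf "queue.%04d", $seq;
    open $fh, '>>', $name or die "open $name: $!";
}

sub write_record {
    my ($record) = @_;
    open_next_file() if !$fh || $bytes >= $MAX_BYTES;
    print {$fh} $record, "\n";
    $bytes += length($record) + 1;
}

write_record("flow record $_") for 1 .. 10;
close $fh;
print "last file: ", sprintf("queue.%04d", $seq), "\n";
```

The separate analysis process would then pick up any file whose sequence number is below the one currently being written.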
-
tye
(but my friends call me "Tye")
This is something I was thinking of. Hopefully, if the "processor" can't keep up with the "gatherer," it could make up for lost time during off-peak hours.
I implemented this "disk queue" system and found it very slow, even using a memory-based /tmp filesystem (on Solaris 2.6). I could only process about 200 flows per second, as opposed to the ~1,000 per second being gathered. That's a pretty extreme backlog; I don't know if I would even be able to drain it during off-peak times. I didn't see a slowdown in gathering while the "processor" was running, though, which is something I had wondered about when writing to disk.
Re: Perl Performance Question
by Henri Icarus (Beadle) on Jun 13, 2001 at 19:41 UTC
The problem with that is that if he is dropping packets, his machine isn't fast enough to process them in real time. That means the pipe will eventually fill up and writes to it will block, causing him to miss the packets sent while the write is blocked.
Re: Perl Performance Question
by dragonchild (Archbishop) on Jun 13, 2001 at 19:30 UTC
One option would be to write a collector script which, for each packet received, forks an analyzer script that takes the packet, does stuff to it, and then saves the results to a MySQL DB. At that point, you can have whatever daemons you want look at that DB, regardless of how it is populated.
I'm not fully conversant with how fork works, but I'm sure a number of people here are. Plus, you could just play with it. :)
I've tried this, but I seemed to take a relatively bad performance hit on each fork. This is why, in the current design, I collect 1,000 packets before forking, losing a few packets out of each thousand.
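A minimal sketch of that batch-fork approach, hedged: the batch size is shrunk from 1,000 for the demo, and the child only stands in for the real work (the parsing and MySQL inserts are outside this sketch).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Buffer packets and fork one child per batch, rather than one per
# packet.  The child would normally parse the batch and write it to
# the database; here it just exits.
my $BATCH = 5;                 # 1,000 in the real collector
my @buffer;
my $forks = 0;

sub flush_batch {
    return unless @buffer;
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: process @buffer (DBI inserts would go here), then exit.
        exit 0;
    }
    $forks++;
    @buffer = ();              # parent drops the batch, keeps collecting
}

for my $n (1 .. 12) {          # stand-in for the capture loop
    push @buffer, "packet $n";
    flush_batch() if @buffer >= $BATCH;
}
flush_batch();                 # flush the partial final batch
1 while wait() != -1;          # reap all children
print "forked $forks children\n";
```

The fork cost is paid once per batch instead of once per packet, which is the trade-off the poster describes.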
Re: Perl Performance Question
by mikfire (Deacon) on Jun 14, 2001 at 09:53 UTC
Just to throw my two pennies into the mix, I am not surprised that either forking for every packet or implementing the disk queue was slow. Both forking and disk I/O have high overheads. I really think your initial solution was a pretty fair idea.
If this is still too slow, I might suggest a three-process approach. The packet catcher will spawn a child every 1,000 packets. The child will spit the packets to disk while the catcher gets back to work. A third process watches for new files to be created (maybe naming each file with the child's PID so you would know when it exited and the file was complete) and does the parsing. This would gain some speed since the third process could keep a permanent connection to the database; DBI->connect is slow. It also makes this almost nightmarishly complex.
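One hedged way to do that hand-off (a variant of the PID-naming idea, not mikfire's exact scheme) is to have each child write under a temporary name and rename when finished; rename is atomic on POSIX filesystems, so the parsing process never sees a half-written file. The `spool.*` names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The child writes its batch under a .tmp name and renames it to
# .done on completion.  The parser can safely pick up any *.done
# file it finds, since rename is atomic.
sub spool_batch {
    my @records = @_;
    my $tmp  = "spool.$$.tmp";
    my $done = "spool.$$.done";
    open my $fh, '>', $tmp or die "open $tmp: $!";
    print {$fh} "$_\n" for @records;
    close $fh or die "close $tmp: $!";
    rename $tmp, $done or die "rename $tmp: $!";
    return $done;
}

my $file = spool_batch("flow 1", "flow 2", "flow 3");
print "spooled $file\n";
```

This avoids the parser having to track child PIDs at all; the file's suffix carries the completion signal.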
Just brainstorming now: what if you were to use one of the IPC shared-memory modules (since I just offered an answer using these)? With a ring buffer in the shared memory, the parent and child could work asynchronously. It would also eliminate some significant overhead, as the child could hold a more or less permanent connection open to the database.
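The ring buffer floated above could look something like this, using core IPC::SysV. This is a hedged, single-process demo of the slot arithmetic only; a real producer/consumer pair would fork after `shmget` (so both sides share the segment id) and guard the counters with a semaphore (e.g. IPC::Semaphore). Slot size and count are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IPC::SysV qw(IPC_PRIVATE IPC_CREAT IPC_RMID);

# Fixed-size slots in a SysV shared-memory segment, with two packed
# 32-bit head/tail counters at the front of the segment.
my $SLOTS   = 4;
my $SLOT_SZ = 32;
my $HDR     = 8;

my $id = shmget(IPC_PRIVATE, $HDR + $SLOTS * $SLOT_SZ, IPC_CREAT | 0600);
die "shmget: $!" unless defined $id;
shmwrite($id, pack("NN", 0, 0), 0, $HDR) or die "shmwrite: $!";

sub ring_put {
    my ($rec) = @_;
    shmread($id, my $hdr, 0, $HDR) or die "shmread: $!";
    my ($head, $tail) = unpack "NN", $hdr;
    die "ring full" if $head - $tail >= $SLOTS;
    shmwrite($id, pack("A$SLOT_SZ", $rec),
             $HDR + ($head % $SLOTS) * $SLOT_SZ, $SLOT_SZ)
        or die "shmwrite: $!";
    shmwrite($id, pack("N", $head + 1), 0, 4) or die "shmwrite: $!";
}

sub ring_get {
    shmread($id, my $hdr, 0, $HDR) or die "shmread: $!";
    my ($head, $tail) = unpack "NN", $hdr;
    return undef if $tail >= $head;    # buffer empty
    shmread($id, my $rec, $HDR + ($tail % $SLOTS) * $SLOT_SZ, $SLOT_SZ)
        or die "shmread: $!";
    shmwrite($id, pack("N", $tail + 1), 4, 4) or die "shmwrite: $!";
    $rec =~ s/\s+$//;                  # strip pack("A...") padding
    return $rec;
}

ring_put("flow $_") for 1 .. 3;
my @out;
push @out, ring_get() for 1 .. 3;
print "$_\n" for @out;
shmctl($id, IPC_RMID, 0) or die "shmctl: $!";
```

Without the semaphore this is racy between two processes; the demo only shows why the scheme avoids both fork and disk overhead on the hot path.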
mikfire
Re: Perl Performance Question
by mattr (Curate) on Jun 14, 2001 at 11:46 UTC
What kind of processing are you doing? Could you do it a lot faster if you handled a bunch of packets at once?

You need a large FIFO buffer filled by a capture process that is niced to run fast enough to handle your I/O requirements. I should think Perl could handle it if you are not asking for anything crazy. You could probably even dump whole blocks of packets into MySQL records; they can get very large. I also glanced at PDL::IO::FastRaw, which suggests mmapping a file for faster access, but that seems like overkill.

Then another process would come in periodically (or more nicely) to service that buffer, doing the data reduction and analysis you need, assuming that this is necessary because of a long capture session. It sounds like right now you are getting caught in overhead. One thing I can say is that you might save a lot of time if you can get MySQL to do the reduction on a lot of records at once and store the results in a separate table. Another thing you could look at is using study() before running a batch. You could also look for a module which runs batch processing in C.
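The batch reduction suggested above, sketched in plain Perl with made-up flow records: total bytes per src/dst pair across a whole block at once. In MySQL the equivalent would be a single `INSERT ... SELECT src, dst, SUM(bytes) ... GROUP BY src, dst` into a summary table, rather than one statement per flow.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Aggregate a block of flows in one pass: sum bytes per src/dst pair.
# The records here are fabricated for illustration.
my @flows = (
    [ '10.0.0.1', '10.0.0.9', 1200 ],
    [ '10.0.0.1', '10.0.0.9',  800 ],
    [ '10.0.0.2', '10.0.0.9',  500 ],
);

my %bytes_by_pair;
for my $flow (@flows) {
    my ($src, $dst, $bytes) = @$flow;
    $bytes_by_pair{"$src>$dst"} += $bytes;
}

printf "%s %d\n", $_, $bytes_by_pair{$_} for sort keys %bytes_by_pair;
```

Whether the reduction lives in Perl or in SQL, the point is the same: touch the database once per batch, not once per flow.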
Re: Perl Performance Question
by Spudnuts (Pilgrim) on Jun 13, 2001 at 23:06 UTC
Would MRTG work instead? It's a pretty nice traffic grapher; it's also free and fairly easy to set up.
No, but I'm using it successfully to complement this information. These flows are src/dst IP pairs, and can be used to, among other things, track data usage by IP address for billing or track security violations.
Re: Perl Performance Question
by Tuna (Friar) on Jun 14, 2001 at 03:30 UTC
My job is to collect, process, and analyze NetFlow stats for a Tier-1 provider. I accomplish this using cflowd and Perl. As a typical collection for me usually involves tens of thousands of flows per second, I can emphatically say that Perl is in no way incapable of handling the flow rate you are describing. I, too, am using netmatrix aggregation. And I'm on the cusp of fully automating flow collection from about 25,000 interfaces worldwide! Msg me for some more details, if you care to.
Re: Perl Performance Question
by Mungbeans (Pilgrim) on Jun 14, 2001 at 14:44 UTC
You may be able to multithread this without using fork. I don't know what your data looks like, so this may or may not work.
Architecture: one master co-ordinating process, plus an arbitrary number of children (depending on CPUs and OS) that do the work.
- A packet comes in; the master assigns it to a child.
- The child reads the packet and processes it, dumping the output in a central repository.
The key bit is that the children are always alive (you don't launch them with fork for each packet, since that has a start-up hit), but they stay quiescent unless they have something to do. The communication between the master and the child processes needs to be very fast (disk I/O is probably too slow), but you could use IPC (Unix interprocess communication) between the processes, which I think is faster.
If you keep losing packets, add more children. This should work well if you have multiple CPUs on Unix, though you will start to get processor-bound with too many children.
Caveat: I haven't done this myself; I've seen it done in Informix 4GL, which is much less functional than Perl. There should be some CPAN modules which look after the IPC for you.
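A hedged sketch of that pre-forked worker idea, shrunk to a single child: the master forks one long-lived worker at startup and hands it packets over a pipe, so dispatch costs a pipe write rather than a fork. The worker's "processing" is a stand-in (it just reports the packet length); a real version would pre-fork several children and deal packets out among them.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;

# Master keeps a work pipe to a persistent child and reads results
# back on a second pipe.
pipe(my $work_r, my $work_w) or die "pipe: $!";
pipe(my $res_r,  my $res_w)  or die "pipe: $!";

my $pid = fork();
die "fork: $!" unless defined $pid;

if ($pid == 0) {                 # child: long-lived worker
    close $work_w;
    close $res_r;
    $res_w->autoflush(1);
    while (my $line = <$work_r>) {
        chomp $line;
        print {$res_w} length($line), "\n";   # stand-in for real work
    }
    exit 0;
}

close $work_r;                   # parent: master process
close $res_w;
$work_w->autoflush(1);

my @replies;
for my $packet ('flow one', 'longer flow two') {
    print {$work_w} "$packet\n";
    chomp(my $reply = <$res_r>);
    push @replies, $reply;
}
close $work_w;                   # EOF lets the worker exit
waitpid $pid, 0;
print "worker replied: @replies\n";
```

For demonstration the master waits for each reply in turn; in practice the children would drain their pipes asynchronously so the master never blocks on a slow worker.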