Parse data into large number of output files.
by Rhys (Pilgrim)
on Sep 29, 2004 at 01:32 UTC
Rhys has asked for the
wisdom of the Perl Monks concerning the following question:
Okay, the node "How can I create an array of filehandles?" was a lot of help, since it shows how I can keep a bunch of files open at once (combined with a hash using filenames as keys, very cool).
Too bad it's not what I need. Let's back up, shall we?
I have a log file. The data goes from UDP port 162 (SNMP traps) to snmptrapd to syslog-ng and into a file. The file looks roughly like this:
I have seven input files, some gzipped, some not. Since they're log files, I can use (stat $filename)[9] to get the last-modified time and sort on that, which keeps the log entries in order without having to mess with the timestamps in the log. Matching /\]:\s(\S+):/ gets the IP address of the original trap sender.
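Here is a minimal sketch of that reading side, assuming the gzipped files end in .gz and that a gzip binary is on the PATH. The sub names (files_by_mtime, open_log, sender_of) are made up for illustration; only the regex comes from the description above.

```perl
use strict;
use warnings;

# Return the given files ordered oldest-first by last-modified time.
sub files_by_mtime {
    my @files = @_;
    my %mtime;
    for my $file (@files) {
        my @st = stat($file) or die "Cannot stat $file: $!";
        $mtime{$file} = $st[9];    # element 9 of stat() is the mtime
    }
    return sort { $mtime{$a} <=> $mtime{$b} } @files;
}

# Open a log for reading, decompressing through gzip when the
# name ends in .gz (an assumption about how the files are named).
sub open_log {
    my ($file) = @_;
    my $fh;
    if ($file =~ /\.gz\z/) {
        open($fh, '-|', 'gzip', '-dc', $file)
            or die "Cannot run gzip on $file: $!";
    }
    else {
        open($fh, '<', $file) or die "Cannot open $file: $!";
    }
    return $fh;
}

# Return the original trap sender from one log line, or undef
# when the line does not match.
sub sender_of {
    my ($line) = @_;
    return $line =~ /\]:\s(\S+):/ ? $1 : undef;
}
```

The main loop would then be something like `for my $file (files_by_mtime(@ARGV)) { my $fh = open_log($file); while (my $line = <$fh>) { my $ip = sender_of($line); ... } }`.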
Sounds easy, right? Here's the hard part: For each trap sender, I want to write an HTML file with only the traps for that sender. If there were only a few senders, I could just open the file, write the HTML 'top', add <pre>, then put the filehandle into a hash, and just write to the appropriate filehandle as the lines are parsed.
The problem is that there can be hundreds of original senders. Having that many filehandles open at once is likely to run into the per-process limit on open files. The input data is about 100MB, so I'd rather not parse the data more than once if I can get away without it (although I wouldn't mind going through them twice if a first pass would generate some useful meta-information).
SO... What's a good way to deal with this? As it is, I may be faced with just opening the correct output file based on the sender IP, perhaps writing the HTML 'top', writing a line, closing it, and on to the next line. All that opening and closing files seems bad somehow, so I'm seeking the wisdom of the Monastery.
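One middle ground between a zillion open filehandles and an open/close per line might be to buffer lines per sender in memory and append them in batches. The following is only a sketch under assumptions: the output directory, the flush threshold, the HTML 'top' text and all sub names are hypothetical, and it assumes the sender string is safe to use as a file name.

```perl
use strict;
use warnings;

our $OUT_DIR   = '.';       # hypothetical output directory
our $MAX_LINES = 10_000;    # hypothetical flush threshold (lines held in memory)

my %buffer;                 # sender IP => array ref of pending lines
my %started;                # sender IP => true once the HTML 'top' is written
my $pending = 0;            # total number of buffered lines

# Escape the characters that would otherwise be read as HTML markup.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Queue one log line for a sender; flush everything once the
# in-memory total reaches the threshold.
sub add_line {
    my ($ip, $line) = @_;
    push @{ $buffer{$ip} }, $line;
    flush_all() if ++$pending >= $MAX_LINES;
}

# Write out all buffered lines, one file open at a time.
sub flush_all {
    for my $ip (keys %buffer) {
        my $path = "$OUT_DIR/$ip.html";
        # Truncate on the first write for a sender, append afterwards.
        my $mode = $started{$ip} ? '>>' : '>';
        open(my $out, $mode, $path) or die "Cannot open $path: $!";
        print {$out} "<html><body><h1>Traps from $ip</h1>\n<pre>\n"
            unless $started{$ip}++;
        print {$out} escape_html($_) for @{ $buffer{$ip} };
        close($out) or die "Cannot close $path: $!";
    }
    %buffer  = ();
    $pending = 0;
}

# Flush what is left and close the HTML in every file written so far.
sub finish_all {
    flush_all();
    for my $ip (keys %started) {
        my $path = "$OUT_DIR/$ip.html";
        open(my $out, '>>', $path) or die "Cannot open $path: $!";
        print {$out} "</pre></body></html>\n";
        close($out) or die "Cannot close $path: $!";
    }
}
```

Each parsed line would go through add_line($ip, $line), with one finish_all() call after the last input file. Only one output file is open at any moment, and the number of open/close cycles per sender drops from one per line to one per flush.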
A second possibility - if they won't be used often - is to pull a list of IPs from the log files and dynamically write CGI scripts as the links instead of HTML files. The CGIs, when accessed, would `zcat logs.gz | grep <ip>`, basically generating the list of traps for a given IP at runtime. Quick to make, slow (and expensive) to use very often.
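The on-demand variant could also do the filtering in Perl rather than through zcat and grep. This is a rough sketch only: traps_for is a made-up name, the .gz naming and the gzip binary are assumptions, and the CGI wrapper around it is left out.

```perl
use strict;
use warnings;

# Return every log line in @files whose original sender is exactly $ip.
sub traps_for {
    my ($ip, @files) = @_;
    my @hits;
    for my $file (@files) {
        my $in;
        if ($file =~ /\.gz\z/) {
            # Assumes a gzip binary on the PATH; the list form of
            # open avoids going through a shell.
            open($in, '-|', 'gzip', '-dc', $file)
                or die "Cannot run gzip on $file: $!";
        }
        else {
            open($in, '<', $file) or die "Cannot open $file: $!";
        }
        while (my $line = <$in>) {
            # Compare the captured sender with eq, not as a pattern.
            push @hits, $line
                if $line =~ /\]:\s(\S+):/ && $1 eq $ip;
        }
        close($in);
    }
    return @hits;
}
```

Comparing with eq means a request for 10.1.1.1 does not also pick up 10.1.1.10, which a plain grep on the address would, and the requested IP never reaches a shell command line. The cost noted above is unchanged: every request still reads all of the logs.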
So what do you think? Easy way out of this? Should I just risk opening a zillion filehandles? Should I just open them and close them one at a time? Suggestions are welcome.