
Parse data into large number of output files.

by Rhys (Pilgrim)
on Sep 29, 2004 at 01:32 UTC ( #394813=perlquestion )
Rhys has asked for the wisdom of the Perl Monks concerning the following question:

Okay, "How can I create an array of filehandles?" was a lot of help, since it shows how I can keep a bunch of files open at once (combined with a hash using filenames as keys, very cool).

Too bad it's not what I need. Let's back up, shall we?

I have a log file. From UDP port 161 (SNMP traps) to snmptrapd to syslog-ng and into a file. File looks roughly like this:

Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip1>: Trap msga.
Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip3>: Trap msgg.
Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip4>: Trap msgd.
Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip1>: Trap msge.
Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip2>: Trap msga.

I have seven input files, some gzipped, some not. Since they're log files, I can use (stat "$filename")[9] to get the last modified time. Sort those to keep the log entries in order without having to mess with the timestamps in the log. Match /\]:\s(\S+):/ to get the IP address of the original trap sender.
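A minimal sketch of those three steps (ordering by modification time, reading gzipped or plain files, and pulling out the sender). The sub names `by_mtime`, `open_log`, and `sender_of` are made up for illustration, and `open_log` assumes an external `gzip` is on the PATH:

```perl
use strict;
use warnings;

# Return the given paths ordered oldest-first by last-modified time.
sub by_mtime {
    my @paths = @_;
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  {
               my $mtime = (stat $_)[9];
               defined $mtime or die "stat $_: $!";
               [ $mtime, $_ ];
           } @paths;
}

# Open a log for reading; *.gz files are decompressed through gzip
# (assumption: a gzip binary is available on the PATH).
sub open_log {
    my ($path) = @_;
    my $fh;
    if ($path =~ /\.gz\z/) {
        open $fh, '-|', 'gzip', '-dc', $path or die "gzip $path: $!";
    }
    else {
        open $fh, '<', $path or die "open $path: $!";
    }
    return $fh;
}

# Extract the original trap sender from one log line, or undef.
sub sender_of {
    my ($line) = @_;
    return $line =~ /\]:\s(\S+):/ ? $1 : undef;
}
```

Looping over `by_mtime(@files)`, reading each handle from `open_log`, and calling `sender_of` on every line gives the (sender, line) stream that everything below has to route somewhere.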

Sounds easy, right? Here's the hard part: For each trap sender, I want to write an HTML file with only the traps for that sender. If there were only a few senders, I could just open the file, write the HTML 'top', add <pre>, then put the filehandle into a hash, and just write to the appropriate filehandle as the lines are parsed.

The problem is that there can be hundreds of original senders. Having that many filehandles open is certain to be problematic. The input data is about 100MB, so I'd rather not parse the data more than once if I can get away without it (although I wouldn't mind going through them twice if a first pass would generate some useful meta-information).

SO... What's a good way to deal with this? As it is, I may be faced with just opening the correct output file based on the sender IP, perhaps writing the HTML 'top', writing a line, closing it, and on to the next line. All that opening and closing files seems bad somehow, so I'm seeking the wisdom of the Monastery.

A second possibility - if they won't be used often - is to pull a list of IPs from the log files and dynamically write CGI scripts as the links instead of HTML files. The CGIs, when accessed, would `zcat logs.gz | grep <ip>`, basically generating the list of traps for a given IP at runtime. Quick to make, slow (and expensive) to use very often.

So what do you think? Easy way out of this? Should I just risk opening a zillion filehandles? Should I just open them and close them one at a time? Suggestions are welcome.


Replies are listed 'Best First'.
Re: Parse data into large number of output files.
by davido (Archbishop) on Sep 29, 2004 at 01:42 UTC

    Perhaps you could do this in a couple of stages. In stage one, read all of the input files, and build up a database with a column for "trap sender" and a column for the logfile entry, each row in the DB representing an entry from one of the input logfiles. It takes only one database connection to build up the database.

    Then in the second pass, prepare a query such as "SELECT logentry FROM sometable WHERE trap_sender=?". Then one by one open a file for a particular user, execute the query with that user's name, and spill to that user's logfile the entries returned by fetchrow.

    That way, you never have to hold all 100mb in memory at once, you never have to hold hundreds of open filehandles, you don't have to suffer the poor performance of opening and closing filehandles hundreds of times over and over again, and you get some practice with DBI. DBD::SQLite would be a great lightweight database on which to build the implementation of this strategy.
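One way the two stages might look, assuming DBI and DBD::SQLite are installed. The table name `traps`, the sub names, and the bare-bones HTML wrapper are placeholders, and the sender regex is the one from the original question:

```perl
use strict;
use warnings;
use DBI;

# Stage one: load every (sender, line) pair into a single table.
sub load_traps {
    my ($dbh, @lines) = @_;
    $dbh->do('CREATE TABLE IF NOT EXISTS traps (trap_sender TEXT, logentry TEXT)');
    $dbh->do('CREATE INDEX IF NOT EXISTS traps_sender ON traps (trap_sender)');
    my $ins = $dbh->prepare(
        'INSERT INTO traps (trap_sender, logentry) VALUES (?, ?)');
    $dbh->begin_work;    # one transaction keeps bulk inserts fast
    for my $line (@lines) {
        next unless $line =~ /\]:\s(\S+):/;
        $ins->execute($1, $line);
    }
    $dbh->commit;
}

# Stage two: one output file open at a time, filled from a query.
sub write_reports {
    my ($dbh, $outdir) = @_;
    my $senders = $dbh->selectcol_arrayref(
        'SELECT DISTINCT trap_sender FROM traps');
    my $sel = $dbh->prepare(
        'SELECT logentry FROM traps WHERE trap_sender = ? ORDER BY rowid');
    for my $sender (@$senders) {
        open my $out, '>', "$outdir/$sender.html"
            or die "open $outdir/$sender.html: $!";
        print {$out} "<html><body><pre>\n";
        $sel->execute($sender);
        while (my ($entry) = $sel->fetchrow_array) {
            print {$out} $entry;
        }
        print {$out} "</pre></body></html>\n";
        close $out or die "close: $!";
    }
}
```

A handle from something like `DBI->connect('dbi:SQLite:dbname=traps.db', '', '', { RaiseError => 1 })` would be passed to both subs. For a real 100MB input, `load_traps` would be fed line by line rather than from a list held in memory.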


Re: Parse data into large number of output files.
by BrowserUk (Pope) on Sep 29, 2004 at 01:46 UTC

    You may find FileCache useful.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Parse data into large number of output files.
by Fletch (Chancellor) on Sep 29, 2004 at 01:46 UTC

    The real RDBMS suggestion is probably going to be the best way, but if you want to give it a go with multiple files check out FileCache.
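For reference, a rough sketch of how the core FileCache module could be applied here. The sub names, the `%started` bookkeeping, and the HTML wrapper are illustrative only, and `maxopen => 4` is deliberately tiny just to show the cache closing and reopening handles:

```perl
use strict;
use warnings;
use FileCache maxopen => 4;    # tiny cap, chosen only to force evictions

my %started;                   # paths that already have their HTML 'top'

# Append one line to a sender's file through FileCache.
sub write_trap {
    my ($dir, $sender, $line) = @_;
    my $path = "$dir/$sender.html";
    no strict 'refs';          # FileCache uses the path string as the handle
    cacheout $path;            # '>' on first open, '>>' whenever it reopens
    print {$path} "<html><body><pre>\n" unless $started{$path}++;
    print {$path} $line;
}

# Write the HTML 'bottom' to every file and close it for good.
sub finish_all {
    no strict 'refs';
    for my $path (sort keys %started) {
        cacheout $path;        # reopen if the cache had closed it
        print {$path} "</pre></body></html>\n";
        FileCache::cacheout_close($path);
    }
    %started = ();
}
```

The main loop would call `write_trap($outdir, $sender, $line)` for each parsed line and `finish_all()` once at the end. FileCache addresses handles by path name, hence the `no strict 'refs'`.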

Re: Parse data into large number of output files.
by tachyon (Chancellor) on Sep 29, 2004 at 02:34 UTC

    The problem is that there can be hundreds of original senders. Having that many filehandles open is certain to be problematic....Should I just risk opening a zillion filehandles?

    On what do you base your supposition that having lots of file descriptors open is a problem? What risks do you perceive? You assert, but do you test? By default on Win2K you can have 509, on Linux 1021. Three handles are used for STDIN, STDOUT, and STDERR, so there are 512 and 1024 handles respectively available.

    C:\tmp>perl -e "open ++$fh, '>', $fh or die qq'$fh $!\n' for 1..$ARGV[0]" 512
    510
    [root@devel3 tmp]# perl -e 'open ++$fh, ">", $fh or die "$fh $!\n" for 1..$ARGV[0]' 1024
    1022 Too many open files

    But so what? Just increase the number if you need to. On Linux:

    [root@devel3 tmp]# ulimit -n 65535
    [root@devel3 tmp]# perl -e 'open ++$fh, ">", $fh or die "$fh $!\n" for 1..$ARGV[0]' 2048
    [root@devel3 tmp]# ls 204?
    2040 2041 2042 2043 2044 2045 2046 2047 2048
    [root@devel3 tmp]#

    It is not actually the number of open file handles that will cause an issue. Depending on the underlying file system, you will start to get issues if you go over 10,000-20,000 files in a single directory with ext2/3. ReiserFS does not care.
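If the script should check the limit itself rather than find out by dying, something along these lines might do on a Unix-like system. `max_open_files` and `fits_in_limit` are hypothetical helper names, and the headroom of 16 is an arbitrary margin, not a recommendation:

```perl
use strict;
use warnings;
use POSIX qw(sysconf _SC_OPEN_MAX);

# Report the per-process descriptor limit, or undef where the
# platform does not expose a usable value through sysconf().
sub max_open_files {
    my $limit = sysconf(_SC_OPEN_MAX);
    return (defined $limit && $limit > 0) ? $limit : undef;
}

# True when $wanted handles fit under the limit, leaving headroom
# for STDIN/STDOUT/STDERR, the input log, and anything else open.
sub fits_in_limit {
    my ($wanted, $headroom) = @_;
    $headroom = 16 unless defined $headroom;    # arbitrary safety margin
    my $limit = max_open_files();
    return (defined $limit && $wanted + $headroom <= $limit) ? 1 : 0;
}
```

The value reported reflects whatever `ulimit -n` was in effect when the process started.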



      The biggest problem I've encountered with maintaining large numbers of file handles open is that it tends to cause the filesystem caching to work against you rather than with you.

      On NTFS, you can use the native CreateFile() API and provide extra information about the type of use you intend to make of the file. Using FILE_FLAG_NO_BUFFERING, using your own buffering and multi-sector sized writes can prove beneficial in alleviating this.

      Most of the limitations are embodied within the (almost POSIX) compliant C-runtime semantics. It's quite probable that bypassing these on other filesystems could also be beneficial, but it probably requires fairly detailed knowledge of the FS concerned.


      Two comments here:

      • First, you assumed that we were talking about a physical limitation here, namely how many file handles the OS allows. That's one thing, but not the only thing we are talking about. I agree it is good to have this in mind.
      • Second, with this physical limitation in mind, you most likely don't want to go there, but rather stay somewhere below it. I am not saying you suggested going there; this is more a comment for the OP in general. He has to test and find a reasonable number.
Re: Parse data into large number of output files.
by steves (Curate) on Sep 29, 2004 at 01:43 UTC

    If memory isn't an issue, my first choice would be to hash each line by sender, into a hash whose keys are senders and whose values are arrays of lines per sender. At the end of the program, iterate over each sender, opening a file, writing the sender's lines to it, and closing the file.
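A short sketch of that approach, using the sender regex from the original question. The sub names and the minimal HTML wrapper are placeholders:

```perl
use strict;
use warnings;

# Group log lines by trap sender: sender => [ lines in arrival order ].
sub group_by_sender {
    my ($in) = @_;    # any readable filehandle
    my %lines_for;
    while (my $line = <$in>) {
        next unless $line =~ /\]:\s(\S+):/;
        push @{ $lines_for{$1} }, $line;
    }
    return \%lines_for;
}

# Write one HTML file per sender, with only one file open at a time.
sub write_grouped {
    my ($lines_for, $outdir) = @_;
    for my $sender (sort keys %$lines_for) {
        my $path = "$outdir/$sender.html";
        open my $out, '>', $path or die "open $path: $!";
        print {$out} "<html><body><pre>\n",
                     @{ $lines_for->{$sender} },
                     "</pre></body></html>\n";
        close $out or die "close $path: $!";
    }
}
```

Memory use grows with the total size of the input, which is the trade-off this suggestion accepts.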

Re: Parse data into large number of output files.
by Rhys (Pilgrim) on Sep 29, 2004 at 15:48 UTC
    I have read all of the suggestions and have the following comments:

    1. RDBMS: I will probably look into this in the near future. Slurping the data out of the files into an RDBMS would solve many problems. I should even be able to convince syslog-ng to stuff the messages into the DB as they arrive and do away with the flat files altogether (or use them merely as a fall-back archive). However, all of that is future development, so set that aside for now.

    2. Read into hash of lists, then print: This was actually my first thought, but the size of the input files is not controlled by anything. If there is a nasty network event, multiple GB of trap data could very easily be written which would definitely chew up all of my available memory (and it's a network management server, so that's not acceptable). Basically, I have a wildly unknown max value here, so I can't trust it.

      I suppose I could write out chunks of N*1024 messages, though, which would limit the open/close to every (interval) instead of every message. But since the number of senders can still cause huge memory usage (where each sender chews up N*1024-1 messages - which won't trigger a write - multiplied by M senders...), this probably still isn't the best solution for my case.
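One variation that sidesteps the "N*1024-1 times M senders" worst case is to cap the total number of buffered lines across all senders instead of counting per sender. This is a sketch of that variation, not of the per-sender scheme described above. The names are made up, and the HTML 'top' is left out for brevity:

```perl
use strict;
use warnings;

my %buffer;          # sender => [ pending lines ]
my $buffered = 0;    # total pending lines across all senders

# Append every pending line to its sender's file, then empty the buffer.
sub flush_buffer {
    my ($outdir) = @_;
    for my $sender (keys %buffer) {
        my $path = "$outdir/$sender.html";
        open my $out, '>>', $path or die "open $path: $!";
        print {$out} @{ $buffer{$sender} };
        close $out or die "close $path: $!";
    }
    %buffer   = ();
    $buffered = 0;
}

# Queue one line; flush once the total (not per-sender) count hits $max.
sub buffer_line {
    my ($outdir, $sender, $line, $max) = @_;
    push @{ $buffer{$sender} }, $line;
    flush_buffer($outdir) if ++$buffered >= $max;
}
```

Memory is then bounded by `$max` lines no matter how many senders show up, and each flush opens each pending sender's file once. A final `flush_buffer($outdir)` is needed after the last input line.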

    3. Just open a zillion files: Again, the number of trap senders can vary wildly, so I have no way to trust that this number won't grow beyond any arbitrary limit I set. The system is Linux, and the last count on the number of senders is 584, so that's well below the current limit, but that can change in a hurry, and I'd rather just write this once. 64K would probably be big enough, but then what are the consequences of having that many FHs open at once?

      Basically, I'm loath to go this route because I don't have any hard controls or expectations for the number of senders, and I don't want the thing to crash the first time the number of senders eclipses the FH limit.

    4. Cache of open files: This is another immediately-viable option. Basically, it puts a hard limit on something that otherwise has none. The dark side of this one is all the code required to maintain the cache itself. Shouldn't be too evil, though, and should chew up significantly less memory than any solution that involves buffering the messages in memory.

    5. Just do it one line at a time: It could be argued that I should just open the file, write a line, and close the file, and see what the performance is like. If it doesn't suck, stop worrying about it. I have no argument against this (yet). The first version of the code will probably do this, since the FH caching algorithm can be easily added and it'll allow me to both gauge the performance boost and test the REST of the code independently of this issue.
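For the record, option 5 could be as small as the following sketch. `append_trap` is a made-up name, and the "write the top if the file is empty" test is just one way to decide when the header is needed:

```perl
use strict;
use warnings;

# Append one line to the sender's file, opening and closing it each
# time. Returns true when the HTML 'top' was written by this call.
sub append_trap {
    my ($outdir, $sender, $line) = @_;
    my $path = "$outdir/$sender.html";
    my $new  = (-s $path) ? 0 : 1;    # missing or empty file needs the top
    open my $out, '>>', $path or die "open $path: $!";
    print {$out} "<html><body><pre>\n" if $new;
    print {$out} $line;
    close $out or die "close $path: $!";
    return $new;
}
```

Because all file access sits behind one sub, a handle cache can later be swapped in without touching the parsing code.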

    I still have to follow some of the links provided (such as the file caching one), so I haven't finished my analysis of this, but the suggestions have all been helpful. I'm trying to code this in a fairly paranoid way, just because I've had to re-write most of the code I didn't write that way, so I'm just trying to save time. :-)

    Thanks for the help, all. Much appreciated!


Re: Parse data into large number of output files.
by pg (Canon) on Sep 29, 2004 at 02:21 UTC

    My solution would be:

    • Keep a list of file handles, but limit the number of file handles you could open at the same time, say 100.
    • For each open file, keep a timestamp on it. Update the timestamp every time you write to it.
    • If you get a line and its related file is open, fine, just write out whatever you want.
    • If the related file is not open, go through the list, close the one that has gone unused the longest, and open the one you want.

    Well, you can come up with other ways, besides last access time, to rank your files: for example, the number of lines written. You have to try them out and find the best way of ranking.
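The scheme in the list above might be sketched like this. It uses a simple counter instead of a wall-clock timestamp (an assumption made so that ties within one second cannot occur), the names are illustrative, and 100 is only the example cap from the list:

```perl
use strict;
use warnings;

our $MAX_OPEN = 100;    # example cap from the list above; tune by experiment
my %handle;             # path => open filehandle
my %last_used;          # path => counter value at the last use
my $clock = 0;

# Return an append handle for $path, first closing the least
# recently used handle if the cache is already full.
sub handle_for {
    my ($path) = @_;
    unless ($handle{$path}) {
        if (scalar(keys %handle) >= $MAX_OPEN) {
            my ($oldest) = sort { $last_used{$a} <=> $last_used{$b} }
                           keys %handle;
            my $fh = delete $handle{$oldest};
            delete $last_used{$oldest};
            close $fh or die "close $oldest: $!";
        }
        open my $fh, '>>', $path or die "open $path: $!";
        $handle{$path} = $fh;
    }
    $last_used{$path} = ++$clock;
    return $handle{$path};
}

# Number of handles currently held open by the cache.
sub open_count { return scalar keys %handle }

# Close everything at the end of the run.
sub close_all {
    for my $path (keys %handle) {
        close $handle{$path} or die "close $path: $!";
    }
    %handle    = ();
    %last_used = ();
}
```

Writing then becomes `print { handle_for($path) } $line;`. Files are reopened in append mode, so nothing is lost when a handle is evicted and later needed again.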

      Why 100? Why not 200, 500, 1000, 10000? What makes you think 100 is a good number? If you can't answer that question then why suggest it? You are buying into an invalid assertion made by the OP about how many filehandles you can really have open.

        I guess there is a bit of a misunderstanding. You thought I was talking about the maximum number of file handles the OS allows, but I was not. This is not any sort of physical limitation. To me, it is not a good idea to create a list of potentially huge, unknown size, which is why I proposed this solution. You don't want to code for the unknown, and the unknown here is the amount of resources used. An easy way to resolve this is to limit the array size.

        I do take your comment positively, and you made a good point that this number should not be greater than the maximum the OS allows. However, I do not suggest reaching the maximum.

        The OP has to find a reasonable number by experimenting. I suggested ranking because you want to reduce the number of open/close operations.

        "If you can't answer that question then why suggest it?"

        Just to be frank... It is quite okay for you to rush to an assumption, which happened to be wrong in this case. Well, I do the same thing from time to time. But it was quite unnecessary for you to make comments that were not technical. But never mind, no big deal ;-)

Re: Parse data into large number of output files.
by Roger (Parson) on Sep 29, 2004 at 12:31 UTC
    I never have to worry about using up too much memory these days because my E2900 has 32GB of RAM, so I always build an enormous hash in memory. 100MB doesn't sound like too much data at all. I would build a hash in memory with the sender ID as the key and the messages in an array:

    my %log = (
        'user1' => [ 'msg1', 'msg2', ... ],
        'user2' => ...
    );

    Nowadays even a half-decent machine should have plenty of memory. I'd say just read into memory and write out one user at a time.

Node Type: perlquestion [id://394813]
Approved by davido