PerlMonks  

Parsing a log file

by Anonymous Monk
on Jul 12, 2000 at 23:54 UTC ( [id://22262]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks. I'm attempting to parse a log file that looks like the following excerpt (the actual file is thousands of lines long):

#Software: Microsoft Internet Information Server 4.0
#Version: 1.0
#Date: 2000-06-28 00:00:29
#Fields: time c-ip cs-method cs-uri-stem sc-status
00:00:29 192.168.20.50 GET /Plastic/ProductTop.html 200
00:00:29 192.168.20.50 GET /Plastic/images/ProductNavTopRt.jpg 304
00:00:29 192.168.20.50 GET /plastic/ProductNav.html 200
00:00:29 192.168.20.50 GET /plastic/images/ProductNav1b.jpg 304
00:00:29 192.168.20.50 GET /main.html 202

I understand how to open files and read individual lines from the file, and how to format the output, but I'm stuck on the parsing of individual sections of the line.

What I specifically am interested in is counting every file type (e.g. .gif, .html, .jpg) AND the number of times each file type occurs with each status code (e.g. 304, 200, 202). So for the above snippet the output would look something like:

File type    Code    Occurrences
.jpg         304     2
.html        200     2
.html        202     1

Of course, I would need to ignore any lines that do not have these codes (e.g. the header lines). I've considered that using an array might be the best solution, but am lost on the algorithm necessary to achieve this.

Thanks for any help or ideas in the right direction.

-bri-

Replies are listed 'Best First'.
Re: Parsing a log file
by Ovid (Cardinal) on Jul 13, 2000 at 00:06 UTC
    You'll have to do the sorting yourself, but here's what I came up with (just a quick hack):
    #!/usr/bin/perl -w
    foreach (<DATA>) {
        next if /^#/; # Oops! Forgot this the first time :(
        my @results = split;
        $type = $1 if $results[3] =~ /(\.\w+)$/;
        $code = $results[4];
        $output{$type}{$code}++;
    }
    foreach $type_key (keys %output) {
        foreach $code_key (keys %{$output{$type_key}}) {
            print "$type_key\t$code_key\t$output{$type_key}{$code_key}\n";
        }
    }
    __DATA__
    00:00:29 192.168.20.50 GET /Plastic/ProductTop.html 200
    00:00:29 192.168.20.50 GET /Plastic/images/ProductNavTopRt.jpg 304
    00:00:29 192.168.20.50 GET /plastic/ProductNav.html 200
    00:00:29 192.168.20.50 GET /plastic/images/ProductNav1b.jpg 304
    00:00:29 192.168.20.50 GET /main.html 202
    The output was as follows:
    .html   200     2
    .html   202     1
    .jpg    304     2
    Cheers,
    Ovid

    Update: I'm seeing responses to this node which are almost, but not quite, correct. If you read the question carefully, you'll notice that the type and code are not synonymous. You can't just lump them together and ++, nor can you simply count the instances of each type. Each type can have multiple codes, and it's the instances of each code per type that the poster was looking for.
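For the sorting that is left to the reader above, here is a minimal sketch. It assumes the counts already sit in a hash of hashes shaped like %output (type first, then code); the helper name sorted_rows is made up for illustration and is not part of Ovid's code.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten a type => code => count hash into sorted rows.
sub sorted_rows {
    my ($counts) = @_;
    my @rows;
    for my $type (sort keys %$counts) {
        # Status codes are numeric, so compare them with <=> rather than cmp.
        for my $code (sort { $a <=> $b } keys %{ $counts->{$type} }) {
            push @rows, [ $type, $code, $counts->{$type}{$code} ];
        }
    }
    return @rows;
}

# Sample counts matching the excerpt in the question.
my %output = (
    '.html' => { 200 => 2, 202 => 1 },
    '.jpg'  => { 304 => 2 },
);

print join("\t", @$_), "\n" for sorted_rows(\%output);
```

Sorting at print time keeps the counting loop untouched; only the two `keys` calls gain a `sort`.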

Re: Parsing a log file
by nuance (Hermit) on Jul 13, 2000 at 00:19 UTC

    You've said you know how to read the files and get the lines etc. so I'm just going to do the other bits.

    First you want to split the data to extract the path to the file and the number at the end. You can use split to do this:

    my @bits = split ' ', $line;

    This will give you an array with the path to the file in $bits[3] and the number in $bits[4]. Next you must extract the file type from the path to the file.

    $bits[3] =~ s/^.*\.(\w+)$/$1/;   # replace the whole path with just its extension

    I would then use a hash to store these:

    $count{$bits[3]}{$bits[4]}++;

    The first time you reference a type/number combination it will be created with a value of one; each time after that the number gets incremented, so the hash keeps a count of how many times each combination occurred. At the end of the script you can get the data with:

    foreach my $type (keys %count) {
        foreach (keys %{ $count{$type} }) {
            print "$type $_ $count{$type}{$_}\n";
        }
    }

    Nuance
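Putting the three steps above together, a rough sketch might look like this. The helper name count_types and the guard clauses (skipping header lines, short lines, and paths with no extension) are assumptions added for illustration, not part of the reply.

```perl
use strict;
use warnings;

# Hypothetical helper: count (file type, status code) pairs in a list of log lines.
sub count_types {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        next if $line =~ /^#/;          # skip the IIS header lines
        my @bits = split ' ', $line;
        next unless @bits >= 5;         # skip blank or truncated lines
        # Capture the extension rather than editing the path in place.
        my ($type) = $bits[3] =~ /\.(\w+)$/;
        next unless defined $type;      # e.g. a request for "/" has no extension
        $count{$type}{ $bits[4] }++;
    }
    return \%count;
}

my $demo_count = count_types(
    '#Fields: time c-ip cs-method cs-uri-stem sc-status',
    '00:00:29 192.168.20.50 GET /Plastic/ProductTop.html 200',
    '00:00:29 192.168.20.50 GET /main.html 202',
    '00:00:29 192.168.20.50 GET / 200',
);
for my $type (sort keys %$demo_count) {
    print "$type $_ $demo_count->{$type}{$_}\n"
        for sort keys %{ $demo_count->{$type} };
}
```

Skipping lines without an extension avoids counting them under whatever type the previous line happened to have.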

Re: Parsing a log file
by le (Friar) on Jul 13, 2000 at 00:07 UTC
    Maybe this will help you:
    my %suffix;
    open(LOG, "logfile") or die $!;
    while (<LOG>) {
        next if /^#/;
        $suffix{$1}++ if /.+(\.\w{3,4}\s\d{3})$/;
    }
    close LOG;
    for (sort keys %suffix) {
        print "$_ => $suffix{$_}\n";
    }
    This example supposes that the filetype consists of 3 or 4 word characters (\w) and that the codes consist of 3 digits. Of course, every filetype should have its own code.
Re: Parsing a log file
by ahunter (Monk) on Jul 13, 2000 at 00:16 UTC
    I think what you're looking for is to use a hash to store the values, and a regexp or two to parse the values:
    use strict;
    my %count = ();

    # Read the stuff from STDIN
    while (<STDIN>) {
        if (/GET .*(\.[^\s]+) ([0-9]{3})/) {
            my ($ftype, $reason) = ($1, $2);
            $count{"$ftype-$reason"}++;
        }
    }

    # Now print the stuff out
    foreach (keys(%count)) {
        if (/^(.*)-(.*)$/) {
            print "Filetype $1, reason $2 has a count of $count{$_}\n";
        }
    }
    See perlre for details on regular expressions and perldata for information about hashes.

    There is also an evil hacky way of doing this, which is much faster, but requires you to know which filetypes and reason codes you are looking for.

    use strict;
    {
        local $/ = undef; # Slurp mode
        my $file = <STDIN>;
        my $html = $file=~s/\.html 200//g;
        print "Count of HTML 200 is $html\n";
    }
    This works because s///g returns the number of substitutions made (or a false value when nothing matched), and is surprisingly fast at doing this on large files...
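If the slurped text has to survive the count, a global match in list context gives the same number without editing the string. This is a sketch of that alternative idiom, not what the reply above does; count_matches is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: count matches of a pattern without changing the text.
sub count_matches {
    my ($text, $pattern) = @_;
    # A global match in list context returns every match; assigning that
    # list to () and then to a scalar yields the number of elements.
    my $n = () = $text =~ /$pattern/g;
    return $n;
}

my $demo_log = join "\n",
    '00:00:29 192.168.20.50 GET /Plastic/ProductTop.html 200',
    '00:00:29 192.168.20.50 GET /plastic/ProductNav.html 200',
    '00:00:29 192.168.20.50 GET /main.html 202';

print "Count of HTML 200 is ", count_matches($demo_log, qr/\.html 200/), "\n";
```

Unlike the s///g version, this returns a plain 0 when nothing matches, so the printed message stays sensible for absent types.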

    Andrew.

Re: Parsing a log file
by Anonymous Monk on Jul 13, 2000 at 05:07 UTC
    Don't ignore the header lines. They tell you the order of the fields, which can change whenever someone reconfigures IIS to emit different information into the logs. Try something along these lines:
    my @fields = ();
    my %fields = ();
    
    while ( <LOG> ) {
        chomp;
        if ( m/^#/ ) {
            if ( s/^#Fields: // ) {
                @fields = split(/ /, $_);
            }
            next;
        }
    
        @fields{@fields} = split(/ /, $_);
    
        my $uri = $fields{'cs-uri-stem'};
    
        # The rest is left as an exercise
    }
    
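One way the exercise might be finished, as a sketch only: it keeps the #Fields-driven hash slice from the reply and adds the per-type, per-code count asked for in the question. The helper name count_by_header, the choice to take lines as arguments instead of reading LOG, and the skip rules are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: tally file type and status code, using the #Fields
# header to locate the columns so a reordered log still parses.
sub count_by_header {
    my @lines = @_;
    my (@names, %count);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^#/) {
            @names = split ' ', $1 if $line =~ /^#Fields:\s*(.*)/;
            next;
        }
        next unless @names;                 # no #Fields line seen yet
        my %row;
        @row{@names} = split ' ', $line;    # hash slice: field name => value
        my ($uri, $status) = @row{ 'cs-uri-stem', 'sc-status' };
        next unless defined $uri && defined $status;
        my ($type) = $uri =~ /(\.\w+)$/ or next;   # skip paths with no extension
        $count{$type}{$status}++;
    }
    return \%count;
}

# The columns are deliberately in a different order from the question's log.
my $demo_tally = count_by_header(
    "#Fields: sc-status cs-uri-stem\n",
    "200 /Plastic/ProductTop.html\n",
    "304 /Plastic/images/ProductNavTopRt.jpg\n",
);
for my $type (sort keys %$demo_tally) {
    print "$type $_ $demo_tally->{$type}{$_}\n"
        for sort keys %{ $demo_tally->{$type} };
}
```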
RE: Parsing a log file
by CMonster (Scribe) on Jul 13, 2000 at 02:24 UTC

    Here's one that should do the trick without splitting thousands of lines. I'm not sure if it's really faster, but in the spirit of TIMTOWTDI:

    open LOG, "test_log" or die "Can't open test_log: $!";
    my %results;
    while (<LOG>) {
        /\.(\w+) (\d\d\d)$/ and $results{$1}{$2}++;
    }
    close LOG;
    foreach my $type (sort keys %results) {
        foreach my $code (sort keys %{$results{$type}}) {
            print "$type $code ".$results{$type}{$code}."\n";
        }
    }

    The chicken tracks at the beginning are a regular expression that looks for the letters after the last . and the three-digit code at the end. It assumes that the comments at the beginning of the file won't have a three-digit code at the end.
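Whether the single regex really beats split is easy to measure with the core Benchmark module. The sketch below assumes synthetic data (the five request lines from the question, repeated) and two made-up sub names; real timings will depend on the actual log and the Perl build, so no numbers are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic data: the five request lines from the question, repeated.
my @bench_lines = (
    '00:00:29 192.168.20.50 GET /Plastic/ProductTop.html 200',
    '00:00:29 192.168.20.50 GET /Plastic/images/ProductNavTopRt.jpg 304',
    '00:00:29 192.168.20.50 GET /plastic/ProductNav.html 200',
    '00:00:29 192.168.20.50 GET /plastic/images/ProductNav1b.jpg 304',
    '00:00:29 192.168.20.50 GET /main.html 202',
) x 2000;

# Split each line into fields, then pull the extension from the path.
sub by_split {
    my %r;
    for (@bench_lines) {
        my @f = split;
        next unless @f >= 5;
        $r{$1}{ $f[4] }++ if $f[3] =~ /\.(\w+)$/;
    }
    return \%r;
}

# Match extension and status code in one pass, as in the reply above.
sub by_regex {
    my %r;
    for (@bench_lines) {
        /\.(\w+) (\d\d\d)$/ and $r{$1}{$2}++;
    }
    return \%r;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, { 'split' => \&by_split, 'regex' => \&by_regex });
```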

RE: Parsing a log file (analog, webalizer)
by ybiC (Prior) on Jul 13, 2000 at 00:29 UTC
    If you're not set on "rolling your own", there are existing Open Source packages that may do what you're looking for.

    I use Analog and Webalizer myself. Both are available as source or binaries for *nix, Win32, etc. I can't say for sure if they deal with whatever logfile format IIS 4 uses, though.
        cheers,
        ybiC

Re: Parsing a log file
by Maclir (Curate) on Jul 13, 2000 at 02:09 UTC
    One thing that you may want to consider is that the file type may not come after the first "." - you could have a file name of /product/widget.2.jpg, for example. Sure, on an IIS-based server, that is most unlikely.

    In that case, the regex to get the file type:

    $type = $1 if $results[3] =~ /(\.\w+)$/; # thanks, Ovid
    would need to extract the last ".something". Sorry there is no code fragment - I am back in work after a really late night here - and the coffee hasn't done its stuff yet.

    Ken
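For what it's worth, a quick check suggests the quoted regex already picks the last ".something", because \w cannot match a dot and $ anchors the match at the end of the path. The wrapper name file_type below is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last ".something" of a path, or undef.
sub file_type {
    my ($path) = @_;
    # \w does not match ".", and $ anchors at the end, so only the final
    # extension can satisfy the pattern.
    return $path =~ /(\.\w+)$/ ? $1 : undef;
}

print file_type('/product/widget.2.jpg'), "\n";   # prints ".jpg"
```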
