PerlMonks  

Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)

by lulz (Initiate)
on Jun 17, 2015 at 03:03 UTC (#1130742)
lulz has asked for the wisdom of the Perl Monks concerning the following question:

I have a question about the most efficient way (in both memory and processing) to read Apache log files. I realize that there are many packages on CPAN that can read log files from Apache servers, but I am curious about what may be the best way to parse these files.

I am a beginner PERL programmer who is curious about the inner workings of both these packages and PERL itself.

I was thinking that it may be possible to use pack and unpack alongside File::Map somehow to be able to parse and process the log lines directly without loading the entire file into memory.

Is my line of thinking flawed? Is it possible to use pack and unpack in this manner with variable-width files? I have read perlpacktut, but it only covers text files in a fixed-width format.

Likewise, for the All-In-One packages for log parsing on CPAN, how do they parse these kinds of files and why?


Replies are listed 'Best First'.
Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by kcott (Chancellor) on Jun 17, 2015 at 07:19 UTC

    G'day lulz,

    Welcome to the Monastery.

    Reading an entire logfile into memory prior to processing would be very much the exception; the norm would be to process the file a line at a time.
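
    A minimal sketch of line-at-a-time reading. The sub name scan_log is hypothetical, and the demo reads from an in-memory handle so that it is self-contained; with a real log you would open the file by path instead. Only the current line is held in memory at any point.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: read a handle one line at a time and count the
# lines and bytes seen. Only the current line is held in memory.
sub scan_log {
    my ($fh) = @_;
    my ($lines, $bytes) = (0, 0);
    while (my $line = <$fh>) {
        $lines++;
        $bytes += length $line;
    }
    return ($lines, $bytes);
}

# Demo on an in-memory handle; for a real file you would use
#   open my $fh, '<', $path or die "Cannot open $path: $!";
my $sample = join '',
    qq{127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET / HTTP/1.1" 200 45\n},
    qq{127.0.0.1 - ken [22/Apr/2015:13:35:21 +1000] "GET /x HTTP/1.1" 500 656\n};

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my ($lines, $bytes) = scan_log($fh);
close $fh;

print "$lines lines, $bytes bytes\n";
```

    The while loop is the whole technique: memory use stays roughly constant however large the file is.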

    The format of each log entry is defined in the Apache configuration file (httpd.conf or whatever you've called it). From my httpd.conf, here are the lines that describe the access_log:

        LogFormat "%h %l %u %t \"%r\" %>s %b" common
        ...
        CustomLog "/private/var/log/apache2/access_log" common

    See the documentation in Apache Module mod_log_config for a description of the %X codes and other related information.

    With that information to hand, it's fairly easy to construct a regex to parse the log records. Here's a script to do that. The three DATA lines are taken verbatim from my access_log file.

        #!/usr/bin/env perl

        use strict;
        use warnings;

        # LogFormat "%h %l %u %t \"%r\" %>s %b" common

        my $re = qr{
            ^
            ( \S+ )                     # capture remote host (%h)
            \s+
            ( \S+ )                     # capture remote logname (%l)
            \s+
            ( \S+ )                     # capture remote user (%u)
            \s+
            \[
            ( [^\]]+ )                  # capture request time (%t) without brackets
            \]
            \s+
            "
            ( (?: [^"\\]++ | \\. )*+ )  # capture first line of request (%r)
            "
            \s+
            ( \d+ )                     # capture final status (%>s)
            \s+
            ( \d+ )                     # capture response size in bytes (%b)
            $
        }x;

        my $format = join '',
            "Host: %s\n",
            "Logname: %s\n",
            "User: %s\n",
            "Time: %s\n",
            "Request: %s\n",
            "Status: %d\n",
            "Size: %d\n\n";

        printf $format, /$re/ while <DATA>;

        __DATA__
        127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509
        127.0.0.1 - ken [22/Apr/2015:13:35:21 +1000] "GET /bin/admin.pl HTTP/1.1" 500 656
        127.0.0.1 - - [24/Apr/2015:04:51:49 +1000] "GET / HTTP/1.1" 200 45

    Output:

        Host: 127.0.0.1
        Logname: -
        User: -
        Time: 22/Apr/2015:13:35:04 +1000
        Request: GET /bin/admin.pl HTTP/1.1
        Status: 401
        Size: 509

        Host: 127.0.0.1
        Logname: -
        User: ken
        Time: 22/Apr/2015:13:35:21 +1000
        Request: GET /bin/admin.pl HTTP/1.1
        Status: 500
        Size: 656

        Host: 127.0.0.1
        Logname: -
        User: -
        Time: 24/Apr/2015:04:51:49 +1000
        Request: GET / HTTP/1.1
        Status: 200
        Size: 45

    Be aware that your configuration may use other logfiles with different LogFormat directives; however, you should be able to construct a suitable regex using the script above as a template. And, of course, you'll probably want to do something more useful than just printing the data.
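
    As one illustration of "something more useful", here is a sketch that tallies requests and bytes per status code. It assumes the common format shown above; the sub name tally_by_status is hypothetical. It also accepts "-" for %b, which Apache writes in that field when no bytes were sent, and it silently skips lines that do not match.

```perl
use strict;
use warnings;

# Same field order as: LogFormat "%h %l %u %t \"%r\" %>s %b" common
my $re = qr{
    ^ (\S+) \s+ (\S+) \s+ (\S+) \s+     # host, logname, user
    \[ ([^\]]+) \] \s+                  # request time without brackets
    " ( (?: [^"\\]++ | \\. )*+ ) " \s+  # first line of request
    (\d+) \s+                           # final status
    (\d+|-)                             # size in bytes, or "-" when nothing was sent
    $
}x;

# Hypothetical helper: count requests and sum bytes for each status code.
sub tally_by_status {
    my (@lines) = @_;
    my %tally;
    for my $line (@lines) {
        chomp $line;
        my ($host, $logname, $user, $time, $request, $status, $size)
            = $line =~ $re
            or next;    # skip lines that do not match the format
        $tally{$status}{requests}++;
        $tally{$status}{bytes} += $size eq '-' ? 0 : $size;
    }
    return \%tally;
}

my @sample = (
    '127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509',
    '127.0.0.1 - ken [22/Apr/2015:13:35:21 +1000] "GET /bin/admin.pl HTTP/1.1" 500 656',
    '127.0.0.1 - - [24/Apr/2015:04:51:49 +1000] "GET / HTTP/1.1" 200 45',
);

my $tally = tally_by_status(@sample);
printf "%s: %d request(s), %d byte(s)\n", $_, @{ $tally->{$_} }{qw(requests bytes)}
    for sort keys %$tally;
```

    When reading a real log you would update the hash inside the line-by-line loop rather than collecting the lines in an array first.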

    -- Ken

Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by BrowserUk (Pope) on Jun 17, 2015 at 04:35 UTC

    Apache log file formats (there are several) consist of space delimited variable width fields:

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

    (The last two quoted strings may not be present.)

    Every one of those fields, except possibly the date and status code, can vary in length; so pack and unpack, which are primarily designed for fixed-length fields and binary work, are the wrong tools for the job.
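
    To make that concrete, a small sketch of a fixed-width unpack template applied to two lines. The template and the second address are made up for illustration; the template happens to fit a 9-character host and nothing else.

```perl
use strict;
use warnings;

# Template: 9-char host, skip 1, 1-char logname, skip 1, rest of line.
my $template = 'A9 x A1 x A*';

# Fits: the host is exactly 9 characters wide.
my @ok  = unpack $template, '127.0.0.1 - frank';

# Does not fit: a 13-character host pushes every later field out of place.
my @bad = unpack $template, '192.168.10.20 - frank';

print "ok:  ", join('|', @ok),  "\n";
print "bad: ", join('|', @bad), "\n";
```

    The second line still unpacks without any error, which is the real hazard: the fields are simply wrong.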

    split is the more applicable tool for your task:

    my @bits = split ' ', $line;
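
    A short sketch of why the ' ' form of split is a convenient choice here: it skips leading whitespace and treats a run of whitespace as one separator, whereas a pattern such as /\s/ splits on every single whitespace character. The sample line is made up, with a leading space and a doubled space added on purpose.

```perl
use strict;
use warnings;

# A line with a leading space and a doubled space, to show the difference.
my $line = ' 127.0.0.1  - frank';

# Special case: skips leading whitespace and splits on runs of whitespace.
my @awk_style = split ' ', $line;

# Splits on each whitespace character, so empty fields appear.
my @per_char = split /\s/, $line;

print scalar(@awk_style), " fields from split ' '\n";
print scalar(@per_char),  " fields from split /\\s/, some of them empty\n";
```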

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      split is fine as long as the character you're splitting on does not occur in your various fields.

      If it can, there has to be a scheme for quoting such fields or escaping such characters. At that point we're pretty much beyond what split can do, and you need either to build the Regular Expression From Hell (which can be plenty fast if you do it right, but you have to do it right) or to use Text::CSV or somesuch.
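
      A middle path, sketched here under stated assumptions, is a small tokenizer that matches one field at a time with \G and the /gc modifiers: a bracketed field, a double-quoted field with backslash escapes, or a bare run of non-whitespace. The sub name tokenize is hypothetical, and this is not a full implementation of Apache's escaping rules.

```perl
use strict;
use warnings;

# Hypothetical helper: split a log line into fields, keeping bracketed
# and double-quoted fields whole. Backslash escapes inside quotes are
# recognised but left in the returned text as-is.
sub tokenize {
    my ($line) = @_;
    my @fields;
    pos($line) = 0;
    while (pos($line) < length($line)) {
        if    ($line =~ /\G\s+/gc)                 { next }
        elsif ($line =~ /\G\[([^\]]*)\]/gc)        { push @fields, $1 }
        elsif ($line =~ /\G"((?:[^"\\]|\\.)*)"/gc) { push @fields, $1 }
        elsif ($line =~ /\G(\S+)/gc)               { push @fields, $1 }
    }
    return @fields;
}

my $line = q{127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"};

print "$_\n" for tokenize($line);
```

      The /c modifier keeps pos() in place when one alternative fails, so the next pattern is tried at the same offset. An unterminated quote or bracket falls through to the \S+ branch rather than looping forever.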

        ...we're pretty much beyond what split can do...

        Mh, we know the format:

            use Data::Dump;
            use feature qw(say);

            my $line = qq(127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509);
            my @bits = split /\s/, $line;
            dd \@bits;
            say qq(Host: $bits[0]);
            say qq(Logname: $bits[1]);
            say qq(User: $bits[2]);
            say qq(Time: $bits[3] $bits[4]);
            say qq(Request: $bits[5] $bits[6] $bits[7]);
            say qq(Status: $bits[8]);
            say qq(Size: $bits[9]);

            __END__

            monks>apache.pl
            [
              "127.0.0.1",
              "-",
              "-",
              "[22/Apr/2015:13:35:04",
              "+1000]",
              "\"GET",
              "/bin/admin.pl",
              "HTTP/1.1\"",
              401,
              509,
            ]
            Host: 127.0.0.1
            Logname: -
            User: -
            Time: [22/Apr/2015:13:35:04 +1000]
            Request: "GET /bin/admin.pl HTTP/1.1"
            Status: 401
            Size: 509

        Regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        He did ask for a learning exercise; not a pre-solved solution.

        Plus, chances are that he'll need to break the composite fields down further anyway, before he can do any analysis or storage.
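
        A sketch of what that further breakdown might look like, assuming the time and request values have already been pulled out of the line without their brackets and quotes. The sub names split_time and split_request are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a time value such as "22/Apr/2015:13:35:04 +1000".
# Returns a hash reference, or nothing if the value does not match.
sub split_time {
    my ($t) = @_;
    my %time;
    @time{qw(day month year hour min sec zone)} =
        $t =~ m{^(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s+([+-]\d{4})$}
        or return;
    return \%time;
}

# Hypothetical helper: split a request line such as "GET /bin/admin.pl HTTP/1.1".
sub split_request {
    my ($r) = @_;
    my ($method, $uri, $protocol) = split ' ', $r, 3;
    return { method => $method, uri => $uri, protocol => $protocol };
}

my $time = split_time('22/Apr/2015:13:35:04 +1000');
my $req  = split_request('GET /bin/admin.pl HTTP/1.1');

print "$time->{year} $time->{month} $time->{day}: $req->{method} $req->{uri}\n";
```

        The month stays as a three-letter name here; turning it into a number or an epoch time is a separate step.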


Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by james28909 (Deacon) on Jun 17, 2015 at 03:09 UTC
    I've never personally seen an Apache logfile; care to post a link to one as an example? I am positive that as long as you can understand the format of the file, it can be parsed. pack and unpack work great; I use them in almost every script I write, though I deal a lot with binary information, and it is easier for me to parse through it if I can see the hexadecimal representation of the data.
    Either way, post that log file and see if we can figure out how it is formatted ;D

    EDIT: Though usually, if there is a module for it, then that will most likely be the fastest way to parse said file 99% of the time.
Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by u65 (Chaplain) on Jun 17, 2015 at 11:49 UTC

    Note the proper reference to the language is Perl, not PERL (ugh!).

Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by Anonymous Monk on Jun 17, 2015 at 07:35 UTC

    Whats the answer (personal learning exercise)

    Um, what?

    "exercise" means "do stuff" not "listen"
