PerlMonks  

Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)

by lulz (Initiate)
on Jun 17, 2015 at 03:03 UTC
lulz has asked for the wisdom of the Perl Monks concerning the following question:

I have a question about the most efficient (in both memory and processing) way to read Apache log files. I realize that there are many many packages on CPAN that can read log files from Apache servers, but I am curious about what may be the best way to parse these files.

I am a beginner PERL programmer who is curious about the inner workings of both these packages and PERL itself.

I was thinking that it may be possible to use pack and unpack alongside File::Map somehow to be able to parse and process the log lines directly without loading the entire file into memory.

Is my line of thinking flawed? Is it possible to use pack and unpack in this manner with variable width files? I have read perlpacktut, but it only explains the capabilities of text files in a fixed width format.

Likewise, for the All-In-One packages for log parsing on CPAN, how do they parse these kinds of files and why?


Replies are listed 'Best First'.
Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by kcott (Chancellor) on Jun 17, 2015 at 07:19 UTC

    G'day lulz,

    Welcome to the Monastery.

    Reading an entire logfile into memory prior to processing would be very much the exception; the norm would be to process the file a line at a time.
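A minimal sketch of the line-at-a-time approach described above. The sub name `count_lines` and the idea of simply counting records are illustrative assumptions; the point is only that the `while` loop holds one line in memory at a time.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Walk a file one line at a time; only the current line is held in memory.
# count_lines is a hypothetical helper name, not something from the thread.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    my $count = 0;
    while (my $line = <$fh>) {    # reads one record per iteration
        chomp $line;
        $count++;                 # replace with real per-line processing
    }
    close $fh;
    return $count;
}

# Usage: perl count_lines.pl /path/to/access_log
print count_lines($ARGV[0]), "\n" if @ARGV;
```

The per-line work (parsing, counting, storing) goes where the counter is incremented; memory use then depends on what you keep, not on the size of the log.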

    The format of each log entry is defined in the Apache configuration file (httpd.conf or whatever you've called it). From my httpd.conf, here are the lines that describe the access_log:

    LogFormat "%h %l %u %t \"%r\" %>s %b" common
    ...
    CustomLog "/private/var/log/apache2/access_log" common

    See the documentation in Apache Module mod_log_config for a description of the %X codes and other related information.

    With that information to hand, it's fairly easy to construct a regex to parse the log records. Here's a script to do that. The three DATA lines are taken verbatim from my access_log file.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    # LogFormat "%h %l %u %t \"%r\" %>s %b" common

    my $re = qr{
        ^
        ( \S+ )                     # capture remote host (%h)
        \s+
        ( \S+ )                     # capture remote logname (%l)
        \s+
        ( \S+ )                     # capture remote user (%u)
        \s+ \[
        ( [^\]]+ )                  # capture request time (%t) without brackets
        \] \s+ "
        ( (?: [^"\\]++ | \\. )*+ )  # capture first line of request (%r)
        " \s+
        ( \d+ )                     # capture final status (%>s)
        \s+
        ( \d+ )                     # capture response size in bytes (%b)
        $
    }x;

    my $format = join '',
        "Host: %s\n",
        "Logname: %s\n",
        "User: %s\n",
        "Time: %s\n",
        "Request: %s\n",
        "Status: %d\n",
        "Size: %d\n\n";

    printf $format, /$re/ while <DATA>;

    __DATA__
    127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509
    127.0.0.1 - ken [22/Apr/2015:13:35:21 +1000] "GET /bin/admin.pl HTTP/1.1" 500 656
    127.0.0.1 - - [24/Apr/2015:04:51:49 +1000] "GET / HTTP/1.1" 200 45

    Output:

    Host: 127.0.0.1
    Logname: -
    User: -
    Time: 22/Apr/2015:13:35:04 +1000
    Request: GET /bin/admin.pl HTTP/1.1
    Status: 401
    Size: 509

    Host: 127.0.0.1
    Logname: -
    User: ken
    Time: 22/Apr/2015:13:35:21 +1000
    Request: GET /bin/admin.pl HTTP/1.1
    Status: 500
    Size: 656

    Host: 127.0.0.1
    Logname: -
    User: -
    Time: 24/Apr/2015:04:51:49 +1000
    Request: GET / HTTP/1.1
    Status: 200
    Size: 45

    Be aware that your configuration may use other logfiles with different LogFormat directives; however, you should be able to construct a suitable regex using the script above as a template. And, of course, you'll probably want to do something more useful than just printing the data.
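As one illustration of adapting the regex to a different LogFormat, here is a sketch for Apache's "combined" format, which appends the Referer and User-Agent headers as two quoted strings. The named captures, the `parse_combined` helper and the choice to make the two trailing fields optional are assumptions for this example, not part of the script above. It also accepts `-` for `%b`, which Apache writes when no bytes were sent.

```perl
use strict;
use warnings;

# LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
my $combined_re = qr{
    \A
    (?<host>    \S+ ) \s+
    (?<logname> \S+ ) \s+
    (?<user>    \S+ ) \s+
    \[ (?<time> [^\]]+ ) \] \s+
    " (?<request> (?: [^"\\]++ | \\. )*+ ) " \s+
    (?<status>  \d+ ) \s+
    (?<size>    \d+ | - )          # %b is "-" when no bytes were sent
    (?:                            # optional: the two extra combined fields
        \s+ " (?<referer> (?: [^"\\]++ | \\. )*+ ) "
        \s+ " (?<agent>   (?: [^"\\]++ | \\. )*+ ) "
    )?
    \s* \z
}x;

# Hypothetical helper: returns a hashref of fields, or nothing on no match.
sub parse_combined {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ $combined_re;
    my %field = %+;                # copy the named captures before they are reset
    return \%field;
}

my $rec = parse_combined(q{127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"});
printf "%s %s %s\n", @{$rec}{qw(host status agent)} if $rec;
```

Named captures cost a little readability in the pattern but mean the calling code no longer depends on the order of the fields.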

    -- Ken

Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by BrowserUk (Pope) on Jun 17, 2015 at 04:35 UTC

    Apache log file formats (there are several) consist of space delimited variable width fields:

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

    (The last two quote strings may not be present.)

    Every one of those fields, except possibly the date and status code, can vary in length; so pack and unpack, which are primarily designed for fixed-length fields and binary work, are the wrong tools for the job.
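To make the fixed-width point concrete, here is a small sketch. The record layout (10-character name, 5-character code, 3-character count) and the helper name `first_ten` are made up for illustration; the second half applies the same kind of template to two log-style lines whose host fields differ in length.

```perl
use strict;
use warnings;

# A made-up fixed-width record: 10-char name, 5-char code, 3-char count.
my $record = sprintf '%-10s%-5s%-3s', 'alpha', 'X1', '42';

# 'A' extracts a space-padded text field and strips the trailing padding.
my ($name, $code, $count) = unpack 'A10 A5 A3', $record;
print "$name|$code|$count\n";

# The same idea applied to a variable-width field (hypothetical helper):
sub first_ten {
    my ($line) = @_;
    my ($field) = unpack 'A10', $line;
    return $field;
}

print first_ten('127.0.0.1 - - [22/Apr/2015:13:35:04 +1000]'), "\n";      # whole host
print first_ten('192.168.100.200 - - [22/Apr/2015:13:35:04 +1000]'), "\n"; # host cut short
```

The template works only while every record puts each field at the same offset, which a space-delimited log line does not guarantee.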

    split is the more applicable tool for your task:

    my @bits = split ' ', $line;

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      split is fine as long as the character you're splitting on does not occur in your various fields.

      If it can, there has to be a scheme for quoting such fields or escaping such characters, and then we're pretty much beyond what split can do. At that point you need to be building the Regular Expression From Hell (which can be plenty fast if you do it right, but you have to do it right) or using Text::CSV or somesuch.
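A sketch of the problem and of one possible middle ground: a small tokenizer that keeps "quoted" and [bracketed] fields whole. The `tokenize` sub is a hypothetical helper written for this illustration, not code from any CPAN module.

```perl
use strict;
use warnings;

my $line = q{127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"};

# Plain split cuts the date, the request and the user agent into pieces.
my @naive = split ' ', $line;

# Hypothetical tokenizer: quoted strings (with backslash escapes) and
# bracketed fields are returned whole, without their delimiters.
sub tokenize {
    my ($text) = @_;
    my @fields;
    while ($text =~ m{ \G \s* (?: " ( (?: [^"\\]++ | \\. )*+ ) "
                                | \[ ( [^\]]* ) \]
                                | ( \S+ ) ) }gcx) {
        push @fields, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @fields;
}

my @fields = tokenize($line);
printf "split: %d pieces, tokenize: %d fields\n", scalar @naive, scalar @fields;
print "agent: $fields[-1]\n";
```

This stays agnostic about which LogFormat produced the line, at the cost of not validating any individual field.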

        ...we're pretty much beyond what split can do...

        Mh, we know the format:

        use Data::Dump;
        use feature qw(say);

        my $line = qq(127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509);
        my @bits = split /\s/, $line;

        dd \@bits;

        say qq(Host: $bits[0]);
        say qq(Logname: $bits[1]);
        say qq(User: $bits[2]);
        say qq(Time: $bits[3] $bits[4]);
        say qq(Request: $bits[5] $bits[6] $bits[7]);
        say qq(Status: $bits[8]);
        say qq(Size: $bits[9]);

        __END__

        monks>apache.pl
        [
          "127.0.0.1",
          "-",
          "-",
          "[22/Apr/2015:13:35:04",
          "+1000]",
          "\"GET",
          "/bin/admin.pl",
          "HTTP/1.1\"",
          401,
          509,
        ]
        Host: 127.0.0.1
        Logname: -
        User: -
        Time: [22/Apr/2015:13:35:04 +1000]
        Request: "GET /bin/admin.pl HTTP/1.1"
        Status: 401
        Size: 509

        Regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        He did ask for a learning exercise; not a pre-solved solution.

        Plus, chances are he'll need to break the composite fields down further anyway, before he can do any analysis or storage.
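A sketch of what breaking the composite fields down might look like, assuming the time and request strings have already been extracted without their brackets and quotes. The helper names `split_request` and `split_time` and the hash keys are illustrative choices.

```perl
use strict;
use warnings;

# %r is "METHOD URI PROTOCOL"; limit the split to three pieces.
sub split_request {
    my ($request) = @_;
    my ($method, $uri, $protocol) = split ' ', $request, 3;
    return { method => $method, uri => $uri, protocol => $protocol };
}

# %t without brackets looks like 22/Apr/2015:13:35:04 +1000.
sub split_time {
    my ($time) = @_;
    my ($day, $mon, $year, $hour, $min, $sec, $tz) =
        $time =~ m{ \A (\d{2}) / (\w{3}) / (\d{4})
                    : (\d{2}) : (\d{2}) : (\d{2})
                    \s+ ([+-]\d{4}) \z }x
        or return;                 # no match: give nothing back
    return {
        day  => $day,  month => $mon, year => $year,
        hour => $hour, min   => $min, sec  => $sec, tz => $tz,
    };
}

my $req  = split_request('GET /bin/admin.pl HTTP/1.1');
my $when = split_time('22/Apr/2015:13:35:04 +1000');
print "$req->{method} $req->{uri} at $when->{hour}:$when->{min} ($when->{tz})\n";
```

Turning the pieces into an epoch value or a date object is a further step; which module to use for that is left open here.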


Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by james28909 (Chaplain) on Jun 17, 2015 at 03:09 UTC
    I've never personally seen an Apache logfile; care to post a link to one as an example? I am positive that as long as you can understand the format of the file, it can be parsed. pack and unpack work great; I use them in almost every script I write, though I deal a lot with binary information, and it is easier for me to parse through it if I can see the hexadecimal representation of the data.
    Either way, post that log file and we'll see if we can figure out how it is formatted ;D

    EDIT: Though usually, if there is a module for it, then that will most likely be the fastest way to parse said file 99% of the time.
Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by u65 (Chaplain) on Jun 17, 2015 at 11:49 UTC

    Note the proper reference to the language is Perl, not PERL (ugh!).

Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by Anonymous Monk on Jun 17, 2015 at 07:35 UTC

    Whats the answer (personal learning exercise)

    Um, what?

    "exercise" means "do stuff" not "listen"

Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)
by sundialsvc4 (Abbot) on Jun 17, 2015 at 13:54 UTC

    Also:   “Yay!!   This is open source!   Therefore, you can see for yourself!”   :-)

    Simply look-up any of those modules on http://search.cpan.org, then click on the Source hyperlink at the top of the page, next to the version-number.   Presto:   there’s the source.   Wanna know what they did and how they did it?   There are no secrets.   Wanna look at the test-suite that runs anytime the package is installed on a computer?   No secrets.

    Now, you might well have stumbled-upon a package which is actually part of another package, such that “the source” actually consists of the whole package.   Well, there’s a hyperlink to that, too ... closer yet to the top of the page, next to the author’s name.

    The source-code to any installed package can also be found in the library directories of your computer.   The PERL5LIB environment-variable (or its equivalent, found in some control-panel, in Windows), will tell you where.

    Some libraries actually use a combination of “pure Perl” and C/C++ subroutines, in a technique called “XS.”   Nevertheless, all the relevant source-code should be right there, along with the “magick glue” that links the two together.

    While you’re getting-to-know Perl, another “must have” CPAN module (family ...) to be aware of is:   Regexp::Common.   (Definitely “click next to the author’s name” on this one!   It’s big!)   A library of many hundreds of commonly-used regular expressions, all of them well-tested so that you don’t have to.

    Although it is, indeed, very educational to “learn from the working examples of others,” it’s also important to “do not do a thing already done.”   CPAN-provided solutions are frequently very thorough, very complete, and very tested.   One might cautiously say that “the Perl language, itself,” is rather ordinary . . . but “the CPAN library” is one of the biggest-and-best in our industry.   “What’s all the fuss about, really?”   To me, the answer is:   CPAN.   Learn it well, use it often.

    Welcome to the Monastery!

      "The source-code to any installed package can also be found in the library directories of your computer. The PERL5LIB environment-variable (or its equivalent, found in some control-panel, in Windows), will tell you where."

      No, that's wrong!

      @INC contains the directories. These directories include whatever's in $PERL5LIB if it's been set or in $PERLLIB if that's been set.

      This is documented in perlrun: ENVIRONMENT:

      PERL5LIB
      A list of directories in which to look for Perl library files before looking in the standard library and the current directory.
      ...
      If PERL5LIB is not defined, PERLLIB is used. ...
      PERLLIB
      A list of directories in which to look for Perl library files before looking in the standard library and the current directory. If PERL5LIB is defined, PERLLIB is not used. ...

      You can confirm this with perl -V (look under %ENV: and @INC: at the end of the output).
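The same check can be done from inside a script. This sketch prints the directories in @INC and then uses %INC to show which file a loaded module actually came from; List::Util is used only because it ships with Perl, and `module_file` is a hypothetical helper name.

```perl
use strict;
use warnings;
use List::Util ();    # any installed module would do

print "Directories searched for modules (\@INC):\n";
print "  $_\n" for @INC;

# %INC maps each loaded module's relative path (Foo/Bar.pm) to the file
# it was loaded from. module_file is a hypothetical convenience wrapper.
sub module_file {
    my ($module) = @_;
    (my $key = "$module.pm") =~ s{::}{/}g;
    return $INC{$key};
}

print "List::Util was loaded from: ", module_file('List::Util'), "\n";
```

Note that %INC only knows about modules that have already been loaded in the current process.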

      Please check your information before posting.

      -- Ken

      The source-code to any installed package can also be found in the library directories of your computer. The PERL5LIB environment-variable (or its equivalent, found in some control-panel, in Windows), will tell you where.

      Uhm...no. I can assure you CPAN modules are not installed in the directory where my personal Perl modules are stored; but rather, in the lib directory under my various Perl installations. While this might be hinted at by the PATH environment variable (as seen in the example below), this is not assured. You can install Perl in a directory called M:\Booger.x , if you were so inclined.

      Also, under Windows, you needn't go to the System Control Panel Environment options screen to examine environment variables; they are still available in the command shell in much the same way they've always been, using the SET command:

            D:\PerlMonks>set | grep -i "perl"
      Path=C:\Steve\Utils;C:\cygwin\bin;C:\App\Java\jdk1.7.0_55\bin;C:\Perl\Perl-5.18.2.1802\site\bin;C:\Perl\Perl-5.18.2.1802\bin;c:\Program Files (x86)\Intel\iCLS Client\;c:\Program Files\Intel\iCLS Client\;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management EngineComponents\IPT;C:\Program Files (x86)\Intel\OpenCL SDK\3.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\3.0\bin\x64;C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files\Microsoft SQL Server\110\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\ManagementStudio\;C:\Program Files (x86)\Microsoft SQL Server\110\DTS\Binn\
      PERL5LIB=C:/Steve/Perl
      D:\PerlMonks>
           

      Update: This is in contrast to the @INC values mentioned by kcott above:

           
      D:\PerlMonks>perl -e "print qq(@INC)"
      C:/Steve/Perl C:/Perl/Perl-5.18.2.1802/site/lib C:/Perl/Perl-5.18.2.1802/lib .
      D:\PerlMonks>
           

      it’s also important to “do not do a thing already done.”

      Who are you quoting there?

      I'd like to look up why they think it is important to not do a thing already done.

        See also "re-inventing the wheel." This applies less to a learning exercise than production code produced under time pressure.

        Dum Spiro Spero

        Frankly, for a beginner (which is ostensibly one of our key target audiences), I would encourage doing a thing already done, so you can compare how you approached the problem to those who have come before you.

        Most people learn by doing; or, perhaps more accurate to say that they cement their learnings by doing.

        So I, for one, take the opposite stance: It is vitally important for most beginners to do many things already done. That's how they develop their art.

        Plus, on the ever-present practical note, sometimes you can't use CPAN modules. We've hashed that over and over here at the Monastery for years; but the truth is sometimes you don't have the option to drink deeply from the CPAN well.

        At which point you need to have your art reasonably well-developed, as you will be re-inventing that wheel, and it sure would be nice if you made a decent showing of yourself in the process.

      Thank you for your reply!

      I have been looking at Apache::ParseLog from CPAN and then read the comments. Then I switched over to App::YG::Apache, whose parser is much, much more elegant and readable, but limited to one case. I know from my limited experience that the Perl regex engine is extremely fast when it comes to finding matches, but can take quite a long time on failures. So I was wondering whether, for a more general case, there is yet another way to parse these files without regular expressions that might be faster and that I could play around with.
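One regex-free approach to experiment with is plain index, rindex and substr. The sketch below handles only the common format (`%h %l %u %t "%r" %>s %b`) and assumes the host, logname and user fields contain no spaces; `parse_common` is a hypothetical helper. Whether it is actually faster than a well-written regex would have to be measured, for example with the core Benchmark module.

```perl
use strict;
use warnings;

# Parse one common-format line using only index/rindex/substr.
# Returns a hashref, or nothing if the expected delimiters are missing.
sub parse_common {
    my ($line) = @_;
    chomp $line;

    # The first three fields are delimited by single spaces.
    my $p1 = index $line, ' ';
    my $p2 = index $line, ' ', $p1 + 1;
    my $p3 = index $line, ' ', $p2 + 1;
    return if $p1 < 0 || $p2 < 0 || $p3 < 0;

    # The time sits between [ and ]; the request between the first quote
    # after the time and the last quote on the line.
    my $t_open  = index $line, '[', $p3;
    return if $t_open < 0;
    my $t_close = index $line, ']', $t_open + 1;
    return if $t_close < 0;
    my $q_open  = index $line, '"', $t_close + 1;
    my $q_close = rindex $line, '"';
    return if $q_open < 0 || $q_close <= $q_open;

    # Whatever follows the closing quote is "status size".
    my ($status, $size) = split ' ', substr($line, $q_close + 1);

    my %rec = (
        host    => substr($line, 0, $p1),
        logname => substr($line, $p1 + 1, $p2 - $p1 - 1),
        user    => substr($line, $p2 + 1, $p3 - $p2 - 1),
        time    => substr($line, $t_open + 1, $t_close - $t_open - 1),
        request => substr($line, $q_open + 1, $q_close - $q_open - 1),
        status  => $status,
        size    => $size,
    );
    return \%rec;
}

my $hit = parse_common('127.0.0.1 - ken [22/Apr/2015:13:35:21 +1000] "GET /bin/admin.pl HTTP/1.1" 500 656');
print join(' | ', @{$hit}{qw(host user time request status size)}), "\n" if $hit;
```

Because it keys off the last quote on the line, this version would misread a combined-format line; it is a starting point for experiments rather than a general parser.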

      “What’s all the fuss about, really?”

      Indeed.

      «The Crux of the Biscuit is the Apostrophe»

        You're on dangerous ground here, karlgoethebier. Don't paradox, you'll destroy the universe!
