Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise)

lulz has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by kcott (Archbishop) on Jun 17, 2015 at 07:19 UTC
G'day lulz,

Welcome to the Monastery.

Reading an entire logfile into memory prior to processing would be very much the exception; the norm would be to process the file a line at a time.

The format of each log entry is defined in the Apache configuration file (`httpd.conf` or whatever you've called it). From my `httpd.conf`, here are the lines that describe the `access_log`:

    LogFormat "%h %l %u %t \"%r\" %>s %b" common
    ...
    CustomLog "/private/var/log/apache2/access_log" common

See the documentation in Apache Module mod_log_config for a description of the `%X` codes and other related information.

With that information to hand, it's fairly easy to construct a regex to parse the log records. Here's a script to do that. The three `DATA` lines are taken verbatim from my `access_log` file.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # LogFormat "%h %l %u %t \"%r\" %>s %b" common
    my $re = qr{
        ^
        ( \S+ )         # capture remote host (%h)
        \s+
        ( \S+ )         # capture remote logname (%l)
        \s+
        ( \S+ )         # capture remote user (%u)
        \s+ \[
        ( [^\]]+ )      # capture request time (%t) without brackets
        \] \s+ "
        (
            (?:
                [^"\\]++
                |
                \\.
            )*+
        )               # capture first line of request (%r)
        " \s+
        ( \d+ )         # capture final status (%>s)
        \s+
        ( \d+ )         # capture response size in bytes (%b)
        $
    }x;

    my $format = join '',
        "Host: %s\n",
        "Logname: %s\n",
        "User: %s\n",
        "Time: %s\n",
        "Request: %s\n",
        "Status: %d\n",
        "Size: %d\n\n";

    printf $format, /$re/ while <DATA>;

    __DATA__
    127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509
    127.0.0.1 - ken [22/Apr/2015:13:35:21 +1000] "GET /bin/admin.pl HTTP/1.1" 500 656
    127.0.0.1 - - [24/Apr/2015:04:51:49 +1000] "GET / HTTP/1.1" 200 45

Output:

    Host: 127.0.0.1
    Logname: -
    User: -
    Time: 22/Apr/2015:13:35:04 +1000
    Request: GET /bin/admin.pl HTTP/1.1
    Status: 401
    Size: 509

    Host: 127.0.0.1
    Logname: -
    User: ken
    Time: 22/Apr/2015:13:35:21 +1000
    Request: GET /bin/admin.pl HTTP/1.1
    Status: 500
    Size: 656

    Host: 127.0.0.1
    Logname: -
    User: -
    Time: 24/Apr/2015:04:51:49 +1000
    Request: GET / HTTP/1.1
    Status: 200
    Size: 45

Be aware that your configuration may use other logfiles with different `LogFormat` directives; however, you should be able to construct a suitable regex using the script above as a template. And, of course, you'll probably want to do something more useful than just printing the data.

-- Ken
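As a rough illustration of "something more useful", the sketch below reads records one at a time and tallies requests and bytes per status code. None of it comes from the thread: `summarise` is a made-up helper name, the fourth sample line is invented, and the size pattern is loosened to accept `-`, which the mod_log_config documentation describes for `%b` when no bytes were sent.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Common Log Format: %h %l %u %t "%r" %>s %b
# Assumption: %b may be logged as "-" when no bytes were sent,
# so the last field accepts digits or a single dash.
my $re = qr{
    ^ (\S+) \s+ (\S+) \s+ (\S+) \s+     # host, logname, user
    \[ ([^\]]+) \] \s+                  # time, without brackets
    " ( (?: [^"\\]++ | \\. )*+ ) " \s+  # first line of request
    (\d{3}) \s+                         # final status
    (\d+ | -)                           # response size, or a dash
    $
}x;

# summarise() is a hypothetical helper: it reads a filehandle one line
# at a time and never holds more than the current record in memory.
sub summarise {
    my ($fh) = @_;
    my (%count, %bytes);
    my $skipped = 0;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = $line =~ $re;
        unless (@fields) { $skipped++; next }    # line did not match the format
        my ($status, $size) = @fields[5, 6];
        $count{$status}++;
        $bytes{$status} += $size eq '-' ? 0 : $size;
    }
    return (\%count, \%bytes, $skipped);
}

# Sample data: the first three lines are from the post above; the fourth
# is invented to exercise the "-" size case.
my $sample = <<'END';
127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509
127.0.0.1 - ken [22/Apr/2015:13:35:21 +1000] "GET /bin/admin.pl HTTP/1.1" 500 656
127.0.0.1 - - [24/Apr/2015:04:51:49 +1000] "GET / HTTP/1.1" 200 45
127.0.0.1 - - [24/Apr/2015:04:51:50 +1000] "GET / HTTP/1.1" 304 -
END

# An in-memory filehandle stands in for the real access_log here.
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my ($count, $bytes, $skipped) = summarise($fh);
close $fh;

printf "%s: %d request(s), %d byte(s)\n", $_, $count->{$_}, $bytes->{$_}
    for sort keys %$count;
print "Skipped $skipped unparseable line(s)\n";
```

For a real file, the in-memory handle would be replaced by an ordinary `open` on the log path; the loop itself stays the same.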
Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by BrowserUk (Patriarch) on Jun 17, 2015 at 04:35 UTC
Apache log file formats (there are several) consist of space-delimited, variable-width fields:

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

(The last two quoted strings may not be present.)

Every one of those fields, except possibly the date and status code, can vary in length; so pack and unpack, which are primarily designed for fixed-length fields and binary work, are the wrong tools for the job.

split is the more applicable tool for your task:

    my @bits = split ' ', $line;

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^2: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by wrog (Friar) on Jun 17, 2015 at 04:58 UTC
`split` is fine as long as the character you're splitting on does not occur in your various fields. If it can, which means there has to be a scheme for quoting such fields or escaping such characters, then we're pretty much beyond what `split` can do and at the point where you need to be building the Regular Expression From Hell (which can be plenty fast if you do it right, but you have to do it right) or using Text::CSV or somesuch.
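To make that concrete, here is a minimal sketch of a tokenizing regex that keeps bracketed and double-quoted fields whole, set against a plain whitespace `split` on the sample line quoted in the parent post. `tokenize` is a hypothetical helper name, and the quoting rules (square brackets for the time, backslash escapes inside double quotes) are assumptions based on the log lines shown in this thread.

```perl
use strict;
use warnings;

# tokenize() is a hypothetical helper: it walks the line once and returns
# bracketed, double-quoted, and bare fields as single tokens.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    while ($line =~ m{
        \G \s*                              # continue where the last match ended
        (?:
            \[ ( [^\]]* ) \]                # bracketed field, such as the time
          | " ( (?: [^"\\]++ | \\. )*+ ) "  # quoted field, backslash escapes allowed
          | ( \S+ )                         # bare field
        )
    }gcx) {
        push @tokens, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @tokens;
}

my $line = q{127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"};

my @naive  = split ' ', $line;   # breaks the time, request and user agent apart
my @tokens = tokenize($line);    # keeps each composite field in one piece

printf "split ' ' gives %d pieces, tokenize() gives %d fields\n",
    scalar @naive, scalar @tokens;
print "  $_\n" for @tokens;
```

The possessive quantifiers (`++`, `*+`) stop the regex engine from backtracking into a quoted field, which is part of what keeps a pattern like this fast on malformed lines.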
Re^3: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by karlgoethebier (Abbot) on Jun 17, 2015 at 09:46 UTC
"...we're pretty much beyond what split can do..."

Mh, we know the format:

    use Data::Dump;
    use feature qw(say);

    my $line = qq(127.0.0.1 - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509);

    my @bits = split /\s/, $line;

    dd \@bits;

    say qq(Host: $bits[0]);
    say qq(Logname: $bits[1]);
    say qq(User: $bits[2]);
    say qq(Time: $bits[3] $bits[4]);
    say qq(Request: $bits[5] $bits[6] $bits[7]);
    say qq(Status: $bits[8]);
    say qq(Size: $bits[9]);

    __END__

    monks>apache.pl
    [
      "127.0.0.1",
      "-",
      "-",
      "[22/Apr/2015:13:35:04",
      "+1000]",
      "\"GET",
      "/bin/admin.pl",
      "HTTP/1.1\"",
      401,
      509,
    ]
    Host: 127.0.0.1
    Logname: -
    User: -
    Time: [22/Apr/2015:13:35:04 +1000]
    Request: "GET /bin/admin.pl HTTP/1.1"
    Status: 401
    Size: 509

Regards, Karl

«The Crux of the Biscuit is the Apostrophe»
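One detail worth knowing when relying on fixed indices: `split /\s/` treats every single whitespace character as a separator, so a doubled space produces an empty field and shifts everything after it, whereas the special case `split ' '` skips leading whitespace and treats a run of whitespace as one separator. A small sketch, using the same record with an invented extra space after the host:

```perl
use strict;
use warnings;

# Same record as in the post above, but with two spaces after the host
# (an invented variation, purely to show the difference).
my $line = q{127.0.0.1  - - [22/Apr/2015:13:35:04 +1000] "GET /bin/admin.pl HTTP/1.1" 401 509};

my @single = split /\s/, $line;   # one separator per whitespace character
my @awk    = split ' ',  $line;   # awk-style: runs of whitespace count once

printf "split /\\s/: %2d fields, index 8 holds '%s'\n", scalar @single, $single[8];
printf "split ' ' : %2d fields, index 8 holds '%s'\n", scalar @awk,    $awk[8];
```

With `split ' '` the status code stays at index 8; with `split /\s/` the empty field pushes it to index 9.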
Re^4: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by lulz (Initiate) on Jun 17, 2015 at 19:13 UTC
Re^5: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by karlgoethebier (Abbot) on Jun 17, 2015 at 19:53 UTC
Re^4: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by wrog (Friar) on Jun 17, 2015 at 15:43 UTC
Re^5: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by karlgoethebier (Abbot) on Jun 17, 2015 at 17:28 UTC
Re^3: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by BrowserUk (Patriarch) on Jun 17, 2015 at 10:08 UTC
He did ask for a learning exercise, not a pre-solved solution. Plus, chances are that he'll need to break the composite fields down further anyway before he can do any analysis or storage.
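A minimal sketch of what breaking the composite fields down further might look like, assuming the time and request formats seen in the sample lines in this thread. `split_time` and `split_request` are hypothetical helper names, and treating the protocol as optional is an assumption made to tolerate short or malformed request lines.

```perl
use strict;
use warnings;

# Map three-letter month names to numbers.
my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 1 .. 12;

# "22/Apr/2015:13:35:04 +1000" -> list of key/value pairs, or nothing on mismatch.
# split_time() is a hypothetical helper name.
sub split_time {
    my ($time) = @_;
    my ($day, $mon, $year, $h, $m, $s, $tz) = $time =~ m{
        ^ (\d{2}) / (\w{3}) / (\d{4})
        : (\d{2}) : (\d{2}) : (\d{2})
        \s+ ([+-]\d{4}) $
    }x or return;
    return unless exists $MONTH{$mon};
    return (
        year => $year,  month => $MONTH{$mon}, day => $day + 0,
        hour => $h + 0, min   => $m + 0,       sec => $s + 0,
        tz   => $tz,
    );
}

# "GET /bin/admin.pl HTTP/1.1" -> (method, path, protocol).
# split_request() is a hypothetical helper name; the protocol part is
# optional in the pattern (an assumption, see the text above).
sub split_request {
    my ($request) = @_;
    my ($method, $path, $proto) =
        $request =~ m{ ^ (\S+) \s+ (\S+) (?: \s+ (\S+) )? $ }x
        or return;
    return ($method, $path, $proto);
}

my %t = split_time('22/Apr/2015:13:35:04 +1000');
printf "%04d-%02d-%02d %02d:%02d:%02d %s\n",
    @t{qw(year month day hour min sec tz)};

my ($method, $path, $proto) = split_request('GET /bin/admin.pl HTTP/1.1');
print "method=$method path=$path proto=$proto\n";
```

Both helpers return an empty list when the input does not match, so a caller can skip a bad record instead of working with half-filled fields.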
Re^4: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by karlgoethebier (Abbot) on Jun 17, 2015 at 17:21 UTC
Re^5: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by BrowserUk (Patriarch) on Jun 17, 2015 at 18:10 UTC
Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by james28909 (Deacon) on Jun 17, 2015 at 03:09 UTC
I've never personally seen an Apache logfile; care to post a link to one as an example? I am positive that, as long as you can understand the format of the file, it can be parsed. Pack and unpack work great, and I use them in almost every script I write, though I deal a lot with binary information, and it is easier for me to parse through it if I can see the hexadecimal representation of the data. Either way, post that log file and we'll see if we can figure out how it is formatted ;D

EDIT: Though usually, if there is a module for it, then that will most likely be the fastest way to parse said file 99% of the time.
Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by u65 (Chaplain) on Jun 17, 2015 at 11:49 UTC
Note that the proper reference to the language is Perl, not PERL (ugh!).
Re: Question about the most efficient way to read Apache log files without All-In-One Modules from CPAN (personal learning exercise) by Anonymous Monk on Jun 17, 2015 at 07:35 UTC
"Whats the answer (personal learning exercise)"

Um, what? "Exercise" means "do stuff", not "listen".