PerlMonks  

Log regexp

by kazak (Beadle)
on Aug 03, 2012 at 13:45 UTC (#985241=perlquestion)
kazak has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to parse a line from Apache's access.log (as lots of people have before me; yes, I saw Apache::LogParse, etc.). I have two reasons for doing it myself: the access log format is a bit different from the default one, so I need a regexp for parsing it, and I want to understand where I'm going wrong in order to improve my Perl skills.

So straight to the business.

133.133.133.133, 87.87.87.87 127.0.0.1 - - [21/Apr/2012:04:35:01 +0200] "GET /seo/vbseocp.php HTTP/1.0" 404 300 "-" "Internet Explorer 6.0"
95.95.95.95, 87.87.87.87 127.0.0.1 - - [22/Apr/2012:04:00:43 +0200] "GET / HTTP/1.0" 200 10211 "http://yandex.ru/yandsearch?text=example.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

And this is the regexp:

m/^(\S+)\, (\S+) (\S+) \- \- \[(\d{2})\/(\w+)\/(\d{4})\:(\d{2})\:(\d{2})\:(\d{2})\+(\d{4})\] \"\S+\" (\d{3}) (\d+|-) \"(.*?)\" \"\.*?\"$/

It would be great if someone showed me where I'm wrong and how it should be done. Thanks in advance.

Re: Log regexp
by toolic (Chancellor) on Aug 03, 2012 at 13:51 UTC
      Thx for help, I'll use this.
Re: Log regexp
by Athanasius (Monsignor) on Aug 03, 2012 at 16:23 UTC

    Hello kazak,

    As you’ve no doubt discovered, the difficulty with a complex regex is that a single error can cause the whole match to fail, producing no output. One way to attack this kind of problem is to use a divide-and-conquer strategy by breaking down the regex into smaller, more manageable chunks. The split function can be useful here. Looking at your example log lines, it appears that each line can be usefully split on spaces:

    #! perl
    use strict;
    use warnings;
    use Data::Dumper;

    for ('133.133.133.133, 87.87.87.87 127.0.0.1 - - [21/Apr/2012:04:35:01 +0200] "GET /seo/vbseocp.php HTTP/1.0" 404 300 "-" "Internet Explorer 6.0"',
         '95.95.95.95, 87.87.87.87 127.0.0.1 - - [22/Apr/2012:04:00:43 +0200] "GET / HTTP/1.0" 200 10211 "http://yandex.ru/yandsearch?text=example.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"')
    {
        my @parts = split;
        my %terms;

        $terms{ip1} = $parts[0] =~ s/ , $ //rx;
        $terms{ip2} = $parts[1];
        $terms{ip3} = $parts[2];

        # $parts[3] eq '-': discard
        # $parts[4] eq '-': discard

        @terms{qw(day month year hour min sec)} =
            $parts[5] =~ m! ^ \[ (\d{2}) / (\w+) / (\d{4}) : (\d{2}) : (\d{2}) : (\d{2}) $ !x;

        ($terms{offset}) = $parts[6] =~ m! ^ \+ (\d{4}) \] $ !x;

        # $parts[7] eq '"GET': discard
        # $parts[8] eq '/...': discard
        # $parts[9] eq 'HTTP/1.0': discard

        $terms{int1} = $parts[10];
        $terms{int2} = $parts[11];

        my $rest = join(' ', @parts[12 .. $#parts]);

        $terms{strs} = [];
        push @{ $terms{strs} }, $1 while $rest =~ / ( \" [^\"]*? \" ) /gx;

        print Dumper(\%terms), "\n";
    }

    Once you have this working, you can convert it back into a single regex if you really want to. But I don’t see that this would gain you anything.
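
    For illustration only, here is a minimal sketch of what folding the split-based steps back into a single regex might look like. The capture names and the parse_line helper are hypothetical and not from the post; the /x flag and named captures (Perl 5.10 or later) are used purely to keep the pieces readable, and the pattern assumes the log format shown in the two sample lines.

```perl
use strict;
use warnings;

# Hypothetical single-regex counterpart of the split-based parser.
# Capture names are illustrative assumptions, not part of the original post.
my $log_re = qr{
    ^ (?<ip1> \S+? ) , \s+          # first address, without its trailing comma
      (?<ip2> \S+ )    \s+
      (?<ip3> \S+ )    \s+
      - \s+ - \s+                   # two discarded '-' fields
      \[ (?<day> \d{2} ) / (?<month> \w+ ) / (?<year> \d{4} )
         : (?<hour> \d{2} ) : (?<min> \d{2} ) : (?<sec> \d{2} )
         \s+ \+ (?<offset> \d{4} ) \] \s+
      " [^"]* "        \s+          # request line, discarded
      (?<int1> \d{3} ) \s+
      (?<int2> \d+ | - ) \s+
      " (?<referer> [^"]* ) " \s+
      " (?<agent>   [^"]* ) "
    $
}x;

# Returns a hash reference of named captures, or nothing if the line does not match.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ $log_re;
    my %captures = %+;              # copy the named captures into a plain hash
    return \%captures;
}

my $line = '133.133.133.133, 87.87.87.87 127.0.0.1 - - [21/Apr/2012:04:35:01 +0200] "GET /seo/vbseocp.php HTTP/1.0" 404 300 "-" "Internet Explorer 6.0"';
my $rec  = parse_line($line);
print "$_ = $rec->{$_}\n" for sort keys %$rec;
```

    As with any single large pattern, one wrong piece makes the whole match fail, which is the trade-off the split-based version avoids.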

    HTH,

    Athanasius <°(((><contra mundum

      Thx it helped.
Re: Log regexp
by hbm (Hermit) on Aug 03, 2012 at 17:00 UTC

    I took this:

    $_ = qq{133.133.133.133, 87.87.87.87 127.0.0.1 - - [21/Apr/2012:04:35:01 +0200] "GET /seo/vbseocp.php HTTP/1.0" 404 300 "-" "Internet Explorer 6.0" 95.95.95.95, 87.87.87.87 127.0.0.1 - - [22/Apr/2012:04:00:43 +0200] "GET / HTTP/1.0" 200 10211 "http://yandex.ru/yandsearch?text=example.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"};
    m/^(\S+)\, (\S+) (\S+) \- \- \[(\d{2})\/(\w+)\/(\d{4})\:(\d{2})\:(\d{2})\:(\d{2})\+(\d{4})\] \"\S+\" (\d{3}) (\d+|-) \"(.*?)\" \"\.*?\"$/;
    print map { "[$_]\n" } $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13;

    And tweaked your expression down to this:

    m@^(\S+?), (\S+) (\S+) - - \[(\d{2})/(\w+)/(\d{4}):\s*(\d{2}):(\d{2}):\s*(\d{2}) \+(\d{4})\] ".*?" (\d{3}) (\d+) "(.*?)" "(.*?)"@;
    #      1                                            2                  3                       4             5              6   7
    1. I made \S non-greedy, to not gobble up the comma.
    2. I changed one literal space to \s* (Your example doesn't have spaces.)
    3. Same as #2.
    4. I changed "greedy non-whitespace" (which was failing b/c the string within has whitespace) to "non-greedy non-quotes"
    5. I don't know what you intended with "(\d+|-)"...
    6. Again, non-greedy non-quotes - I think that's what you wanted.
    7. I removed the end-of-line anchor ($), because you aren't currently matching to the end.

    Also note I changed the delimiter to something NOT in your content, so it doesn't need to be escaped. And I did NOT escape these other characters: -:"
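
    Note 4 is easy to check in isolation. The following is a small sketch, not part of the original reply, that tests three ways of matching the quoted request field from the sample line; the variable names are hypothetical.

```perl
use strict;
use warnings;

# The quoted request field from the first sample log line.
my $request = '"GET /seo/vbseocp.php HTTP/1.0"';

# Non-whitespace only: cannot cross the spaces inside the quotes.
my $with_nonspace = $request =~ m{^"\S+"$}   ? 'match' : 'no match';

# Non-greedy "anything": expands until the closing quote.
my $with_lazy     = $request =~ m{^".*?"$}   ? 'match' : 'no match';

# "Anything but a quote": also stops at the closing quote.
my $with_negated  = $request =~ m{^"[^"]*"$} ? 'match' : 'no match';

print "non-whitespace: $with_nonspace\n";
print "non-greedy:     $with_lazy\n";
print "negated class:  $with_negated\n";
```

    The braces used as delimiters here follow the same idea as the @ delimiter above: neither the slash nor the double quote needs a backslash.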

    Update: My numbered notes line up if you click 'download'.

      Thanks for detailed explanation, it helped.
Re: Log regexp
by DamianConway (Beadle) on Aug 03, 2012 at 21:52 UTC
    If you're using Perl 5.10.1 or later, you may also find Regexp::Debugger useful for understanding where your regex isn't behaving as you expect. In this case, the module took about 2 seconds to show me that:
    \+(\d{4})
    in your regex (which is supposed to match the timezone within the timestamp) is failing to match the leading space of:
    <SPACE>+0200
    in your actual string.
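
    The effect can be reproduced on the timestamp alone. This is a minimal sketch, assuming the timestamp format from the sample lines; the variable names are hypothetical, and the only difference between the two patterns is the literal space before \+.

```perl
use strict;
use warnings;

my $stamp = '[21/Apr/2012:04:35:01 +0200]';

# Timestamp fragment as posted: no space allowed between the seconds and '+'.
my $posted = qr/\[(\d{2})\/(\w+)\/(\d{4}):(\d{2}):(\d{2}):(\d{2})\+(\d{4})\]/;

# Same fragment with the literal space restored before the timezone offset.
my $fixed  = qr/\[(\d{2})\/(\w+)\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) \+(\d{4})\]/;

my $posted_ok = $stamp =~ $posted ? 1 : 0;
my @fields    = $stamp =~ $fixed;

print "posted fragment matches: $posted_ok\n";   # prints 0
print "fixed fragment captures: @fields\n";      # prints 21 Apr 2012 04 35 01 0200
```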

    Damian

      Thx I fixed it.
