
A rare, insidious logfile parsing pitfall

by dws (Chancellor)
on Oct 27, 2001 at 06:55 UTC (#121721=perlmeditation)

By this point, you'd think the topic of web server logfile parsing would be completely mined out, with all of the rough edges filed off, if not pounded completely flat. Something that showed up recently in my logfiles, coupled with what I've seen recently in print, suggests that the topic still has some unmapped pitfalls.

Here's the familiar drill: To parse a "common log format" file (assuming you're doing it yourself), the conventional wisdom says to write:

    while (<>) {
        my ($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes) =
        /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/;
        ...

Yawn. Been there, done that, right?

Maybe not. Let's take a second look at $auth_user. Unless you're using basic authentication to password protect pages, you'll see this in your logs as '-'. No problem there. And if you are using basic authentication, you'll see a username. No problem there... unless the username contains whitespace, at which point the regexp fails to match. And since there's no check to see if it fails...
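A minimal sketch of the failure, and of the missing check. The sample log lines, the address 10.0.0.1, and the helper name parse_strict are made up for illustration; the pattern is the conventional one, with \S+ for the auth user.

```perl
use strict;
use warnings;

# The conventional pattern: \S+ for the auth user, so no whitespace allowed.
my $clf = qr/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/;

# Hypothetical helper: returns the eleven captured fields when called in
# list context, or an empty list if the line does not match.
sub parse_strict {
    my ($line) = @_;
    return $line =~ $clf;
}

# Hypothetical sample lines; the second has a username with embedded spaces.
my @lines = (
    '10.0.0.1 - dws [27/Oct/2001:06:55:00 -0700] "GET /index.html HTTP/1.0" 200 1234',
    '10.0.0.1 - d w s [27/Oct/2001:06:55:00 -0700] "GET /index.html HTTP/1.0" 200 1234',
);

my $skipped = 0;
for my $line (@lines) {
    my @fields = parse_strict($line);
    unless (@fields) {
        # Without this check, the variables would silently be undef.
        $skipped++;
        warn "unparsed log line: $line\n";
        next;
    }
    print "auth user: $fields[2]\n";
}
print "$skipped line(s) skipped\n";
```

Counting or logging the rejected lines at least makes the problem visible, rather than letting undefined fields leak into the statistics.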

But can a username contain whitespace? Let's see.

    % htpasswd .htpasswd 'd w s'
    New password:
    Re-type new password:
    Adding password for user d w s
    %

D'oh! RFC1945 says you aren't supposed to be able to do this! (Update: RFC1945 is obsolete. RFC2617 suggests that embedded spaces are OK. Hm...)

The simple solution would seem to be "So, don't do that!", but here's where things get stranger. I've recently seen a case where somebody apparently presented an Authorization: header to a non-protected resource on my site, resulting in a bogus name appearing in the logs. I say "apparently" because I've been able to duplicate the behavior, and I can't think of any other way for the bogus username to have appeared. A minor annoyance, or the basis for a crude denial-of-accurate-service attack against log analysis software.

Fortunately, the solution is straightforward. All you have to do is change

    /^(\S+) (\S+) (\S+) \[ ...

to

    /^(\S+) (\S+) (.+) \[ ...

This makes the regex much less efficient, since it's going to backtrack to match '[', but it will resolve the problem, even if someone forges the username "dws [".
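Here is a short sketch of the loosened pattern at work. The helper name parse_loose and the sample lines are hypothetical; the only change from the conventional pattern is (.+) in place of the third (\S+).

```perl
use strict;
use warnings;

# Common-log-format pattern with the auth-user capture loosened to (.+),
# so the username may contain spaces (or even a '[').
my $clf = qr/^(\S+) (\S+) (.+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/;

# Hypothetical helper: returns the eleven captured fields when called in
# list context, or an empty list if the line does not match.
sub parse_loose {
    my ($line) = @_;
    return $line =~ $clf;
}

# Hypothetical sample lines: no auth, a spaced username, a forged "dws [".
my @lines = (
    '10.0.0.1 - - [27/Oct/2001:06:55:00 -0700] "GET / HTTP/1.0" 200 512',
    '10.0.0.1 - d w s [27/Oct/2001:06:55:00 -0700] "GET / HTTP/1.0" 200 512',
    '10.0.0.1 - dws [ [27/Oct/2001:06:55:00 -0700] "GET / HTTP/1.0" 200 512',
);

for my $line (@lines) {
    my @fields = parse_loose($line);
    if (@fields) {
        print "auth user: '$fields[2]', date: $fields[3]\n";
    } else {
        warn "unparsed log line: $line\n";
    }
}
```

The greedy (.+) first grabs the rest of the line and then gives characters back until the remainder of the pattern can match, which is where the extra backtracking cost comes from.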

Replies are listed 'Best First'.
Re: A rare, insidious logfile parsing pitfall
by echo (Pilgrim) on Oct 27, 2001 at 14:57 UTC
    I've been using user names with embedded spaces for a long time now. You're quoting RFC 1945, which has been obsoleted twice; the latest is RFC 2616, though I haven't checked whether it changes the rules. I'm not sure it matters much, because Apache does not escape anything when writing to the logs: there's no untainting of user-supplied fields such as the request URI, the Referer header, or the User-Agent header. It's been known for quite a while that these can fool a human reading the logs from a shell with 'cat' or 'tail', e.g. by disrupting the display with VT control sequences embedded in one of those fields.
    Anyone thinking such a log can be parsed with regexps is in for a surprise... Recently the Apache dev list has discussed the possibility of providing a switch that would properly escape fields written to the log.
Re: A rare, insidious logfile parsing pitfall
by Fletch (Bishop) on Oct 28, 2001 at 07:32 UTC

    You might have a problem if another bracket shows up later on in the line for some reason. Try this:

    /^(\S+) (\S+) ([^\[]+) \[...

    Of course if you're using Apache just use the CustomLog directive to define your own easily parsed log file format and be done with it.

      Inside a character class, you don't need to escape '['. Your class above can legally be written as [^[]. You do need to escape ']' inside a character class, however, otherwise it will be interpreted as the closing bracket of your class. So, [^a]] means "any character other than 'a' followed by a ']'."

      It's the exact opposite elsewhere in the regex, where an unescaped ']' is literal but '[' is not. To avoid mistakes, I usually escape both of them to get their literal interpretation no matter what their context.

      Just a clarification, not necessarily a correction.
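The bracket rules described above can be checked with a few lines of Perl. This is only a sketch; the sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Inside a character class, '[' is literal with or without a backslash.
my $unescaped = qr/^([^[]+) \[/;
my $escaped   = qr/^([^\[]+) \[/;

my $text = 'dws [27/Oct/2001';    # hypothetical sample
my ($from_unescaped) = $text =~ $unescaped;
my ($from_escaped)   = $text =~ $escaped;
print "unescaped class captured: $from_unescaped\n";
print "escaped class captured:   $from_escaped\n";

# An unescaped ']' closes the class, so [^a]] is "[^a]" followed by "]".
print "'b]' matches\n"        if     'b]' =~ /^[^a]]$/;
print "'a]' does not match\n" unless 'a]' =~ /^[^a]]$/;
```

Both forms of the class capture the same text, so escaping '[' inside the class is a matter of habit rather than necessity.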

