PerlMonks  

By this point, you'd think the topic of web server logfile parsing would be completely mined out, with all of the rough edges filed off, if not pounded completely flat. Something that showed up recently in my logfiles, coupled with what I've seen recently in print, suggests that the topic still has some unmapped pitfalls.

Here's the familiar drill: To parse a "common log format" file (assuming you're doing it yourself), the conventional wisdom says to write:

    while (<>) {
        my ($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes) =
            /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/;
        ...

Yawn. Been there, done that, right?

Maybe not. Let's take a second look at $auth_user. Unless you're using basic authentication to password protect pages, you'll see this in your logs as '-'. No problem there. And if you are using basic authentication, you'll see a username. No problem there... unless the username contains whitespace, at which point the regexp fails to match. And since there's no check to see if it fails...
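
To make the failure concrete, here is a minimal sketch that runs the conventional pattern (written with balanced parentheses) over two log lines and counts the ones that fail to match. The log lines are made up for illustration, and the variable names are my own, not from any particular log analysis tool.

```perl
use strict;
use warnings;

# The conventional pattern, with \S+ for the auth_user field.
my $clf = qr/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/;

# Hypothetical log lines: one ordinary username, one with embedded spaces.
my @lines = (
    '127.0.0.1 - dws [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - d w s [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
);

my (@parsed, @unparsed);
for my $line (@lines) {
    # A list assignment in boolean context is true only if the match succeeded.
    if (my @fields = $line =~ $clf) {
        push @parsed, $fields[2];    # third capture is auth_user
    }
    else {
        push @unparsed, $line;       # report these instead of silently skipping
    }
}

printf "parsed: %d, unparsed: %d\n", scalar @parsed, scalar @unparsed;
```

Without the else branch, the second line would simply leave every variable undefined, which is the silent failure described above.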

But can a username contain whitespace? Let's see.

    % htpasswd .htpasswd 'd w s'
    New password:
    Re-type new password:
    Adding password for user d w s
    %

D'oh! RFC1945 says you aren't supposed to be able to do this! (Update: RFC1945 is obsolete. RFC2617 suggests that embedded spaces are OK. Hm...)

The simple solution would seem to be "So, don't do that!", but here's where things get stranger. I've recently seen a case where somebody apparently presented an Authorization: header to a non-protected resource on my site, resulting in a bogus name appearing in the logs. I say "apparently" because I've been able to duplicate the behavior, and I can't think of any other way for the bogus username to have appeared. This is either a minor annoyance or the basis for a crude denial-of-accurate-service attack against log analysis software.

Fortunately, the solution is straightforward. All you have to do is change

    /^(\S+) (\S+) (\S+) \[ ...

to

    /^(\S+) (\S+) (.+) \[ ...

This makes the regex much less efficient, since it's going to backtrack to match '[', but it will resolve the problem, even if someone forges the username "dws [".
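
As a rough sketch of the change in use, the routine below applies the (.+) version of the pattern and returns nothing when a line fails to match, so the caller can tell. The name parse_clf_line, the hash keys, and the sample log lines are my own inventions for illustration.

```perl
use strict;
use warnings;

# Same pattern as before, but auth_user is (.+) so it may contain spaces.
my $clf_fixed = qr/^(\S+) (\S+) (.+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/;

my @names = qw(host ident_user auth_user date time time_zone
               method url protocol status bytes);

# Hypothetical helper: returns a hash ref of fields, or undef on no match.
sub parse_clf_line {
    my ($line) = @_;
    my @f = $line =~ $clf_fixed or return;
    my %rec;
    @rec{@names} = @f;
    return \%rec;
}

# Made-up lines: a username with spaces, and a forged name ending in " [".
for my $line (
    '127.0.0.1 - d w s [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326',
    '127.0.0.1 - dws [ [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326',
) {
    my $rec = parse_clf_line($line);
    if ($rec) {
        print "auth_user: '$rec->{auth_user}'\n";
    }
    else {
        warn "unparsed line: $line\n";
    }
}
```

Because (.+) is greedy, the engine gives back characters from the right and settles on the last " [" that lets the rest of the pattern match, which is why the forged name keeps its trailing bracket.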


In reply to A rare, insidious logfile parsing pitfall by dws
