
splitting a string that appears inconsistently in structure

by TheGorf (Novice)
on Jan 02, 2009 at 00:45 UTC (#733688=perlquestion)

TheGorf has asked for the wisdom of the Perl Monks concerning the following question:

OK, thanks to everyone's recommendation here of Apache::LogParse, I have managed to build quite an effective log-parsing tool.


The logfiles all have the request part of the logfile in quotes as one string. So it looks like this normally:

GET /some/path.php?somevalue=eddie HTTP/1.1

but the problem is that it doesn't ALWAYS look like that. Somehow I need to split that string into the method, the request (before the ?), the URI query (after the ?), and the protocol version. If I just split it based on the string above, I run into "Use of uninitialized value in concatenation (.) or string", because the variables I split into end up undefined whenever the string is missing a piece. Sometimes it's lacking the protocol version, sometimes there isn't a URI query, and sometimes the whole string is just one clump of characters due to some goofy attempt against the web server.

So I'm hoping someone can offer advice on how I can do this so that if the string doesn't contain all four components, the variable just gets set to "" or something.

So far I am stuck just doing something like this:

sub split_request( $ ) {
    my $http_request = $_[0];
    my $method = "";
    my $web_request = "";
    my $request = "";
    my $uri = "";
    my $version = "";
    ($method,$web_request) = split( / /,$http_request, 2 );
    ($request,$version) = split( / /,$web_request, 2 );
    ($request,$uri) = split( /\?/,$request );
}


Replies are listed 'Best First'.
Re: splitting a string that appears inconsistently in structure
by kyle (Abbot) on Jan 02, 2009 at 02:41 UTC

    First of all, it's usually a bad idea to use prototypes. See Far More Than Everything You've Ever Wanted to Know about Prototypes in Perl.

    Second, it might help if you could show some examples of the strings you're trying to deal with.

    In Perl 5.10, there's a "defined or" operator, "//". You can say this:

    $method //= '';

    That will set $method to the empty string if it's not defined. In earlier versions, you'd have to do this:

    $method = '' if ! defined $method;

    You could put this in a sub and use it like this...

    defined_or( $method, '' ); # changes the caller's variable

    sub defined_or {
        if ( ! defined $_[0] ) {
            $_[0] = $_[1];
        }
    }
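    Applied to the original question, the defined-or idea might look like the sketch below. The sub name split_with_defaults and its four return values are made up for illustration, and it assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;
use 5.010;   # the // and //= operators need Perl 5.10 or later

# Hypothetical helper (not from the thread): split a request string and
# default every missing piece to the empty string.
sub split_with_defaults {
    my ($http_request) = @_;
    my ($method,  $rest)    = split / /,  $http_request,    2;
    my ($request, $version) = split / /,  ($rest    // ''), 2;
    my ($path,    $query)   = split /\?/, ($request // ''), 2;

    # "for" aliases each variable, so the assignment changes the originals.
    $_ //= '' for $method, $path, $query, $version;

    return ($method, $path, $query, $version);
}

print join('|', split_with_defaults('GET /home/eval_load.cgi?50')), "\n";
```

    Every field that comes back is defined, so later concatenation should not trigger the "uninitialized" warning.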

    You could probably cook up a pattern that would match all the things you want to match, but you'd still end up with variables set to undef when you're done.
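    As a rough illustration of that point, here is one possible pattern with optional groups. The pattern and the name parse_request are assumptions, not something taken from the logs in question. Groups that don't take part in the match come back as undef, so the sub still has to default them.

```perl
use strict;
use warnings;

# Assumed shape: method, then optional path, optional ?query, optional version.
my $request_re = qr{
    ^ (\S+)                 # method, or the whole clump if nothing else is there
    (?: \s+ ([^\s?]*)       # path, up to the first question mark or whitespace
        (?: \? (\S*) )?     # query string, if any
    )?
    (?: \s+ (\S+) )?        # protocol version, if any
    \s* $
}x;

# Hypothetical helper: returns (method, path, query, version), or an
# empty list when the string does not fit the pattern at all.
sub parse_request {
    my ($line) = @_;
    my @parts = $line =~ $request_re or return;
    # Unmatched optional groups are undef; turn them into empty strings.
    return map { defined $_ ? $_ : '' } @parts;
}

print join('|', parse_request('GET /core_level.cgi?core=1 HTTP/1.1')), "\n";
```

    The map uses plain defined rather than //, so this version should also run on Perls older than 5.10.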

    If you just want to suppress the warnings, you can say:

    no warnings 'uninitialized';

    It's good to restrict that to the smallest possible scope, however, so it's probably better to make sure everything is defined. See perllexwarn for the details.
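    A small sketch of what "smallest possible scope" can mean in practice: the pragma is lexical, so wrapping it in a bare block limits it to the one statement that needs it. The sub name format_fields is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical formatter: undefined fields simply print as empty strings.
sub format_fields {
    my @fields = @_;
    my $line;
    {
        # The pragma is lexical, so it ends at the closing brace of this block.
        no warnings 'uninitialized';
        $line = join '|', @fields;
    }
    return $line;   # warnings are fully active again here
}

print format_fields('GET', '/', undef, undef), "\n";
```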

Re: splitting a string that appears inconsistently in structure
by BUU (Prior) on Jan 02, 2009 at 03:04 UTC
    Use the module URI to process the URI Component of the log file. This will give you accessors to the various elements you want, such as ->path and ->query.
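    A minimal sketch of that suggestion, assuming the URI distribution from CPAN is installed and that the middle field of the request has already been split off. The sub name path_and_query is made up for the example.

```perl
use strict;
use warnings;
use URI;   # CPAN module, not part of core Perl

# Hypothetical helper: return the path and query of a request target.
sub path_and_query {
    my ($target) = @_;
    my $u     = URI->new($target);
    my $path  = $u->path;
    my $query = $u->query;              # undef when there is no "?"
    $query = '' unless defined $query;
    return ($path, $query);
}

print join('|', path_and_query('/some/path.php?somevalue=eddie')), "\n";
```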
Re: splitting a string that appears inconsistently in structure
by fzellinger (Acolyte) on Jan 02, 2009 at 03:09 UTC
    Based on looking at Apache access log files for several years, I believe that we can rely on the following to be true (assuming we are using the default log format):
    1. The method is always present.
    2. The request URI is always present, and may or may not contain query params, but will never contain spaces
    3. The version may not be present.
    So, I propose that you split off the method+uri, treat the remainder as version and use the URI::Split module to break apart the URI:
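    use URI::Split qw(uri_split);  # uri_split is exported only on request

    sub split_request {
        my @parts=split(/ /,$_[0]);
        scalar(@parts)>=2 or die "Bad request '$_[0]'";
        my $method=shift @parts;
        my $uri=shift @parts;
        # Break the URI into its five components; all but $path may be undef
        my ($scheme, $auth, $path, $query, $frag) = uri_split($uri);
        # Whatever is left over is treated as the protocol version
        my $protover=join(' ',@parts);
        return ($method,$scheme,$auth,$path,$query,$frag,$protover);
    }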
      Unfortunately I don't find those all to be true at all. For reference, here are some examples of entries that I have:

      - - [09/Jan/2008:03:45:10 -0800] "GET /core_level.cgi?core=1 HTTP/1.1" 302 83 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)" ""
      (call this a normal-ish request)

      - - [09/Jan/2008:02:20:39 -0800] "GET /home/eval_load.cgi?50" 200 2 "-" "-" ""
      (no version)

      - - [10/Jan/2008:02:18:58 -0800] "GET /" 200 752 "-" "-" "10.16.1.3"
      (no version, no ?, and nothing after the ?)

      - - [19/Jan/2008:03:45:06 -0800] "GGG99994" 200 752 "-" "-" ""
      (here we have no method, no discernible request, and no version)

      Hence the need to figure out how to detect what is there.

        - - [19/Jan/2008:03:45:06 -0800] "GGG99994" 200 752 "-" "-" ""
        (here we have no method, no discernible request, and no version)

        Not exactly true. You have a method. It's just a really weird (and probably invalid) one. I'm not sure why your server would 200 it; I can only presume some slightly odd config.

        Take it in individual steps. First try splitting out into the 3 main pieces:

        my ($method, $uri, $proto, @extra) = split /\s+/, $request;
        die "Unexpected extra bits in request: @extra" if @extra > 0;
        die "No method" unless defined $method;
        # Or whatever other error-handling mechanism you want

        You shouldn't have any extra bits, because if you do, that means that your $method, $uri, $proto may not hold what you expect them to, so that needs error-checking.

        As well, you should have a method. The minimal possible HTTP request AFAIK would be a method of " ", with nothing else. That would leave all the vars undefined, and probably isn't something you care about anyway, so another error there.

        The protocol may not be there. But expect that in higher level code, or defined-or it to an empty string here if you prefer.

        That leaves the URI. Using URI::Split as suggested above in Re: splitting a string that appears inconsistently in structure would be better than trying to split it up manually. Imagine, for instance, the case of having a '?' in the password; a simple regexp would give you a wrong answer then.

        Note that the $uri can be undefined. A request of just "GET " is interpreted as "GET /" (similarly with POST), and would leave $uri undefined after that split. You probably want to make sure it's defined (as an empty string in this case) before you pass it to uri_split(). The URI::Split docs say:

        The $path part is always present (but can be the empty string) and is thus never returned as "undef".
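        One way to sketch that guard, assuming the URI distribution from CPAN is installed. The sub name split_uri_safely is invented here; it defaults an undefined $uri before the call and then turns any undef components into empty strings.

```perl
use strict;
use warnings;
use URI::Split qw(uri_split);   # from the CPAN URI distribution

# Hypothetical wrapper: always returns five defined strings
# (scheme, authority, path, query, fragment).
sub split_uri_safely {
    my ($uri) = @_;
    $uri = '' unless defined $uri;   # e.g. the request had a method only
    my ($scheme, $auth, $path, $query, $frag) = uri_split($uri);
    # $path is never undef according to the docs; the other four can be.
    return map { defined $_ ? $_ : '' } ($scheme, $auth, $path, $query, $frag);
}

print join('|', split_uri_safely('/core_level.cgi?core=1')), "\n";
```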

        So take care not to blow up if it's empty.

        - - [19/Jan/2008:03:45:06 -0800] "GGG99994" 200 752 "-" "-" ""

        Is that a real line from the log? The status code is 200 OK, so it looks like your Apache successfully handled this request, though it shouldn't have.

        I think it may be a good idea to handle malformed requests separately.
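        A rough sketch of what handling them separately could look like: sort requests into two buckets before any further parsing. The list of accepted methods and the name classify_request are assumptions; adjust them to whatever you consider valid.

```perl
use strict;
use warnings;

# Assumption: these are the only methods you want to treat as valid.
my %known_method = map { $_ => 1 }
    qw(GET HEAD POST PUT DELETE OPTIONS TRACE CONNECT);

# Hypothetical classifier: looks only at the first whitespace-separated token.
sub classify_request {
    my ($request) = @_;
    my ($method) = split ' ', (defined $request ? $request : '');
    return (defined $method && $known_method{$method}) ? 'wellformed' : 'malformed';
}

my (@good, @bad);
for my $req ('GET /core_level.cgi?core=1 HTTP/1.1', 'GET /', 'GGG99994') {
    if (classify_request($req) eq 'wellformed') { push @good, $req }
    else                                        { push @bad,  $req }
}

print scalar(@good), " well-formed, ", scalar(@bad), " malformed\n";
```

        The malformed bucket can then be logged or counted on its own, and only the well-formed requests go on to the method/URI/version split.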

        What's the format string for that log?

Node Type: perlquestion [id://733688]
Approved by ikegami