PerlMonks  

Acolyte needs regex help

by Dungan (Initiate)
on Jun 23, 2000 at 18:12 UTC (#19584=perlquestion)

Dungan has asked for the wisdom of the Perl Monks concerning the following question:

Great masters, I have worked on this small piece of code for a while and have started to pull my hair out... soon I will actually look like one of the bald monks. I am trying to parse extended log files into variables that I will then pass to a SQL database. For now the problem is the regex I am using. Here is a sample record:
209.19.170.94 - - [21/Jun/2000:00:06:04 -0400] "GET /ob/html/meet.html HTTP/1.1" 200 5933 "http://www.stuff.com/ob/" "Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)"
(all one continuous line) Here is my code:
    #! /usr/bin/perl -w
    open(FILE, "/logs/000604access.log");
    while (<FILE>) {
        ($client,$identuser,$authuser,$date,$tz,$method,$url,$protocol,$status,$bytes,$refer,$platform,$extendedinfo) =
            /^(\S+) ( \S+) (\S+) \[(\S+) (\S+)\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(\S+)" "(\S) (.*?)"$/;
        ### do some schmacity with variables
    }
    close FILE;
    exit;
My problem is that it tells me all my variables are uninitialized, and if I print them they are all null values.

Replies are listed 'Best First'.
Re: Acolyte needs regex help
by davorg (Chancellor) on Jun 23, 2000 at 18:42 UTC
Re: Acolyte needs regex help
by Corion (Pope) on Jun 23, 2000 at 18:24 UTC

    From what I see at first glance, you seem to be missing a + on the last (\S) sequence.

    I always debug my REs by deleting parts of the pattern until it matches again. Then I add the removed parts back in, one step at a time, until the match breaks.
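A minimal sketch of that debugging approach, automated: the pattern is built up one piece at a time and the first piece that breaks the match is reported. The breakdown into `@pieces` and the helper name `first_failing_piece` are assumptions made for illustration, not part of the original post. Note that the break usually shows up one piece after the faulty one, because a too-short capture only fails once the next literal is required.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample record from the question, written as one line.
my $line = '209.19.170.94 - - [21/Jun/2000:00:06:04 -0400] '
         . '"GET /ob/html/meet.html HTTP/1.1" 200 5933 '
         . '"http://www.stuff.com/ob/" '
         . '"Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)"';

# Hypothetical piece-by-piece breakdown of the pattern from the question.
my @pieces = (
    '(\S+)', ' (\S+)', ' (\S+)',       # client, identuser, authuser
    ' \[(\S+)', ' (\S+)\]',            # date, tz
    ' "(\S+)', ' (\S+)', ' (\S+)"',    # method, url, protocol
    ' (\S+)', ' (\S+)',                # status, bytes
    ' "(\S+)"',                        # refer
    ' "(\S)',                          # platform (no + here)
    ' (.*?)"$',                        # extendedinfo
);

# Return the index of the first piece that breaks the match, or -1
# if the whole pattern matches.
sub first_failing_piece {
    my ($text, @parts) = @_;
    my $pattern = '^';
    for my $i (0 .. $#parts) {
        $pattern .= $parts[$i];
        return $i unless $text =~ /$pattern/;
    }
    return -1;
}

my $bad = first_failing_piece($line, @pieces);
print $bad < 0
    ? "all pieces match\n"
    : "match breaks at piece $bad: $pieces[$bad]\n";
```

When the break is reported, the piece just before it is the first place to look.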

    In your case, you might want to rewrite the code so that it logs "bad" lines into a separate file (untested code!):

    Update: Shendal noted below that I didn't fix the problem with the RE, and that's correct. I have now modified the RE so that it should match :)

    #!/usr/bin/perl -w
    use strict;

    my $filename = "access.log";
    my ($client,$identuser,$authuser,$date,$tz,$method,$url,$protocol,
        $status,$bytes,$refer,$platform,$extendedinfo);

    open( FILE, "< $filename" ) or die( "Couldn't open $filename : $!\n" );
    while (<FILE>) {
      if ( /^(\S+) (\S+) (\S+) \[(\S+) (\S+)\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(\S+)" "(\S+) (.*?)"$/ ) {
        ($client,$identuser,$authuser,$date,$tz,$method,$url,$protocol,
         $status,$bytes,$refer,$platform,$extendedinfo) =
          ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13);
      } else {
        print "Unmatched log entry : $_";
      };
    };
    close FILE;

      In your (admittedly untested) code, you didn't fix the \S+ in your regex. Wasn't that the point? :-)
      Thanks much for the help. This is my final code:

      #! /usr/bin/perl -w
      use strict;

      my $goodFILE = "/logs/goodstring.txt";
      my $badFILE  = "/logs/badstring.txt";
      my $logFILE  = "/logs/processing/access.20000622";

      open(GOOD, ">$goodFILE") || die "cannot create $goodFILE : $!\n";
      open(BAD, ">$badFILE") || die "cannot create $badFILE : $!\n";
      open(FILE, "$logFILE");

      while (<FILE>) {
        if ( /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) (\S+)\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(.*?)" "(.*?)"$/ ) {
          my ($client,$identuser,$authuser,$date,$time,$tz,$method,$url,
              $protocol,$status,$bytes,$refer,$platform) =
            ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13);
          print GOOD "client : $client\n";
          print GOOD "userid : $identuser\n";
          print GOOD "authuser : $authuser\n";
          print GOOD "date : $date\n";
          print GOOD "time : $time\n";
          print GOOD "time zone : $tz\n";
          print GOOD "method : $method\n";
          print GOOD "URL : $url\n";
          print GOOD "protocol : $protocol\n";
          print GOOD "status : $status\n";
          print GOOD "bytes : $bytes\n";
RE: Acolyte needs regex help
by chromatic (Archbishop) on Jun 23, 2000 at 20:10 UTC
    I bet you could do this with split. I don't see many people taking advantage of its regex engine:
    ($client,$identuser,$authuser,$date,$tz,$method,
     $url,$protocol,$status,$bytes,$refer,$platform,$extendedinfo) =
        split(/[\[\]"]?\s[\[\]"]?/, $_, 13);
    Hmm, it leaves a quote sign on the extended info. Pretty close, though.
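A minimal sketch of the split approach with the leftover quote cleaned up. `split` takes the pattern first, the string second and the field limit third. The helper name `split_log_line` is an assumption made for illustration, and the sketch assumes no field other than the last one contains embedded whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one log line into 13 fields; returns an empty list if the
# line does not yield exactly 13 fields.
sub split_log_line {
    my ($line) = @_;
    chomp $line;
    # Separator: one whitespace character, optionally flanked by a
    # bracket or a double quote on either side.
    my @fields = split /[\[\]"]?\s[\[\]"]?/, $line, 13;
    return unless @fields == 13;
    $fields[-1] =~ s/"$//;    # drop the closing quote left on the last field
    return @fields;
}

my $sample = '209.19.170.94 - - [21/Jun/2000:00:06:04 -0400] '
           . '"GET /ob/html/meet.html HTTP/1.1" 200 5933 '
           . '"http://www.stuff.com/ob/" '
           . '"Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)"' . "\n";

my @f = split_log_line($sample);
print "client=$f[0] url=$f[6] extendedinfo=$f[12]\n" if @f;
```

Unlike a full regex, this does not validate the line beyond counting fields, so malformed records with the right number of separators would slip through.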
Re: Acolyte needs regex help
by Ted Nitz (Chaplain) on Jun 23, 2000 at 23:27 UTC
    I've found that when writing complex regexes it helps to use /x and to say exactly what you want. This may be slightly incorrect: I don't have a Perl interpreter to check the syntax with, nor do I have any of my regex references, since I'm at work. But here it is:
    /
      (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})  # IP Address
      \s+
      (\S+)             # ident user
      \s+
      (\S+)             # auth user
      \s+
      \[                # Square brace
      ([^\]]+)          # date
      \s                # space
      ([^\]]+)          # time zone
      \]                # The closing square brace
      \s+
      "                 # Opening quote
      ((?:[^\\] |       # method
          \\.)+)
      \s
      ((?:[^\\] |       # url
          \\.)+)
      \s
      ((?:[^\\] |       # protocol
          \\.)+)
      "                 # Closing quote
      \s+
      (\d+)             # status
      \s+
      (\d+)             # bytes
      \s+
      "
      ((?:[^\\] |       # refer
          \\.)+)
      "
      \s+
      "
      (?:[^\\] |        # platform
         \\.)+
      \s
      (?:[^\\] |        # extended info
         \\.)+
      "
    /x
    Good luck. I'm not sure you really want a regex; other people have given pretty good ideas. I just figured I could throw in a more complete regex. It's probably a lot more sparse than necessary.
    -Ted
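For comparison, a minimal sketch of a commented /x pattern that does match the sample record from the question. It is not Ted's pattern: it is a simplified variant that assumes quoted fields never contain an escaped quote, captures the user agent as a single field, and accepts "-" for the byte count. The names `$log_re` and `parse_log_line` are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $log_re = qr/
    ^
    (\S+)                  # client address
    \s+ (\S+)              # ident user
    \s+ (\S+)              # auth user
    \s+ \[ ([^\]\s]+)      # date and time
    \s+ ([^\]]+) \]        # time zone
    \s+ " (\S+)            # method
    \s+ (\S+)              # url
    \s+ (\S+) "            # protocol
    \s+ (\d{3})            # status
    \s+ (\d+|-)            # bytes, or a dash when nothing was sent
    \s+ " ([^"]*) "        # referer
    \s+ " ([^"]*) "        # user agent, platform and extended info together
    \s* $
/x;

# Return the 12 captured fields, or an empty list if the line does not match.
sub parse_log_line {
    my ($line) = @_;
    my @fields = $line =~ $log_re or return;
    return @fields;
}

my $sample = '209.19.170.94 - - [21/Jun/2000:00:06:04 -0400] '
           . '"GET /ob/html/meet.html HTTP/1.1" 200 5933 '
           . '"http://www.stuff.com/ob/" '
           . '"Mozilla/4.0 (compatible; MSIE 5.0; Mac_PowerPC)"' . "\n";

my @f = parse_log_line($sample);
print @f ? "status=$f[8] agent=$f[11]\n" : "no match\n";
```

Keeping each field on its own line under /x makes it easy to comment out one piece at a time when a record fails to match.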
Re: Acolyte needs regex help
by btrott (Parson) on Jun 23, 2000 at 20:20 UTC
    I don't know what webserver you're using, but if you're using Apache/mod_perl, you can set up a PerlLogHandler to automatically log to your database, rather than going through the intermediate step of parsing the logfile.

    I've got a tutorial up here, Web Logs using DBI, and merlyn has a WebTechniques column about this.

RE: Acolyte needs regex help
by flyfishin (Monk) on Jun 23, 2000 at 20:50 UTC
    Just a non-RE note. The -w is telling you, actually warning you, that you haven't declared the variables before using them. It is just a warning and doesn't prevent the script from running.

    Update:
    As Adam pointed out the -w is always good to use. So is use strict. I should have added that since you don't declare the variables before using them, their values will be set to null. Since the RE doesn't assign any values to the variables, they get the null value assigned to them.
      Yes, but if you look at the last line of the post: "my problem is that it tells me all my variables are uninitialized and if I print them they are all null values." The problem is that they are null values. This is a great example of why you should use -w: it led the programmer to the heart of the problem very quickly.
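A minimal sketch of why the variables end up empty: a failed match in list context returns an empty list, so every variable on the left-hand side stays undefined, and printing them under -w produces the "uninitialized value" warnings. The helper name `try_match` and the shortened test string are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a two-capture pattern against the text and return both captures.
# On a failed match both come back undefined.
sub try_match {
    my ($text, $re) = @_;
    my ($platform, $info) = $text =~ $re;
    return ($platform, $info);
}

my $text = 'Mozilla/4.0 (compatible)';

# (\S) without + captures a single character, so the space that must
# follow it is never found and the whole match fails.
my ($p1, $i1) = try_match($text, qr/^(\S) (.*)$/);
print defined $p1 ? "matched\n" : "no match: variables are undef\n";

# Testing the result of the match before using the variables avoids
# the warnings and makes the failure visible.
if (my ($p2, $i2) = $text =~ /^(\S+) (.*)$/) {
    print "platform=$p2 info=$i2\n";
}
```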
