PerlMonks  

regexp matching bad stuff ...

by raiten (Acolyte)
on Oct 09, 2009 at 14:32 UTC ( #800290=perlquestion )
raiten has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm trying to dispatch some contents from an Apache log depending on a regexp, and I have a problem. It seems that strings which don't match my regexp end up in the matched output ...

example command-line:
$ cat access_log | perl -pe 'if (s/^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*"(GET|POST|HEAD) (.*?) HTTP\/.*$/$1,$3/) { print } else { print STDERR }' 2>out.err > out.csv

results in STDOUT (out.csv):
[...]
72.30.161.243,/
72.30.161.243,/
125.224.206.168 - - [04/Oct/2009:00:13:42 +0200] "-" 408 - "-" "-"
125.224.206.168 - - [04/Oct/2009:00:13:42 +0200] "-" 408 -
125.224.206.168 - - [04/Oct/2009:00:13:47 +0200] "CONNECT 203.188.201.253:25 HTTP/1.1" 404 516 "-" "-"
125.224.206.168 - - [04/Oct/2009:00:13:47 +0200] "CONNECT 203.188.201.253:25 HTTP/1.1" 404 516
96.243.255.188,//phpMyAdmin/
[...]

STDERR has only valid contents (not matching regexp)

corresponding part of the source file:
72.30.161.243 - - [03/Oct/2009:17:21:43 +0200] "GET / HTTP/1.0" 404 516 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"
72.30.161.243 - - [03/Oct/2009:17:21:43 +0200] "GET / HTTP/1.0" 404 516
125.224.206.168 - - [04/Oct/2009:00:13:42 +0200] "-" 408 - "-" "-"
125.224.206.168 - - [04/Oct/2009:00:13:42 +0200] "-" 408 -
125.224.206.168 - - [04/Oct/2009:00:13:47 +0200] "CONNECT 203.188.201.253:25 HTTP/1.1" 404 516 "-" "-"
125.224.206.168 - - [04/Oct/2009:00:13:47 +0200] "CONNECT 203.188.201.253:25 HTTP/1.1" 404 516
96.243.255.188 - - [04/Oct/2009:00:26:17 +0200] "GET //phpMyAdmin/ HTTP/1.1" 404 516 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
96.243.255.188 - - [04/Oct/2009:00:26:17 +0200] "GET //phpMyAdmin/ HTTP/1.1" 404 516

Has anyone encountered a similar bug? Or is it my regexp? It seems hard to believe that it matched the CONNECT line ...
Normally, out.csv should contain only CSV lines.

thanks
Best regards

Re: regexp matching bad stuff ...
by ikegami (Pope) on Oct 09, 2009 at 14:36 UTC

    Replace -p (which unconditionally prints every line) with -n (which doesn't)

    While you're at it, replace
    cat access_log | perl ...
    with
    perl ... < access_log

Re: regexp matching bad stuff ...
by kennethk (Monsignor) on Oct 09, 2009 at 14:47 UTC
    According to perlrun:

    -p

    causes Perl to assume the following loop around your program, which makes it iterate over filename arguments somewhat like sed:

       LINE:
         while (<>) {
             ... # your program goes here
         } continue {
             print or die "-p destination: $!\n";
         }

    If a file named by an argument cannot be opened for some reason, Perl warns you about it, and moves on to the next file. Note that the lines are printed automatically. An error occurring during printing is treated as fatal. To suppress printing use the -n switch. A -p overrides a -n switch.

    BEGIN and END blocks may be used to capture control before or after the implicit loop, just as in awk.

    And so, accordingly, you should be using the -n switch (wraps your script in a loop w/o the print statement) in place of the -p.
Re: regexp matching bad stuff ...
by ikegami (Pope) on Oct 09, 2009 at 15:04 UTC

    Earlier I recommended

    perl -ne'if (s/.../.../) { print } else { print STDERR }' <access_log 2>out.err >out.csv

    What I had in mind was

    perl -ne'print { s/.../.../ ? STDOUT : STDERR } $_' <access_log 2>out.err >out.csv

    It just occurred to me that it can be shortened to

    perl -pe'select s/.../.../ ? STDOUT : STDERR' <access_log 2>out.err >out.csv
Re: regexp matching bad stuff ...
by Fletch (Chancellor) on Oct 09, 2009 at 15:08 UTC

    Not a direct solution, but perhaps consider using something like Logfile::Access or Regexp::Log::Common rather than reinventing this particular wheel.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.
