
regexp matching bad stuff ...

by raiten (Acolyte)
on Oct 09, 2009 at 14:32 UTC ( #800290=perlquestion )
raiten has asked for the wisdom of the Perl Monks concerning the following question:


I'm trying to dispatch lines from an Apache log depending on a regexp, and I have a problem: strings that don't match my regexp seem to end up in the matched output ...

example command-line:
$ cat access_log | perl -pe 'if (s/^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*"(GET|POST|HEAD) (.*?) HTTP\/.*$/$1,$3/) { print } else { print STDERR }' 2>out.err > out.csv

results in STDOUT (out.csv):
[...]
,/
,/
- - [04/Oct/2009:00:13:42 +0200] "-" 408 - "-" "-"
- - [04/Oct/2009:00:13:42 +0200] "-" 408 -
- - [04/Oct/2009:00:13:47 +0200] "CONNECT 203.188.201.253:25 HTTP/1.1" 404 516 "-" "-"
- - [04/Oct/2009:00:13:47 +0200] "CONNECT 203.188.201.253:25 HTTP/1.1" 404 516
,//phpMyAdmin/
[...]

STDERR contains only what it should (the lines that don't match the regexp)

corresponding part of the source file:
- - [03/Oct/2009:17:21:43 +0200] "GET / HTTP/1.0" 404 516 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; +o m/help/us/ysearch/slurp)"
- - [03/Oct/2009:17:21:43 +0200] "GET / HTTP/1.0" 404 516
- - [04/Oct/2009:00:13:42 +0200] "-" 408 - "-" "-"
- - [04/Oct/2009:00:13:42 +0200] "-" 408 -
- - [04/Oct/2009:00:13:47 +0200] "CONNECT 203.188.201.253:25 HTTP/1.1" 404 516 "-" "-"
- - [04/Oct/2009:00:13:47 +0200] "CONNECT 203.188.201.253:25 HTTP/1.1" 404 516
- - [04/Oct/2009:00:26:17 +0200] "GET //phpMyAdmin/ HTTP/1.1" 404 516 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
- - [04/Oct/2009:00:26:17 +0200] "GET //phpMyAdmin/ HTTP/1.1" 404 516

Has anyone encountered a similar bug? Or is it my regexp? It seems hard to believe that it matched the CONNECT line ...
Normally, out.csv should contain only CSV lines.

Best regards

Replies are listed 'Best First'.
Re: regexp matching bad stuff ...
by ikegami (Pope) on Oct 09, 2009 at 14:36 UTC

    Replace -p (which unconditionally prints every line) with -n (which doesn't)

    While you're at it, replace
    cat access_log | perl ...
    with
    perl ... < access_log

Re: regexp matching bad stuff ...
by kennethk (Abbot) on Oct 09, 2009 at 14:47 UTC
    According to perlrun:


    -p causes Perl to assume the following loop around your program, which makes it iterate over filename arguments somewhat like sed:

       LINE:
       while (<>) {
           ... # your program goes here
       } continue {
           print or die "-p destination: $!\n";
       }

    If a file named by an argument cannot be opened for some reason, Perl warns you about it, and moves on to the next file. Note that the lines are printed automatically. An error occurring during printing is treated as fatal. To suppress printing use the -n switch. A -p overrides a -n switch.

    BEGIN and END blocks may be used to capture control before or after the implicit loop, just as in awk.

    And so, accordingly, you should be using the -n switch (wraps your script in a loop w/o the print statement) in place of the -p.
Re: regexp matching bad stuff ...
by ikegami (Pope) on Oct 09, 2009 at 15:04 UTC

    Earlier I recommended

    perl -ne'if (s/.../.../) { print } else { print STDERR }' <access_log 2>out.err >out.csv

    What I had in mind was

    perl -ne'print { s/.../.../ ? STDOUT : STDERR } $_' <access_log 2>out.err >out.csv

    It just occurred to me that it can be shortened to

    perl -pe'select s/.../.../ ? STDOUT : STDERR' <access_log 2>out.err >out.csv
Re: regexp matching bad stuff ...
by Fletch (Chancellor) on Oct 09, 2009 at 15:08 UTC

    Not a direct solution, but perhaps consider using something like Logfile::Access or Regexp::Log::Common rather than reinventing this particular wheel.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Node Type: perlquestion [id://800290]
Approved by moritz