PerlMonks  

Cisco Log Files: broken REGEX

by blue_cowdawg (Monsignor)
on Aug 21, 2003 at 23:23 UTC

blue_cowdawg has asked for the wisdom of the Perl Monks concerning the following question:

Given the following sample log file line:

Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet

I am trying to capture the information in the lines to check for possible virus infestations. I tried using the regex

m@^([A-Z][a-z]+\s+\d+\s+\d+\:\d+\:\d+)\s+([\.\d]+)\s+(\d+)\:\s+([A-Z][a-z]+\s+\d+\s+\d+\:\d+\:\d+)\s+([A-Z]{3})\:\s+\%SEC\-6\-[A-Z]+\:\s+list\s+\d+([a-z]+)\s+([a-z]+)\s+(\d+\.\d+\.\d+\.\d+)\s+\-\>\s+(\d+\.\d+\.\d+\.\d+)\s+\(\d+\/\d+\)\,\s+(\d)\s+packet$@

I know I am going brain dead right now, but can anybody spot anything glaringly obvious with this that is wrong?


Peter @ Berghold . Net

Sieze the cow! Bite the day!

Nobody expects the Perl inquisition!

Test the code? We don't need to test no stinkin' code!
All code posted here is as is where is unless otherwise stated.

Brewer of Belgian style Ales

Replies are listed 'Best First'.
Re: Cisco Log Files: broken REGEX
by chromatic (Archbishop) on Aug 21, 2003 at 23:41 UTC

    What's glaringly obvious is that it could be more maintainable. Breaking it into separate parts could help.

    my $timestamp = qr/[A-Z][a-z]+\s+\d+\s+\d+\:\d+\:\d+/;
    my $address   = qr/[\.\d]+/;
    my $id        = qr/\d+/;
    my $timezone  = qr/[A-Z]+:/;
    # and so on
Re: Cisco Log Files: broken REGEX
by chunlou (Curate) on Aug 22, 2003 at 00:03 UTC
    It doesn't hurt to split up your regex for readability.
    $_='Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet';

    /
    ([A-Z][a-z]+\s+\d+\s+\d+\:\d+\:\d+)     # Aug 21 19:00:36
    \s+
    (\[\d+\.\d+\.\d+\.\d+\.\d+\.\d+\])      # [1.1.1.3.200.125]
    \s+
    (\d+:)                                  # 410381:
    \s+
    ([A-Z][a-z]+\s+\d+\s+\d+\:\d+\:\d+)     # Aug 21 23:00:35
    \s+
    ([A-Z]{3}:)                             # UTC:
    \s+
    (\%SEC-\d-\w+?:)                        # %SEC-6-IPACCESSLOGP:
    \s+
    (list\s\d+\s.*?)                        # list 101 denied tcp
    \s+
    (\d+\.\d+\.\d+\.\d+\(\d+\))             # 10.161.24.153(3988)
    \s+->\s
    (\d+\.\d+\.\d+\.\d+\(\d+\))             # 10.158.24.10(135)
    \s*,\s+
    (.*)                                    # 1 packet
    /x;

    print "$1\n$2\n$3\n$4\n$5\n$6\n$7\n$8\n$9\n$10";

    __END__
    Aug 21 19:00:36
    [1.1.1.3.200.125]
    410381:
    Aug 21 23:00:35
    UTC:
    %SEC-6-IPACCESSLOGP:
    list 101 denied tcp
    10.161.24.153(3988)
    10.158.24.10(135)
    1 packet
Re: Cisco Log Files: broken REGEX
by eric256 (Parson) on Aug 22, 2003 at 00:10 UTC
    use strict;

    my $data = "Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet";

    my $timestamp = qr/[A-Z][a-z]+ \d\d \d\d:\d\d:\d\d/;
    my $address   = qr/[\.\d]+/;
    my $id        = qr/\d+/;
    my $timezone  = qr/[A-Z]+/;

    #print $data;
    $data =~ /($timestamp) \[($address)\] ($id): ($timestamp) ($timezone): (.*?): (.*?) (tcp|icmp|udp) ($address\(.*?\)) -> ($address\(.*?\)), (.*)/;

    print "time: $1\n",
          "address: $2\n",
          "id: $3\n",
          "time2: $4\n",
          "time zone: $5\n",
          "error: $6\n",
          "msg: $7\n",
          "protocol: $8\n",
          "address1: $9\n",
          "address2: $10\n",
          "last: $11\n";
    1;

    Ick. Double Ick. and fragile.

    ___________
    Eric Hodges
Re: Cisco Log Files: broken REGEX (two solutions)
by BrowserUk (Patriarch) on Aug 22, 2003 at 01:12 UTC

    Not only does using /x make things a lot more readable, it also helps with debugging. By commenting out everything except the first element in the final regex, it allowed me to adjust that until it worked for all (both:) test lines. Then I uncommented the next element and adjusted that and so on until the whole thing matched.

    Using named sub-elements allows you to re-use those bits where necessary, and would simplify adding in predefined elements like a better IP definition from Regexp::Common or a datetime from somewhere.

    #! perl -slw
    use strict;

    my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;  # Aug 21 19:00:36
    my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;                            # [1.1.1.3.200.125]
    my $re_msgid    = qr[ \d{6} : ]x;                                           # 410381:
    my $re_TZ       = qr[ [A-Z]{3} : ]x;                                        # UTC:
    my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;                                  # %SEC-6-IPACCESSLOGP:
    my $re_listid   = qr[ list \s (\d+) ]x;                                     # list 101
    my $re_action   = qr[ [a-z]+ ]x;                                            # denied
    my $re_protocol = qr[ [a-z]+ ]x;                                            # tcp
    my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;                               # 10.161.24.153
    my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;                         # (3988) or (8/0)
    my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;                          # , 1 packet

    my $re_log = qr[
        ^
        ( $re_datetime ) \s+
        ( $re_MIB ) \s+
        ( $re_msgid ) \s+
        ( $re_datetime) \s+
        ( $re_TZ ) \s+
        $re_type \s+
        $re_listid \s+
        ( $re_action ) \s+
        ( $re_protocol ) \s+
        ( $re_ip ) \s* $re_port? \s+
        -> \s+
        ( $re_ip ) \s* $re_port?
        $re_packets \s*
        $
    ]x;

    while( <DATA> ) {
        print join'|', $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13
            if $_ =~ m[$re_log];
    }

    =pod output

    P:\test>285616
    Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
    Use of uninitialized value in join or string at P:\test\285616.pl8 line 37, <DATA> line 2.
    Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150||211.95.79.233|8/0|1

    =cut

    __DATA__
    Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
    Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet

    Note that the second line produces an "uninitialised value" warning. This is because that line has no port number after the first IP number, so the corresponding capture ($10) is left undefined. Every optional capture then has to be checked before it is used, which is a pain.
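
    A skipped optional group leaves its slot undefined rather than renumbering the groups after it, so one way to cope is to default the empty slots before joining. A minimal sketch, assuming Perl 5.10 or later for the // operator; $re_ip and $re_port are copied from the code above, while $tail and @fields are names made up for this example, and only the tail of the ICMP line is matched to keep it short.

    ```perl
    use strict;
    use warnings;

    my $re_ip   = qr[ \d+ (?: \. \d+ ){3} ]x;
    my $re_port = qr[ \( ( \d+ (?: / \d+ )? ) \) ]x;

    # Tail of the ICMP sample line: there is no port after the first address.
    my $tail = 'denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet';

    # In list context a successful match returns ($1, $2, $3, $4).
    my @fields = $tail =~ m[ ( $re_ip ) \s* $re_port? \s+ -> \s+ ( $re_ip ) \s* $re_port? ]x;

    # The skipped port group comes back as undef; swap in an empty string
    # so join() does not warn.
    my $out = join '|', map { $_ // '' } @fields;
    print "$out\n";
    ```

    This keeps the positional captures but moves the "is it defined?" check into one place instead of scattering it through the code that uses the fields.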

    The best way I know of to avoid all the conditionals and stuff required to deal with regexes that contain optional captures is to capture to named variables using the (?{ }) extended regex feature.

    #! perl -slw
    use strict;
    use re 'eval';

    # Declared before the regex so the (?{ }) blocks compile under strict.
    our( $first_date, $MIB, $msgID, $second_date, $TZ,
         $listID, $action, $protocol, $ip1, $port, $ip2, $port2, $packets );

    my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;  # Aug 21 19:00:36
    my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;                            # [1.1.1.3.200.125]
    my $re_msgid    = qr[ \d{6} : ]x;                                           # 410381:
    my $re_TZ       = qr[ [A-Z]{3} : ]x;                                        # UTC:
    my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;                                  # %SEC-6-IPACCESSLOGP:
    my $re_listid   = qr[ list \s (\d+) ]x;                                     # list 101
    my $re_action   = qr[ [a-z]+ ]x;                                            # denied
    my $re_protocol = qr[ [a-z]+ ]x;                                            # tcp
    my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;                               # 10.161.24.153
    my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;                         # (3988) or (8/0)
    my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;                          # , 1 packet

    my $re_log = qr[
        ^
        ( $re_datetime ) \s+   (?{ $first_date  = $^N||'' })
        ( $re_MIB ) \s+        (?{ $MIB         = $^N||'' })
        ( $re_msgid ) \s+      (?{ $msgID       = $^N||'' })
        ( $re_datetime) \s+    (?{ $second_date = $^N||'' })
        ( $re_TZ ) \s+         (?{ $TZ          = $^N||'' })
        $re_type \s+
        $re_listid \s+         (?{ $listID      = $^N||'' })
        ( $re_action ) \s+     (?{ $action      = $^N||'' })
        ( $re_protocol ) \s+   (?{ $protocol    = $^N||'' })
        ( $re_ip ) \s*         (?{ $ip1         = $^N||'' })
        $re_port? \s+          (?{ $port        = $^N||'' })
        -> \s+
        ( $re_ip ) \s*         (?{ $ip2         = $^N||'' })
        $re_port?              (?{ $port2       = $^N||'' })
        $re_packets \s*        (?{ $packets     = $^N||'' })
        $
    ]x;

    while( <DATA> ) {
        print join'|',
            $first_date, $MIB, $msgID, $second_date, $TZ, $listID,
            $action, $protocol, $ip1, $port, $ip2, $port2, $packets
            if $_ =~ m[$re_log];
    }

    =pod output

    P:\test>285616
    Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
    Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150|10.165.4.150|211.95.79.233|8/0|1

    =cut

    __DATA__
    Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
    Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet

    I like this because it avoids the capture variable shuffling, and if you start using this approach consistently, it becomes pretty much second nature to build regexes this way. The downsides are the "experimental" status of the "zero-width evaluation assertion" (Phew! What a handle:) and the need for use re 'eval';, both of which are frowned upon in some circles.
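
    For comparison, here is a sketch of the same capture-by-name idea using named captures, assuming Perl 5.10 or later (newer than this thread). (?<name>...) and the %+ hash are core features; $re_tail, @names and the capture names port1 and port2 are made up for this example, and the pattern covers only the tail of each line to keep it short. It needs neither (?{ }) nor use re 'eval'. It also sidesteps a quirk visible in the output above: for the ICMP line the first port slot repeats 10.165.4.150, because $^N holds the most recently closed group, which is still the address when the optional port group is skipped.

    ```perl
    use strict;
    use warnings;

    my $re_ip   = qr[ \d+ (?: \. \d+ ){3} ]x;
    my $re_port = qr[ \d+ (?: / \d+ )? ]x;

    # Each interesting piece gets a name instead of a position.
    my $re_tail = qr[
        (?<action>   [a-z]+ ) \s+
        (?<protocol> [a-z]+ ) \s+
        (?<ip1> $re_ip ) \s* (?: \( (?<port1> $re_port ) \) )? \s+
        -> \s+
        (?<ip2> $re_ip ) \s* (?: \( (?<port2> $re_port ) \) )?
        , \s+ (?<packets> \d+ ) \s+ packet
    ]x;

    my @names = qw( action protocol ip1 port1 ip2 port2 packets );
    my @lines = (
        'list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet',
        'list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet',
    );

    my @rows;
    for my $line (@lines) {
        next unless $line =~ $re_tail;
        # A named group that did not take part in the match is simply
        # absent from %+, so it defaults to an empty string here.
        push @rows, join '|', map { $+{$_} // '' } @names;
    }
    print "$_\n" for @rows;
    ```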


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Summary: parsing CISCO ACL logs (was Re: Cisco Log Files: broken REGEX)
by blue_cowdawg (Monsignor) on Aug 22, 2003 at 00:54 UTC

    First off, at the risk of sounding like one of them talking heads at an Academy Awards ceremony, I just want to thank everybody for their assistance with this thing. I was going nuts with it.

    Secondly: I always preach to folks that I teach Perl to that one of the first rules of dealing with data is make sure you understand the data before you try to parse it. I should have listened to my own sermons as I belatedly noticed that there were two different line formats depending on if it was a TCP denial or an ICMP denial.

    Thirdly, chunlou, enlil, chromatic and eric256 all suggested that I make my code more readable by using the qr construction. I heeded that advice and it contributed greatly to solving this, both because the result was more readable and because I ended up not re-typing the same regexes and fat-fingering them.

    First record type

    For the tcp deny the record looked like (just to review):

    Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet

    and so to look for it I set up the following:

    my $dtg=qr([A-Z][a-z]+\s+\d+\s+\d+:\d+:\d+);
    my $thingy=qr([\.\d]+);
    my $tz=qr([A-Z]{3});
    my $ipaddr=qr(\d+\.\d+\.\d+\.\d+);
    my $timestamp = qr/[A-Z][a-z]+ \d\d \d\d:\d\d:\d\d/;
    my $address = qr/[\.\d]+/;
    my $id = qr/\d+/;
    my $timezone = qr/[A-Z]+/;
    my $fragger = qr/(\%SEC-6-IPACCESSLOGP|\%SEC-6-IPACCESSLOGDP)/;
    my $tcp_deny=qr/^($dtg)\s\[$thingy\]\s\d+:\s($dtg)\s$tz:\s$fragger\:\slist\s(\d+)\sdenied\s(tcp|udp|icmp)\s($ipaddr)\(\d+\)\s\-\>\s($ipaddr)\(\d+\),\s(\d+)\spacket/;

    and I actually look for the packet thusly:
    if ( $line =~ m@$tcp_deny@ ) { ... more stuff below

    Second line format

    The second record type looked like:

    Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet

    which used:

    my $icmp_deny=qr/^($dtg)\s\[$thingy\]\s\d+:\s($dtg)\s$tz:\s$fragger\:\slist\s(\d+)\sdenied\s(tcp|udp|icmp)\s($ipaddr)\s\-\>\s($ipaddr)\s\(\d+\/\d+\),\s(\d+)\spacket/;
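
    The two patterns can then be tried in turn. What follows is only a sketch of that dispatch, not the finished script: the building blocks are the ones defined above, but $head (the shared prefix factored out) and the helper classify() are names invented for this example.

    ```perl
    use strict;
    use warnings;

    my $dtg     = qr([A-Z][a-z]+\s+\d+\s+\d+:\d+:\d+);
    my $thingy  = qr([\.\d]+);
    my $tz      = qr([A-Z]{3});
    my $ipaddr  = qr(\d+\.\d+\.\d+\.\d+);
    my $fragger = qr/(\%SEC-6-IPACCESSLOGP|\%SEC-6-IPACCESSLOGDP)/;

    # Shared prefix (hypothetical refactoring): captures $1..$5 are first
    # timestamp, second timestamp, message type, list number, protocol.
    my $head = qr/^($dtg)\s\[$thingy\]\s\d+:\s($dtg)\s$tz:\s$fragger:\slist\s(\d+)\sdenied\s(tcp|udp|icmp)\s/;

    my $tcp_deny  = qr/$head($ipaddr)\(\d+\)\s->\s($ipaddr)\(\d+\),\s(\d+)\spacket/;
    my $icmp_deny = qr/$head($ipaddr)\s->\s($ipaddr)\s\(\d+\/\d+\),\s(\d+)\spacket/;

    # Hypothetical helper: returns (format, protocol, source, destination),
    # or an empty list when neither format matches.
    sub classify {
        my ($line) = @_;
        return ( 'tcp-style',  $5, $6, $7 ) if $line =~ $tcp_deny;
        return ( 'icmp-style', $5, $6, $7 ) if $line =~ $icmp_deny;
        return;
    }

    my @lines = (
        'Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet',
        'Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet',
    );
    print join( '|', classify($_) ), "\n" for @lines;
    ```

    Factoring out the common prefix means a typo can only creep in once, which is the same benefit the qr building blocks gave in the first place.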

    Why bother?

    That, my fellow monks, is a tale to tell under Cool Uses for Perl once the script is all done and nice and tidy. It's a mess right now. Just a hint though: it has to do with all these virus attacks going on and how to find the infected machines...



Re: Cisco Log Files: broken REGEX
by RMGir (Prior) on Aug 21, 2003 at 23:27 UTC
    All I can see in a quick scan is you don't seem to be accounting for the square brackets around the MIB-like thing after the timestamp.

    Am I nuts?
    --
    Mike

      Good news: You were right and I missed the square braces.
      Bad news: It is still broke after I fixed it.

      Here's the new regex:

      m@^([A-Z][a-z]+\s+\d+\s+\d+\:\d+\:\d+)\s+\[([\.\d]+)\]\s+(\d+)\:\s+([A-Z][a-z]+\s+\d+\s+\d+\:\d+\:\d+)\s+([A-Z]{3})\:\s+\%SEC\-6\-[A-Z]+\:\s+list\s+\d+([a-z]+)\s+([a-z]+)\s+(\d+\.\d+\.\d+\.\d+)\s+\-\>\s+(\d+\.\d+\.\d+\.\d+)\s+\(\d+\/\d+\)\,\s+(\d)\s+packet$@



        Hehe, I figured that out myself once I whipped up a test bench.

        I still don't have it working, but I do have a few suggestions.

        Don't escape everything in sight, you'll go nuts. : and , don't need \, really.

        m@@x is your friend.

        Could you detect what you need to extract without matching the whole line? Note that ICMP and TCP have different "port" parts, so making a general regex is gonna bite.
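
        A sketch of what such a partial match might look like, assuming only the protocol and the two addresses are needed. The helper name extract_flow() is invented for this example; the pattern keys on the word "denied" and lets \S* swallow whatever "port" part, if any, is glued to the source address.

        ```perl
        use strict;
        use warnings;

        my $ip = qr/\d+(?:\.\d+){3}/;

        # Hypothetical helper: returns (protocol, source, destination) in list
        # context, or an empty list if the line does not look like a denial.
        sub extract_flow {
            my ($line) = @_;
            return $line =~ /denied \s+ ([a-z]+) \s+ ($ip) \S* \s+ -> \s+ ($ip)/x;
        }

        my @tcp  = extract_flow('Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet');
        my @icmp = extract_flow('Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet');

        print join( ',', @tcp ),  "\n";
        print join( ',', @icmp ), "\n";
        ```

        The trade-off is that nothing before "denied" is validated, so a malformed line can slip through as long as its tail looks right.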

        Anyhow, here's my test bench, with my latest non-working version of the regex:


        --
        Mike
Re: Cisco Log Files: broken REGEX
by Abigail-II (Bishop) on Aug 22, 2003 at 08:10 UTC
    One technique I use in debugging long regexes like yours is to build the regex step-by-step, and test the regex after each step. So, in your example, start with the regex that matches the date, run that against your data and see whether it matches. If that's ok, extend the regex with the time, run it again against the data, then add the thing between brackets, etc, etc.

    Abigail
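
    The step-by-step approach can be automated with a small harness. This is a sketch only: first_failure() and the piece labels are invented for this example, and the pieces are transcribed from the revised regex posted earlier in the thread (the one with the square brackets added), with the unnecessary backslashes dropped. Run against the ICMP sample line, it stops at the piece labelled "action", which in that regex follows \d+ directly with no \s+ in between.

    ```perl
    use strict;
    use warnings;

    my $line = 'Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: '
             . '%SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet';

    # The regex cut into labelled pieces, in order.
    my @pieces = (
        [ 'first timestamp'  => qr/([A-Z][a-z]+\s+\d+\s+\d+:\d+:\d+)/ ],
        [ 'bracketed id'     => qr/\s+\[([.\d]+)\]/ ],
        [ 'message number'   => qr/\s+(\d+):/ ],
        [ 'second timestamp' => qr/\s+([A-Z][a-z]+\s+\d+\s+\d+:\d+:\d+)/ ],
        [ 'time zone'        => qr/\s+([A-Z]{3}):/ ],
        [ 'message type'     => qr/\s+%SEC-6-[A-Z]+:/ ],
        [ 'list number'      => qr/\s+list\s+\d+/ ],
        [ 'action'           => qr/([a-z]+)/ ],
        [ 'protocol'         => qr/\s+([a-z]+)/ ],
        [ 'source address'   => qr/\s+(\d+\.\d+\.\d+\.\d+)/ ],
        [ 'arrow'            => qr/\s+->/ ],
        [ 'dest address'     => qr/\s+(\d+\.\d+\.\d+\.\d+)/ ],
        [ 'type and code'    => qr/\s+\(\d+\/\d+\)/ ],
        [ 'packet count'     => qr/,\s+(\d)\s+packet$/ ],
    );

    # Grow the pattern one piece at a time and return the label of the first
    # piece that makes the match fail (undef if every piece matches).
    sub first_failure {
        my ( $text, @parts ) = @_;
        my $so_far = '';
        for my $part (@parts) {
            my ( $label, $re ) = @$part;
            $so_far .= $re;    # a qr object stringifies to its pattern
            return $label unless $text =~ /^$so_far/;
        }
        return;
    }

    my $culprit = first_failure( $line, @pieces );
    print defined $culprit
        ? "match first breaks at: $culprit\n"
        : "all pieces match\n";
    ```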
