
Perl: Extracting specific text from a .txt file and outputting into a new format

by ragingwhisky (Initiate)
on Nov 18, 2010 at 10:37 UTC (#872188, perlquestion)
ragingwhisky has asked for the wisdom of the Perl Monks concerning the following question:

Update:
==
The log file size and contents can vary, so capturing specific line numbers each time may not work; searching the entries is the only option.

    System      : system001-server
                  (System name differs by number only: system001..., system002..., system003 etc.)
    Box         : HW a2b 1234Mb 123.4, address is 0012.d345.1234 (abc 0012.d123.4567)
                  (Box name is relatively static; capturing from "HW a2b" to the bracketed data would get everything relevant)
    ByteIn      : 3385567095904
                  (ByteIn varies)
    Byteout     : 5816943852464
                  (Byteout varies)
    Description : example_system_server_round
                  (Description name varies; capturing the name after "Description" is the only way)
==
Hello, hoping someone can help. I'm looking to get a few scripts off the ground to help me with a laborious "log" culling, but my knowledge of Perl is a bit limited, so I am learning as I go. I had been experimenting with grep and while loops, e.g.

    while (<IN>) {
        if (m/Server*Example/ig) {
            print OUT $_, "\n";
            print $_;
        }
    }
But I had been falling short on how to specify each specific part. Out of all the log files I would ideally love to extract the following snippets: System, Box, ByteIn, Byteout, Description. From the log example below the data would be:

    System      : system001-server
    Box         : HW a2b 1234Mb 123.4, address is 0012.d345.1234 (abc 0012.d123.4567)
    ByteIn      : 3385567095904
    Byteout     : 5816943852464
    Description : example_system_server_example
Once the data has been extracted from the .txt file, it should be output to a new CSV file so I can read it into spreadsheets. Example, with the columns System, Box, ByteIn, Byteout, Description:

    system001-server,HW a2b 1234Mb 123.4, address is 0012.d345.1234 (abc 0012.d123.4567),3385567095904,5816943852464,example_system_server_example
This would be the ideal outcome. Any help in how to do this properly would be _VERY_ much appreciated. Cas.
log example (from .txt file):

    system001-server
    GigabitEthernet7/1 is up, line protocol is up (connected)
    HW a2b 1234Mb 123.4, address is 0012.d345.1234 (abc 0012.d123.4567)
    Description: example_system_server_example
    MTU 1234 bytes, BW 12345678 Kbit, DLY 10 usec,
    reliability 255/255, txload 1/255, rxload 1/255
    Encapsulation ARPA, loopback not set
    Keepalive set (10 sec)
    Full-duplex, 1000Mb/s, media type is 10/100/1000BaseT
    input flow-control is off, output flow-control is on
    Input queue: 0/2000/249/0 (size/max/drops/flushes); Total output drops: 9725919
    5 minute input rate 1918000 bits/sec, 166 packets/sec
    5 minute output rate 659000 bits/sec, 154 packets/sec
    3448177417 packets input, 3385567095904 bytes, 0 no buffer
    Received 8858 broadcasts (285 multicasts)
    4942780696 packets output, 5816943852464 bytes, 0 underrun
    GigabitEthernet7/2 is up, line protocol is up (connected)
    HW a2b.........
==

Re: Perl: Extracting specific text from a .txt file and outputting into a new format
by SimonClinch (Chaplain) on Nov 18, 2010 at 15:57 UTC
    I frequently have to parse all kinds of output, and every case is different, but I tend to go through the following process. Each step makes the next one trivial once you get the hang of it:

    1) identify the lexical structure of the material -- can it be multiline, does indentation matter, etc.?

    2) create a simple lexical analyser out of a hash of regexes and token names.

    3) create a thrower or two that ejects white space and/or empty lines, comments etc.

    4) create a trivial parser that calls the trivial lexer and thrower and has a subroutine to manage each type of opening landmark (encounter with an identifying string), typically loading it into a suitable structure or printing directly at the end of the section (via closing landmark)

    5) if not printing as we go, traverse and print the structure

    Update: code example of a lexer

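        package logparse;

        use strict;
        use warnings;

        # $fh is the filehandle the lexer reads from
        sub new {
            my ( $class, $fh ) = @_;
            return bless {
                FH  => $fh,
                LEX => {
                    '\w+'          => 'TOK_ID',
                    '[[:punct:]]+' => 'TOK_PUNCT',
                    # and so on for all character classes you identify
                },
            }, $class;
        }

        sub lex {
            my $self = shift;
            my $fh   = $self->{FH};

            # refill the buffer once the previous line has been consumed
            unless ( defined $self->{BUFFER} and length $self->{BUFFER} ) {
                $self->{BUFFER} = <$fh>;
                unless ( defined $self->{BUFFER} ) {
                    $self->{LEXVAL} = '';
                    return 'TOK_EOF';
                }
            }

            # try each pattern at the start of the buffer
            # (keys rather than each, so an early return cannot leave a stale iterator)
            for my $pat ( sort keys %{ $self->{LEX} } ) {
                $self->{BUFFER} =~ /^($pat)(.*)$/s or next;
                $self->{LEXVAL} = $1;
                $self->{BUFFER} = $2;
                return $self->{LEX}{$pat};
            }

            # nothing matched: consume one character so the lexer always advances
            $self->{LEXVAL} = substr( $self->{BUFFER}, 0, 1, '' );
            warn "unhandled content at line $.\n";
            return '';
        }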
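Steps 3 to 5 can be sketched at line level for the log in the question. This is only an illustration under assumptions, not the code used in practice: the sub names `throw_away` and `parse_log`, the landmark names, and the interface-header regex are all made up for the example, and the sample lines are taken from the original post.

```perl
use strict;
use warnings;

# Step 3, the thrower: true for lines that carry nothing (empty or whitespace only).
sub throw_away { return $_[0] =~ /^\s*$/ }

# Landmarks: a name plus a regex with one capture. The patterns are assumptions
# based on the sample log in the question.
my @landmarks = (
    [ SYSTEM    => qr/^(system\d+-server)\s*$/ ],
    [ INTERFACE => qr/^(\S+) is .*line protocol is / ],
    [ BOX       => qr/^\s*(HW a2b.*\))\s*$/ ],
    [ DESC      => qr/^\s*Description:\s*(.*?)\s*$/ ],
    [ BYTEIN    => qr/packets input, (\d+) bytes/ ],
    [ BYTEOUT   => qr/packets output, (\d+) bytes/ ],
);

# Step 4, the parser: one small handler per landmark, loading a list of records.
sub parse_log {
    my @lines = @_;
    my ( $system, @records );

    my %on = (
        SYSTEM    => sub { $system = shift },
        INTERFACE => sub { push @records, { System => $system, Interface => shift } },
        BOX       => sub { $records[-1]{Box}         = shift if @records },
        DESC      => sub { $records[-1]{Description} = shift if @records },
        BYTEIN    => sub { $records[-1]{ByteIn}      = shift if @records },
        BYTEOUT   => sub { $records[-1]{Byteout}     = shift if @records },
    );

    LINE: for my $line (@lines) {
        chomp $line;
        next LINE if throw_away($line);
        for my $landmark (@landmarks) {
            my ( $name, $re ) = @$landmark;
            if ( my ($value) = $line =~ $re ) {
                $on{$name}->($value);
                next LINE;
            }
        }
        # no landmark matched: an uninteresting line, ignore it
    }
    return @records;
}

my @sample = split /^/, <<'LOG';
system001-server
GigabitEthernet7/1 is up, line protocol is up (connected)
HW a2b 1234Mb 123.4, address is 0012.d345.1234 (abc 0012.d123.4567)
Description: example_system_server_example

3448177417 packets input, 3385567095904 bytes, 0 no buffer
4942780696 packets output, 5816943852464 bytes, 0 underrun
LOG

# Step 5: traverse and print the structure.
my @records = parse_log(@sample);
for my $r (@records) {
    print join( ' | ', map { defined $r->{$_} ? $r->{$_} : '' }
          qw(System Interface Box ByteIn Byteout Description) ), "\n";
}
```

Keeping the regexes in one table and the actions in another means a new field is one extra row in each, with the main loop left untouched.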

    One world, one people

Re: Perl: Extracting specific text from a .txt file and outputting into a new format
by fisher (Priest) on Nov 18, 2010 at 10:49 UTC
    How do we find the system name in this log? Please explain in natural language. Does it have a static position (say, the very first line)? Is there an empty line before it? Or is there some pattern to look for (say, system\d+-.*)?
      well, so what seems to be the problem?
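          #!/usr/bin/env perl
          use strict;
          use warnings;

          # "fff" is the name of the log file
          open my $fh, '<', 'fff' or die "Aaaahg... cannot open fff: $!";

          # current system
          my $system;

          # our params here, keyed by system name
          # (a later interface of the same system overwrites an earlier one)
          my ( %box, %bin, %bout, %desc );

          while (<$fh>) {
              chomp;
              if (/^(system\d+-server)$/) { $system = $1; next }
              next unless defined $system;

              # leading whitespace is tolerated in case the log is indented
              /^\s*(HW a2b.*\(.*\))$/ and $box{$system} = $1;

              # capture the byte count, not the packet count
              /^\s*\d+ packets input, (\d+) bytes/  and $bin{$system}  = $1;
              /^\s*\d+ packets output, (\d+) bytes/ and $bout{$system} = $1;
              /^\s*Description: (.*)$/              and $desc{$system} = $1;
          }
          close $fh;

          foreach ( sort keys %desc ) {
              print "System: $_\n";
              print "Box: $box{$_}\n";
              print "ByteIn: $bin{$_}\n";
              print "Byteout: $bout{$_}\n";
              print "Description: $desc{$_}\n";
          }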
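The script above prints labelled lines, while the question asks for CSV. Note that the Box value itself contains a comma, so it has to be quoted or a spreadsheet will split it into two columns. Below is a minimal sketch of the usual quoting rule (quote a field when it contains a comma, a double quote or a line break, and double any embedded quotes). The helper names `csv_field` and `csv_row` are made up for this example; the Text::CSV module on CPAN is the more robust choice for real data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Quote one field only when it needs it: embedded comma, double quote or line break.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ( $value =~ /[",\r\n]/ ) {
        $value =~ s/"/""/g;        # double any embedded quotes
        $value = qq{"$value"};
    }
    return $value;
}

# Join a list of fields into one CSV line (without the trailing newline).
sub csv_row {
    return join ',', map { csv_field($_) } @_;
}

# Example record using the values from the original post.
my @record = (
    'system001-server',
    'HW a2b 1234Mb 123.4, address is 0012.d345.1234 (abc 0012.d123.4567)',
    '3385567095904',
    '5816943852464',
    'example_system_server_example',
);

print csv_row(qw(System Box ByteIn Byteout Description)), "\n";
print csv_row(@record), "\n";
```

In the loop above, the five print statements could then be replaced by a single `print csv_row(...)` call per system.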
Re: Perl: Extracting specific text from a .txt file and outputting into a new format
by biohisham (Priest) on Nov 18, 2010 at 14:39 UTC
    This should get you started. We don't know beforehand whether a given line has the log entry that we need to capture, but we assume that the log entries follow a certain order that may be interspersed with other, uninteresting log entries (i.e. each new log record starts with the 'System' entry). In the code below I am capturing the server name, the bytes in and the bytes out into a hash of incrementally numbered records. You can extend that to capture the rest of the entries you seek.
        #!/usr/local/bin/perl
        use strict;
        use warnings;
        use Data::Dump qw(pp);

        my %log;
        my $record = 0;    # to be incremented on each round

        while ( my $line = <DATA> ) {
            chomp $line;
            my ( $system, $server, $b_in, $byteIn, $b_out, $byteOut );
            if ( $line =~ /System/ ) {
                $record++;
                ( $system, $server ) = split / : /, $line;
                push @{ $log{ 'record' . $record } }, { $system => $server };
            }
            elsif ( $line =~ /ByteIn/ ) {
                ( $b_in, $byteIn ) = split / : /, $line;
                push @{ $log{ 'record' . $record } }, { $b_in => $byteIn };
            }
            elsif ( $line =~ /Byteout/ ) {
                ( $b_out, $byteOut ) = split / : /, $line;
                push @{ $log{ 'record' . $record } }, { $b_out => $byteOut };
            }
        }
        print pp \%log;

        __DATA__
        System : system001-server
        Box : HW a2b 1234Mb 123.4, address is 0012.d345.1234 (abc 0012.d123.4567)
        Byteout : 5816943852464
        ByteIn : 4
        Description : example_system_server1_example
        System : system002-server
        Box : HW a2b 1234Mb 123.4,
        ByteIn : 3385
        Byteout : 58169
        Description : example_system_server2_example
    Output:

        {
          record1 => [
            { System => "system001-server" },
            { Byteout => "5816943852464" },
            { ByteIn => 4 },
          ],
          record2 => [
            { System => "system002-server" },
            { ByteIn => 3385 },
            { Byteout => 58169 },
          ],
        }
    Further reading: Data::Dump, the Data Structures Cookbook (perldsc), and perlref.
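One possible way to extend the idea to all five entries is a single "Label : value" regex and one hash per record instead of a list of one-key hashes. This is a sketch under the same assumption as above (a System line opens a new record); the sub name `parse_records` and the `%wanted` table are made up for the example.

```perl
use strict;
use warnings;

# Labels we care about; anything else is ignored.
my %wanted = map { $_ => 1 } qw(System Box ByteIn Byteout Description);

# Turn "Label : value" lines into a list of hashes, one per record.
# A "System" line is assumed to open a new record.
sub parse_records {
    my @lines = @_;
    my @records;
    for my $line (@lines) {
        chomp $line;
        my ( $label, $value ) = $line =~ /^\s*(\w+)\s*:\s*(.*?)\s*$/ or next;
        next unless $wanted{$label};
        push @records, {} if $label eq 'System';
        next unless @records;    # skip anything before the first System line
        $records[-1]{$label} = $value;
    }
    return @records;
}

my @records = parse_records(
    "System : system001-server\n",
    "Box : HW a2b 1234Mb 123.4, address is 0012.d345.1234 (abc 0012.d123.4567)\n",
    "Byteout : 5816943852464\n",
    "ByteIn : 4\n",
    "Description : example_system_server1_example\n",
    "System : system002-server\n",
    "ByteIn : 3385\n",
    "Byteout : 58169\n",
    "Description : example_system_server2_example\n",
);

for my $r (@records) {
    print "$r->{System}: in=$r->{ByteIn} out=$r->{Byteout} ($r->{Description})\n";
}
```

With one hash per record, a field is reached directly as `$r->{ByteIn}` rather than by searching a list of single-key hashes.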

    Excellence is an Endeavor of Persistence. A Year-Old Monk :D .
Re: Perl: Extracting specific text from a .txt file and outputting into a new format
by sundialsvc4 (Abbot) on Nov 19, 2010 at 14:27 UTC

    It might be useful to apply "state machine" logic here. Or even to approach the parsing task using a tool such as Parse::RecDescent (which has become a very well-known tool to me as of late...).

    Basically, it seems that the best way to describe this problem is that "the proper interpretation of what is in front of me now depends upon what surrounds it; on what has come before." That contextual knowledge can be represented in a "state." And the greater task might well be expressible using a grammar, hence my suggestion of a true parser.
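A hand-rolled state machine for this log might look roughly like the sketch below. It is not Parse::RecDescent and not a full grammar; the state names, the sub name `parse_with_states` and the regexes are assumptions based on the sample in the question, and the stray Description line and the second interface's description are made-up test data. The point is that a Description or byte-count line is only interpreted while inside an interface block.

```perl
use strict;
use warnings;

sub parse_with_states {
    my @lines = @_;
    my ( @records, $system, $current );
    my $state = 'WANT_SYSTEM';

    # Finish the record being built, if any.
    my $flush = sub { push @records, $current if $current; undef $current };

    # Shared by all states: does this line open a new section?
    # Returns the next state, or nothing if the line is not a section opener.
    my $opens_section = sub {
        my ($line) = @_;
        if ( $line =~ /^(system\d+-server)\s*$/ ) {
            $flush->();
            $system = $1;
            return 'WANT_INTERFACE';
        }
        if ( defined $system and $line =~ /^(\S+) is .*line protocol is / ) {
            $flush->();
            $current = { System => $system, Interface => $1 };
            return 'IN_INTERFACE';
        }
        return;
    };

    # One handler per state; each returns the state to move to.
    my %step = (
        WANT_SYSTEM    => sub { $opens_section->( $_[0] ) || 'WANT_SYSTEM' },
        WANT_INTERFACE => sub { $opens_section->( $_[0] ) || 'WANT_INTERFACE' },
        IN_INTERFACE   => sub {
            my ($line) = @_;
            my $next = $opens_section->($line);
            return $next if $next;
            $current->{Box}         = $1 if $line =~ /^\s*(HW a2b.*\))\s*$/;
            $current->{Description} = $1 if $line =~ /^\s*Description:\s*(.*?)\s*$/;
            $current->{ByteIn}      = $1 if $line =~ /packets input, (\d+) bytes/;
            $current->{Byteout}     = $1 if $line =~ /packets output, (\d+) bytes/;
            return 'IN_INTERFACE';
        },
    );

    for my $line (@lines) {
        chomp $line;
        $state = $step{$state}->($line);
    }
    $flush->();
    return @records;
}

my @sample = split /^/, <<'LOG';
Description: stray line before any system
system001-server
GigabitEthernet7/1 is up, line protocol is up (connected)
HW a2b 1234Mb 123.4, address is 0012.d345.1234 (abc 0012.d123.4567)
Description: example_system_server_example
3448177417 packets input, 3385567095904 bytes, 0 no buffer
4942780696 packets output, 5816943852464 bytes, 0 underrun
GigabitEthernet7/2 is up, line protocol is up (connected)
Description: second_interface
LOG

my @records = parse_with_states(@sample);
print scalar(@records), " interface record(s) found\n";
```

Each interface becomes its own record here, so several interfaces under one system name do not overwrite each other.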
