PerlMonks  

How can I solve this text manipulation problem with perl?

by Tailor (Initiate)
on May 06, 2013 at 17:29 UTC ( [id://1032354]=perlquestion )

Tailor has asked for the wisdom of the Perl Monks concerning the following question:

Before I start, a quick confession: learning Perl has been on my todo list for years. I can laboriously read through a Perl script without getting too lost, but I wouldn't even be a Friar of Perl yet. I was hoping to use this task as a way to make me learn some Perl, but it's apparently too far beyond me. I almost have a working solution using shell and awk, but by all reports this is the type of problem where Perl blows those tools out of the water, and I'd really like to gain some understanding of Perl while also, hopefully, getting a solution for this mountain of data I have to convert. The problem in brief (well, I was going to say brief, but it seems kind of long now that I've typed it out): I have been handed about a year's worth of logs collected every hour from a server.

This is the base iteration of one of the runs (it runs every 5 minutes + every hour):

    2350
    id pool                type rid rset                min max size used load
     5 SUNWtmp_serverxd1z1 pset   1 SUNWtmp_serverxd1z1 104 104  104 0.00 6.25
     4 SUNWtmp_serverxd1z2 pset   2 SUNWtmp_serverxd1z2  16  16   16 0.00 0.91
     0 pool_default        pset  -1 pset_default         24 66K   24 0.00 1.74
    id pool                type rid rset                min max size used load
     5 SUNWtmp_serverxd1z1 pset   1 SUNWtmp_serverxd1z1 104 104  104 5.01 6.21
     4 SUNWtmp_serverxd1z2 pset   2 SUNWtmp_serverxd1z2  16  16   16 0.97 0.91
     0 pool_default        pset  -1 pset_default         24 66K   24 3.73 1.78

(Output truncated, but it goes on for 50 lines from one timestamp until the next.)

Each run is 50 lines long. They all get combined into a file of around 14400 lines for each day, with the first field of each line being the date derived from the file name. Here is what they want it to look like. Field position in terms of white space doesn't seem to matter, just relative field position. That includes the new field "int", which is shown iterating to 02 here but would actually only advance once every 50 lines (one complete data collection run), and then start back at 01.

    date     hhmm int id pool                type rid rset                min max size used load
    20121105 2350 01   5 SUNWtmp_serverxd1z1 pset   1 SUNWtmp_serverxd1z1 104 104  104 0.00 6.25
    20121105 2350 01   4 SUNWtmp_serverxd1z2 pset   2 SUNWtmp_serverxd1z2  16  16   16 0.00 0.91
    20121105 2350 01   0 pool_default        pset  -1 pset_default         24 66K   24 0.00 1.74
    date     hhmm int id pool                type rid rset                min max size used load
    20121105 2350 02   5 SUNWtmp_serverxd1z1 pset   1 SUNWtmp_serverxd1z1 104 104  104 5.01 6.21
    20121105 2350 02   4 SUNWtmp_serverxd1z2 pset   2 SUNWtmp_serverxd1z2  16  16   16 0.97 0.91
    20121105 2350 02   0 pool_default        pset  -1 pset_default         24 66K   24 3.73 1.78

I've tried a few sed and awk one-liners, but I've come to the sad realization that not only are they not all that good for this kind of scenario, I really want to see how this can be done in Perl. I've never had to manipulate text in any way more complex than a one-liner could handle, and at this point I see this file needing something more complex than my one-liners, but perhaps not more complex than a Perl Monk's one-liner. The text in the date column is derived from the name of the incoming file, e.g. 20121003-poolstat_a_serverd1z0.txt. The time is the 4-digit number that appears every 50 lines. The int field needs to advance each time poolstat is run. See below for details.

In summary, the only fields that need to change in the mostly numeric lines are:

field 1: the 8-digit date, derived from the filename, e.g. 20121003-poolstat_a_serverd1z0.txt
field 2: the 4-digit time that is inside the file every 50th line
field 3: the iteration count, based on digits 3 and 4 of the 4-digit time, as follows:

    minute of run: 00 05 10 15 20 25 30 35 40 45 50 55
    iteration:     01 02 03 04 05 06 07 08 09 10 11 12

The rest is just printing out existing fields: it's a matter of getting those three onto the line, then an awk (or other) command to print out the other 10 fields, all while keeping track of the current iteration. And just to keep things complex, the mostly alpha header line also needs 3 new fields: "date hhmm int". The rest of its fields are headers supplied by poolstat, but they need to somehow be appended to the "date hhmm int" string without something weird happening.
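The two derived fields in that summary could be sketched in Perl roughly as follows. This is only an illustration under stated assumptions: the helper names date_from_filename and iteration_from_hhmm are made up for the example, the filename is assumed to be passed without a directory prefix, and a minute that is not on the 5-minute grid is assumed to round down to the previous slot.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the 8-digit date off the front of a name such as
# 20121003-poolstat_a_serverd1z0.txt. Returns undef if the name does not
# start with eight digits and a hyphen.
sub date_from_filename {
    my ($filename) = @_;
    my ($date) = $filename =~ /^(\d{8})-/;
    return $date;
}

# Hypothetical helper: map digits 3 and 4 of the hhmm stamp (the minute)
# to the run number: 00 -> 01, 05 -> 02, ... 55 -> 12.
# Assumption: off-grid minutes such as 07 round down to the 05 slot.
sub iteration_from_hhmm {
    my ($hhmm) = @_;
    my $minute = substr($hhmm, 2, 2);
    return sprintf '%02d', int($minute / 5) + 1;
}

print date_from_filename('20121003-poolstat_a_serverd1z0.txt'), "\n";
print iteration_from_hhmm('2300'), "\n";
```

With the mapping from the table, a stamp of 2300 gives 01 and a stamp of 2355 gives 12.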

Replies are listed 'Best First'.
Re: How can I solve this text manipulation problem with perl?
by SuicideJunkie (Vicar) on May 06, 2013 at 19:02 UTC

    I would start by reading each line, ignoring it if it is blank/unrecognized, swapping it for the new header if it was a header, and if it looks like data, splitting it up into a list of values using split on whitespace.

    Add any new values needed, such as the date from filename, and then use printf to write it back out in the desired order with fixed width fields.
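As a rough sketch of that read, classify, split and printf loop (not the code from this reply): the helper name munge_line, the $state hash, and the column widths are all assumptions made for the example. It also assumes a timestamp line holds nothing but the 4-digit hhmm, and that header and data lines both have exactly 10 whitespace-separated fields, as in the sample above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one input line into one output line.
# Returns undef for lines that produce no output (blank, timestamp,
# unrecognized). $state holds the date plus the latest hhmm and int.
sub munge_line {
    my ($line, $state) = @_;
    my @f = split ' ', $line;    # ' ' skips leading whitespace, splits on runs
    return unless @f;            # blank line

    # A lone 4-digit field is taken to be the hhmm stamp that starts a run.
    if (@f == 1 && $f[0] =~ /^\d{4}$/) {
        $state->{hhmm} = $f[0];
        $state->{int}  = sprintf '%02d', int(substr($f[0], 2, 2) / 5) + 1;
        return;
    }
    return unless @f == 10;      # neither header nor data

    my @lead;
    if    ($f[0] eq 'id')    { @lead = qw(date hhmm int) }              # header
    elsif ($f[0] =~ /^\d+$/) { @lead = @{$state}{qw(date hhmm int)} }   # data
    else                     { return }

    # Widths here are illustrative, not taken from the original post.
    return sprintf '%-8s %-4s %-3s %3s %-20s %-4s %3s %-20s %4s %4s %4s %5s %5s',
        @lead, @f;
}

my %state  = (date => '20121105');
my @sample = (
    '2350',
    ' id pool type rid rset min max size used load',
    '  5 SUNWtmp_serverxd1z1 pset 1 SUNWtmp_serverxd1z1 104 104 104 0.00 6.25',
    '',
);
for my $line (@sample) {
    my $out = munge_line($line, \%state);
    print "$out\n" if defined $out;
}
```

Keeping the running state in a hash passed by reference means the same helper can be reused across files by resetting the date for each one.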

    Seems I got a bit carried away tho:

    # Fill in ...'s with more code as needed.
    use strict;
    use warnings;

    my $filename    = 'somelog.log';
    my $outfilename = 'mungedlog.log';
    # Must match the header line in the log exactly, spacing included.
    my $oldHeader   = ' id pool type rid rset min max size used load';

    # Define the output format, and where the data comes from
    my @newFormatColumns = (
        {title=>'date', length=>10, format=>'s', splitDataIndex=>15},
        ...
        {title=>'id',   length=>3,  format=>'d', splitDataIndex=>0},
        ...
    );

    # Build new header line automagically based on above AoH
    my $newHeaderFormat = '';
    $newHeaderFormat .= '%' . $_->{length} . 's' for @newFormatColumns;
    my $newHeader = sprintf($newHeaderFormat, map { $_->{title} } @newFormatColumns);

    open my $iFH, '<', $filename    or die "Can't open '$filename' because: $!\n";
    open my $oFH, '>', $outfilename or die "Can't open '$outfilename' because: $!\n";

    # Grind through file until done.
    while (my $line = <$iFH>) {
        chomp $line;
        if ($line =~ /.../) {    # looks like data
            # Munge data lines into new format:
            # grab data from the line read
            # (split ' ' skips leading whitespace and splits on runs of it)
            my @data = split ' ', $line;
            # add new values to the end
            push @data, 'new values here', 'here', 'and here';
            # Build new line of data
            $line = '';
            # this is basically a set of:
            #   $line .= sprintf("%20s", $data[1]);
            # where $data[1] is the pool string for example,
            # and the '20', the 's' and the '1' all come
            # from the AoH table declared at the beginning.
            $line .= sprintf("%$_->{length}$_->{format}", $data[$_->{splitDataIndex}])
                for @newFormatColumns;
        }
        elsif ($line eq $oldHeader) {
            # Change header lines
            $line = $newHeader;
        }
        else {
            # Leave unrecognized lines alone.
        }
        # And don't forget to print it all to the output file.
        print $oFH "$line\n";
    }
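To make the table-driven formatting idea concrete, here is a small self-contained sketch with the placeholders swapped for a few example columns. It is not the author's filled-in table: the five columns, their widths, the key name index, and the helper names format_header and format_row are all invented for illustration, and the two new values are assumed to be pushed onto the end of the split data so they land at indexes 10 and 11.

```perl
use strict;
use warnings;

# Hypothetical column table: titles follow the post, but the widths and the
# choice of columns are illustrative only.
my @columns = (
    { title => 'date', length => 9,  format => 's', index => 10 },
    { title => 'hhmm', length => 5,  format => 's', index => 11 },
    { title => 'id',   length => 4,  format => 'd', index => 0  },
    { title => 'pool', length => 21, format => 's', index => 1  },
    { title => 'load', length => 6,  format => 's', index => 9  },
);

# Build one sprintf format for the header (titles are always strings)
# and one for data rows (each column uses its own conversion).
my ($header_fmt, $row_fmt) = ('', '');
for my $col (@columns) {
    $header_fmt .= "%$col->{length}s";
    $row_fmt    .= "%$col->{length}$col->{format}";
}

sub format_header {
    return sprintf $header_fmt, map { $_->{title} } @columns;
}

sub format_row {
    my (@data) = @_;
    return sprintf $row_fmt, map { $data[ $_->{index} ] } @columns;
}

my @data = split ' ',
    '  5 SUNWtmp_serverxd1z1 pset 1 SUNWtmp_serverxd1z1 104 104 104 0.00 6.25';
push @data, '20121105', '2350';    # new values land at indexes 10 and 11
print format_header(), "\n", format_row(@data), "\n";
```

Because the header and the rows are generated from the same table, reordering or resizing a column only needs one change, and the header stays lined up with the data.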

Node Type: perlquestion [id://1032354]
Approved by Corion
Front-paged by davido