How can I solve this text manipulation problem with perl?

Tailor has asked for the wisdom of the Perl Monks concerning the following question:

Before I start, a quick confession: learning Perl has been on my todo list for years. I can laboriously read through a Perl script without getting too lost, but I wouldn't even be a Friar of Perl yet. I was hoping to use this task as a way to make me learn some Perl, but it's apparently too far beyond me. I almost have a working solution using shell and awk, but by all reports this is the type of problem that Perl can blow those things out of the water on, and I'd really like to gain some understanding of Perl while also hopefully getting a solution for this mountain of data I have to convert.

The problem in brief (I was going to say brief, but it seems kinda long now that I've typed it out): I have been handed about a year's worth of logs collected every hour from a server.

This is the base iteration of one of the runs (it runs every 5 minutes, every hour):

    2350
     id pool                 type rid rset                  min  max size used load
      5 SUNWtmp_serverxd1z1      pset   1 SUNWtmp_serverxd1z1       104  104  104 0.00 6.25
      4 SUNWtmp_serverxd1z2      pset   2 SUNWtmp_serverxd1z2        16   16   16 0.00 0.91
      0 pool_default         pset  -1 pset_default           24  66K   24 0.00 1.74

     id pool                 type rid rset                  min  max size used load
      5 SUNWtmp_serverxd1z1      pset   1 SUNWtmp_serverxd1z1       104  104  104 5.01 6.21
      4 SUNWtmp_serverxd1z2      pset   2 SUNWtmp_serverxd1z2        16   16   16 0.97 0.91
      0 pool_default         pset  -1 pset_default           24  66K   24 3.73 1.78

    > output truncated, but it goes on for 50 lines from the prior timestamp, until the next one.

Each run is 50 lines long (they all get combined into a file that is around 14400 lines for each day, with the field in the front of each line being the date derived from the file name). Here is what they want it to look like. Field position in terms of white space doesn't seem to matter, just relative field position, including the new field "int", which is shown iterating to 2 but would actually only iterate once every 50 lines (the complete data collection run), and then start back at 01.

    date     hhmm int id pool                type rid rset                  min  max size used load
    20121105 2350 01  5 SUNWtmp_serverxd1z1      pset   1 SUNWtmp_serverxd1z1       104  104  104 0.00 6.25
    20121105 2350 01  4 SUNWtmp_serverxd1z2      pset   2 SUNWtmp_serverxd1z2        16   16   16 0.00 0.91
    20121105 2350 01  0 pool_default         pset  -1 pset_default           24  66K   24 0.00 1.74

    date     hhmm int id pool                type rid rset                  min  max size used load
    20121105 2350 02  5 SUNWtmp_serverxd1z1      pset   1 SUNWtmp_serverxd1z1       104  104  104 5.01 6.21
    20121105 2350 02  4 SUNWtmp_serverxd1z2      pset   2 SUNWtmp_serverxd1z2        16   16   16 0.97 0.91
    20121105 2350 02  0 pool_default         pset  -1 pset_default           24  66K   24 3.73 1.78

I've tried a few sed and awk one-liners, but have come to the sad realization that not only are they not all that good for this kind of scenario, I really want to see how this can be done in Perl. I've never had to manipulate text in any way that was more complex than a one-liner could handle, and at this point I see this file needing something more complex than my one-liners, but perhaps not more complex than a Perl Monk's one-liner.

The text in the date column is derived from the file name coming in, e.g. 20121003-poolstat_a_serverd1z0.txt. The time is the 4-digit number that appears every 50 lines. The int field needs to iterate each time poolstat is run. See below for details.
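As a rough illustration of those two lookups, here is a minimal sketch. The helper names `date_from_filename` and `time_from_line` are made up for this example, and it assumes the date is always the first 8 digits of the file name followed by a hyphen, and that a timestamp line contains nothing but the 4 digits.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the 8-digit date off the front of a log file name.
sub date_from_filename {
    my ($filename) = @_;
    # Strip any leading directory, then expect YYYYMMDD followed by a hyphen.
    (my $base = $filename) =~ s{.*/}{};
    return $base =~ /^(\d{8})-/ ? $1 : undef;
}

# Hypothetical helper: return the hhmm stamp if the line holds only four digits.
sub time_from_line {
    my ($line) = @_;
    return $line =~ /^\s*(\d{4})\s*$/ ? $1 : undef;
}

print date_from_filename('20121003-poolstat_a_serverd1z0.txt'), "\n";  # 20121003
print time_from_line('    2350'), "\n";                                # 2350
```

Both helpers return undef when the pattern does not match, so a caller can use that to tell timestamp lines apart from header and data lines.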

In summary, the only fields that need to be changed in the mostly numeric lines are:

field 1: the 8-digit date, derived from the file name, e.g. 20121003-poolstat_a_serverd1z0.txt
field 2: the 4-digit time that is inside the file every 50th line
field 3: the iteration count, based on digits 3 and 4 of the 4-digit time, as follows:

    00-05-10-15-20-25-30-35-40-45-50-55 minute of run
    01-02-03-04-05-06-07-08-09-10-11-12 iteration

The rest is just printing out existing fields. It's a matter of getting those onto a line, then using awk (or another command) to print out the other 10 fields, all while keeping track of the current iteration.

And just to keep things complex, the mostly alpha header line also needs 3 new fields, "date hhmm int". The rest of the fields are headers supplied by poolstat, but they need to somehow be appended to the "date hhmm int" string without something weird happening.
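The minute-to-iteration table above reduces to a little arithmetic: divide the minute by 5 and add 1. A minimal sketch follows, assuming that table is the intended rule (the sample output earlier shows 01 and 02 against 2350, so treat this as one reading of the requirement). The helper name `iteration_from_hhmm` is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: map an hhmm stamp to the 01..12 iteration number.
sub iteration_from_hhmm {
    my ($hhmm) = @_;
    my $minute = substr($hhmm, 2, 2);              # digits 3 and 4
    return sprintf '%02d', int($minute / 5) + 1;   # 00 -> 01, 05 -> 02, ... 55 -> 12
}

printf "%s -> %s\n", $_, iteration_from_hhmm($_) for qw(2300 2305 2350 2355);
```

With the four stamps shown, this prints iterations 01, 02, 11 and 12.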



Re: How can I solve this text manipulation problem with perl?
by SuicideJunkie (Vicar) on May 06, 2013 at 19:02 UTC

I would start by reading each line, ignoring it if it is blank/unrecognized, swapping it for the new header if it was a header, and if it looks like data, splitting it up into a list of values using split on whitespace.

Add any new values needed, such as the date from filename, and then use printf to write it back out in the desired order with fixed width fields.

Seems I got a bit carried away tho:

    # Fill in the ...'s with more code as needed.
    use strict;
    use warnings;

    my $filename    = 'somelog.log';
    my $outfilename = 'mungedlog.log';

    my $oldHeader = '     id pool                 type rid rset                  min  max size used load';

    # Define the output format, and where the data comes from.
    # splitDataIndex is a position in @data: the ten split fields are 0..9,
    # and the new values pushed on below start at 10.
    my @newFormatColumns = (
        {title=>'date', length=>10, format=>'s', splitDataIndex=>10},
        # ...
        {title=>'id', length=>3, format=>'d', splitDataIndex=>0},
        # ...
    );

    # Build new header lines automagically based on above AoH
    my $newHeaderFormat = '';
    $newHeaderFormat .= '%' . $_->{length} . 's' for @newFormatColumns;
    my $newHeader = sprintf($newHeaderFormat, map { $_->{title} } @newFormatColumns);

    open my $iFH, '<', $filename or die "Can't open '$filename' because: $!\n";
    open my $oFH, '>', $outfilename or die "Can't open '$outfilename' because: $!\n";

    # Grind through file until done.
    while (my $line = <$iFH>)
    {
      chomp $line;
      # Placeholder: as written, /.../ matches any line of 3+ characters,
      # so swap in a pattern that only matches data lines.
      if ($line =~ /.../) # looks like data
      {
        # Munge data lines into new format
        # grab data from the line read; split ' ' skips the leading blanks
        # and treats runs of whitespace as one separator
        my @data = split ' ', $line;
        # add new values to the end
        push @data, 'new values here', 'here', 'and here';

        # Build new line of data
        $line = '';

        # this is basically a set of:
        # $line .= sprintf ("%20s", $data[1]);
        # where $data[1] is the pool string for example.
        # and the '20', the 's' and the '1' all come
        # from the AoH table declared at the beginning.

        $line .= sprintf ("%$_->{length}$_->{format}", $data[$_->{splitDataIndex}]) for @newFormatColumns;

      }elsif ($line eq $oldHeader){
        # Change header lines
        $line = $newHeader;
      }else{
        # Leave unrecognized lines alone.
      }
      # And don't forget to print it all to the output file.
      print $oFH "$line\n";
    }
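For comparison, the same read, split and printf idea can be sketched in a smaller self-contained form that works on a list of lines instead of files. This is only an illustration, not the code above filled in: `convert_lines` is a hypothetical helper, the column widths are arbitrary, the iteration is assumed to follow the minute table from the question, and unlike the code above it drops unrecognized lines instead of passing them through.

```perl
use strict;
use warnings;

# Hypothetical helper: convert the lines of one poolstat log into the new layout.
# $date is the 8-digit date already taken from the file name.
sub convert_lines {
    my ($date, @lines) = @_;
    # One format for header and data rows: 3 new columns plus the 10 poolstat ones.
    my $fmt = '%-8s %-4s %-3s %2s %-20s %-4s %3s %-20s %4s %4s %4s %4s %4s';
    my ($hhmm, $int) = ('', '');
    my @out;
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^\s*(\d{4})\s*$/) {
            # Timestamp line: remember it and derive the iteration from the minutes.
            $hhmm = $1;
            $int  = sprintf '%02d', int(substr($hhmm, 2, 2) / 5) + 1;
        }
        elsif ($line !~ /\S/) {
            # Blank line: keep it so the groups stay visually separated.
            push @out, '';
        }
        elsif ($line =~ /^\s*id\s+pool\b/) {
            # Header line: prefix the three new column titles.
            my @titles = split ' ', $line;
            push @out, sprintf $fmt, 'date', 'hhmm', 'int', @titles;
        }
        else {
            # Data lines have ten fields and start with a numeric id; skip the rest.
            my @f = split ' ', $line;
            next unless @f == 10 && $f[0] =~ /^\d+$/;
            push @out, sprintf $fmt, $date, $hhmm, $int, @f;
        }
    }
    return @out;
}

my @sample = (
    '2350',
    ' id pool                 type rid rset                  min  max size used load',
    '  5 SUNWtmp_serverxd1z1      pset   1 SUNWtmp_serverxd1z1       104  104  104 0.00 6.25',
    '  0 pool_default         pset  -1 pset_default           24  66K   24 0.00 1.74',
);
print "$_\n" for convert_lines('20121105', @sample);
```

Using `%s` for every column keeps values such as 66K and -1 intact; a numeric format like `%d` would mangle the former.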