Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
All:
For those of you who are interested in JUST the problem and not the background - skip to the text identified with PROBLEM.
I just finished a month-long project that added a suite of related scripts to our "toolbox" arsenal at work. Of course, the second I was done and put it into production, I received a thousand feature requests from co-workers. Some of these requests were easy to accommodate, while others require redesigning and won't be done for a long time to come. One of them seems deceptively simple, and I was hoping you all could help me with it.
One of the scripts monitors a directory for transient files (race condition) containing "bad" information and moves them to another directory and writes a log of it. Since this needed to be super duper ultra fast, I wrote as little to the log file as possible. For instance:
Instead of giving it a human readable date/time stamp - I used seconds since epoch.
I left out two columns (directory name & direction) so that print didn't have to interpolate two variables.
Now, many instances of the above script are running on multiple directories - so there are multiple logs. I wrote a companion script to parse those logs and put them in human-readable format. For instance:
The timestamp is converted back into human readable format
The directory name and direction are inserted into the output, derived from the filename/path.
Here is the script:
#!/usr/bin/perl -w
use strict;
use Getopt::Std;
use Time::Local;
use POSIX qw(strftime);

$|++;
my %Opt;
my @Conns;

&GetArgs();
&GetConns();
&GetLogs();

sub GetArgs {
    my $Usage = qq{Usage: $0 [options]
    -h : This help message.
    -c : Specific connector - default is to list all connectors.
    -d : Specific direction - default is to list all directions
    -n : Trap name - default is to list all names
    -t : Time in stamp format mm/dd/yy-hh:mm or mm/dd/yy
         +<stamp> - show entries created after specified stamp
                    If time is not given, defaults to 23:59
         -<stamp> - show entries created before specified stamp
                    If time is not given, default to 00:00
         =<stamp> - show entries created on specified stamp
                    If time is not given it is ignored (all day)
         <stamp>+-<stamp> - show entries created between specified stamps
                    If time is not given on first stamp, 00:00 is used
                    If time is not given on second stamp, 23:59 is used
         Note: This includes the day(s) specified
    -s : Size of files caught in bytes
         +<size> - show entries with files larger than specified size
         -<size> - show entries with files smaller than specified size
         =<size> - show entries with files equal to specified size
         <size>+-<size> - show entries with files between specified sizes
} . "\n";
    getopts( 'hc:d:n:t:s:', \%Opt ) or die "$Usage";
    die "$Usage" if $Opt{h};
    if ($Opt{d}) {
        $Opt{d} = lc($Opt{d});
        die "$Usage" if ($Opt{d} ne "in" && $Opt{d} ne "out" && $Opt{d} ne "both");
    }
}

sub GetConns {
    open (CONNECTORS,"/var/wt400/conf/_wtd.cfg") or die "\nUnable to open connector file!\n";
    while (<CONNECTORS>) {
        next unless ($_ =~ /^unit="(.*)"/);
        my $Conn = lc($1);
        next if ($Conn eq "ins" || $Conn eq "ins2" || $Conn eq "_wtd");
        push @Conns , $Conn;
    }
    close (CONNECTORS);
    if ($Opt{c}) {
        $Opt{c} = lc($Opt{c});
        if (grep /\b$Opt{c}\b/ , @Conns) {
            @Conns = $Opt{c};
        }
        else {
            die "\nInvalid connector - $Opt{c} !\n";
        }
    }
}

sub GetLogs {
    my @Logs;
    foreach my $Conn (@Conns) {
        my @Directions;
        if ($Opt{d}) {
            @Directions = $Opt{d};
        }
        else {
            @Directions = (qw(in out both));
        }
        foreach my $Dir (@Directions) {
            push @Logs , "/var/spool/wt400/log/$Conn/trap_${Dir}.log"
                if (-r "/var/spool/wt400/log/$Conn/trap_${Dir}.log" && -s _);
        }
    }
    unless (@Logs) {
        die "\nUnable to find any logs!\n";
    }
    else {
        while (my $File = shift @Logs) {
            my($mon, $day, $year, $hour, $min);
            open(LOG,$File);
          LINE:
            while (my $Line = <LOG>) {
                chomp $Line;
                my @Fields = split " " , $Line;
                if ($Opt{n}) {
                    next unless (lc($Opt{n}) eq lc($Fields[3]));
                }
                if ($Opt{t}) {
                    $Opt{t} =~ s/\s+//;
                    my $Stamp1;
                    my $Stamp2;
                    if ($Opt{t} =~ /^\+(.*)/) {
                        ($mon, $day, $year, $hour, $min) = split ?[-/:]? , $1;
                        ($hour,$min) = (23,59) unless ($hour && $min);
                        $Stamp1 = timelocal(0, $min, $hour, $day, $mon - 1, $year + 100);
                        next unless ($Fields[0] > $Stamp1);
                    }
                    elsif ($Opt{t} =~ /^\-(.*)/) {
                        ($mon, $day, $year, $hour, $min) = split ?[-/:]? , $1;
                        ($hour,$min) = (00,00) unless ($hour && $min);
                        $Stamp1 = timelocal(0, $min, $hour, $day, $mon - 1, $year + 100);
                        next unless ($Fields[0] < $Stamp1);
                    }
                    elsif ($Opt{t} =~ /^\=(.*)/) {
                        ($mon, $day, $year, $hour, $min) = split ?[-/:]? , $1;
                        ($hour,$min) = (00,00) unless ($hour && $min);
                        $Stamp1 = timelocal(0, $min, $hour, $day, $mon - 1, $year + 100);
                        ($hour,$min) = (23,59) unless ($hour && $min);
                        $Stamp2 = timelocal(0, $min, $hour, $day, $mon - 1, $year + 100);
                        next unless ($Fields[0] >= $Stamp1 && $Fields[0] <= $Stamp2);
                    }
                    elsif ($Opt{t} =~ /^(.*)\+\-(.*)/) {
                        ($mon, $day, $year, $hour, $min) = split ?[-/:]? , $1;
                        ($hour,$min) = (00,00) unless ($hour && $min);
                        $Stamp1 = timelocal(0, $min, $hour, $day, $mon - 1, $year + 100);
                        ($mon, $day, $year, $hour, $min) = split ?[-/:]? , $2;
                        ($hour,$min) = (23,59) unless ($hour && $min);
                        $Stamp2 = timelocal(0, $min, $hour, $day, $mon - 1, $year + 100);
                        next unless ($Fields[0] >= $Stamp1 && $Fields[0] <= $Stamp2);
                    }
                }
                if ($Opt{s}) {
                    $Opt{s} =~ s/\s+//;
                    if ($Opt{s} =~ /^\+(.*)/) {
                        next unless ($Fields[2] > $1);
                    }
                    elsif ($Opt{s} =~ /^\-(.*)/) {
                        next unless ($Fields[2] < $1);
                    }
                    elsif ($Opt{s} =~ /^\=(.*)/) {
                        next unless ($Fields[2] == $1);
                    }
                    elsif ($Opt{s} =~ /^(.*)\+\-(.*)/) {
                        next unless ($Fields[2] >= $1 && $Fields[2] <= $2);
                    }
                }
                if ($File =~ /^.*\/(.*)\/trap_(.*)\.log/) {
                    my $Conn = $1;
                    my $Dir = $2;
                    my $Time = strftime("[%x-%X]",localtime($Fields[0]));
                    print "$Time $Conn $Dir $Fields[3] $Fields[1] $Fields[2]\n";
                }
            }
        }
    }
}
It allows the option of looking at just one log or, if no option is specified, all the logs at once, as well as filtering for only specific information.
PROBLEM
My co-workers would like to have the logs interleaved (sorted chronologically) when they are displaying all the logs at once. This seems incredibly easy, since the first column is a timestamp and would automatically sort chronologically. The problem is that there are two columns in the output (see below) that are dynamically generated based on the filename/path.
RAW:
1044007259 do15505x 467 PaulRidge
1044022188 do15667s 876 Tom-Snow
1044029052 do15854j 3228 BCorcoran
FORMATTED:
[01/31/03-11:41:28] DIR1 out MarkLester doqnh6y5 10300
[01/31/03-16:28:20] DIR1 out BrianSmith doavr564 8353
[01/31/03-16:38:12] DIR1 out MarkLester doavr5g4 9663
[01/30/03-23:02:08] DIR2 out PaulRidge do15347q 2394
[01/30/03-23:02:08] DIR2 out PaulRidge do15347t 492
Note: The raw and formatted entries are samples and do not represent the same data.
Since my code currently reads each file in one at a time, it is able to dynamically generate these two columns. To sort all the results chronologically means that I would have to read them all in first (they could get quite large), perform the sort, and print out the output. I thought of the following alternative options:
Change the first script to interpolate the variables and include them in the log (NOT desirable).
Read each log in, parse and format it, append to a single temporary file, read that file back in using a pipe as a filehandle (open(LOGS, "sort <file> |");), display the results, and then remove the temporary file. A sketch of this appears after the list.
Tell my co-workers just to pipe (|) the output to sort ("what, you mean Perl can't do something?", as they are always quick to say)
Tie the files to a hash, merge, sort, display, untie (seems too busy and inefficient, but I could be wrong)
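A rough, untested sketch of what option 2 might look like (the temporary file path is made up; the field order comes from the RAW sample above, and the raw epoch stamp is kept as the first column so a plain numeric sort stays chronological):
use POSIX qw(strftime);

my $tmp = "/tmp/trap_merge.$$";
open(TMP, ">$tmp") or die "Cannot write $tmp: $!\n";
foreach my $File (@Logs) {
    my ($Conn, $Dir) = $File =~ /^.*\/(.*)\/trap_(.*)\.log/;
    open(LOG, $File) or next;
    while (my $Line = <LOG>) {
        chomp $Line;
        my ($Stamp, $TrapFile, $Size, $Name) = split " ", $Line;
        # keep the epoch stamp first so "sort -n" keeps things chronological
        print TMP "$Stamp $Conn $Dir $Name $TrapFile $Size\n";
    }
    close(LOG);
}
close(TMP);

open(SORTED, "sort -n $tmp |") or die "Cannot run sort: $!\n";
while (my $Line = <SORTED>) {
    chomp $Line;
    my ($Stamp, @Rest) = split " ", $Line;
    print strftime("[%x-%X]", localtime($Stamp)), " @Rest\n";
}
close(SORTED);
unlink $tmp;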
What I would like to do is open (LOGS,"sort @Logs |");, but then I would lose the filename/path and wouldn't be able to generate those two columns.
Any advice (other than about the regular expressions I use to parse my data)?
Thanks in advance, L~R
Re: Log parsing by timestamp dilema
by adrianh (Chancellor) on Feb 01, 2003 at 11:33 UTC
To sort all the results chronologically means that I would have to read them all in first (they could get quite large), perform the sort, and print out the output. I thought of the following alternative options:
Assuming each log file is in chronological order, wouldn't it be easier to go through them in parallel, writing out the entries in order as you go? Something like this:
use IO::File;

# MAKE A FILEHANDLE FOR EACH FILE WE WERE GIVEN
my @files = map { IO::File->new($_) or die "could not open $_\n" } @ARGV;
my $num_logs = @files;

# READ IN A LINE FOR EACH FILE
my @lines = map { scalar(<$_>) } @files;

# GET THE DATES FOR EACH LINE
# (get_time() is assumed to pull the epoch stamp out of a log line)
my @dates = map { get_time($_) } @lines;

my $MAX = <some date bigger than anything in the logs>;
my $found;
do {
    # FIND THE LINE WITH THE EARLIEST DATE
    my $min = $MAX;
    $found = undef;
    for (my $i = 0; $i < $num_logs; $i++) {
        next unless defined $lines[$i];    # this log is exhausted
        my $num = $dates[$i];
        if ($num < $min) {
            $found = $i;
            $min   = $num;
        }
    }
    if (defined($found)) {
        # IF WE FOUND A LINE, SHOW IT AND READ THE NEXT
        # LINE IN FOR THAT LOG FILE
        print $lines[$found];
        my $io = $files[$found];
        $lines[$found] = <$io>;
        $dates[$found] = get_time($lines[$found]) if defined $lines[$found];
    }
} while (defined($found));
I did something similar to this to merge multiple apache access logs together a few years back.
I am not sure I understand how your code processes the log files in parallel. The following logic came to mind as I read your post, which may be what your code snippet does - correct me if I am wrong.
- Open up all files and read the first line
- Determine which line/file had the earliest stamp
- Print that line, while keeping the other lines in the array
- Read the next line from the matched file back into the array
- Repeat step 2
This appears to be good logic, even though I can't discern how this works from your code. Of course, in order for this to work for me, I would have to add a lot more code since I need to include the file name/path in each array - because we know that you can't get a filename/path from a filehandle. For this reason, I would probably use a hash. If this is not what you meant and I am completely off base - please let me know.
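Something like this rough, untested sketch is what I mean by carrying the extra information alongside each handle (the field names are made up):
use IO::File;

my @Merge;
foreach my $Path (@Logs) {
    my $fh = IO::File->new($Path) or die "could not open $Path\n";
    my $Line = <$fh>;
    push @Merge, {
        path  => $Path,
        fh    => $fh,
        line  => $Line,
        stamp => defined $Line ? (split " ", $Line)[0] : undef,
    };
}
# The merge loop would then pick the element with the smallest defined
# stamp, print its line (re-deriving connector/direction from path),
# and refill that element's line/stamp from its fh.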
I was thinking that I was going to have to resort (pun intended) to telling them to | sort.
Cheers - L~R
UPDATE: - edited to make logic clear
Hi.
Sorry, I couldn't resist rewriting your code. :-) The problem "got at me". It uses adrianh's solution, but translating it into your script, you would end up with something like the rewrite below.
First, I removed the whole while loop (around lines 86-90); all the code in between was cut out and saved for later. A lot of the repeated code was moved into subroutines. I have tested it as best I can, and it works for me. I took advantage of the fact that you had already done the work of finding the files, which were stored in @Logs; this was used instead of @ARGV. I tried not to impose my coding style on the script, but it has been run through PerlTidy, which may have moved things around a bit.
The other main change was the way of handling specified date ranges. Whilst the 'if' logic remains, I generalised it into a subroutine, and made use of a new %Range hash to store the 'begin' and 'end' dates (which are updated if the '-t' option is specified). By defaulting appropriately, this allows the code to check for dates being in the range the user wants in just one line of code. Also, this means that the complicated regexes that parse the command line arguments are only performed once, rather than for every line of every file.
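The actual rewrite is not reproduced here, but the %Range idea boils down to something like this minimal sketch (the default bounds and the ParseTimeOpt() helper are illustrative only, not the real code):
# Illustrative only - not the actual rewrite.
# Default to "accept everything"; the -t parsing (done once, up front)
# would overwrite these bounds.
my %Range = (
    begin => 0,          # earliest acceptable epoch stamp
    end   => 2**31 - 1,  # latest acceptable epoch stamp
);
ParseTimeOpt(\%Range, $Opt{t}) if $Opt{t};    # hypothetical helper

# Then, for every log line, the date test collapses to one line:
next LINE unless $Fields[0] >= $Range{begin} && $Fields[0] <= $Range{end};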
Summary of logic spot on. Sorry my code wasn't clear enough :-)
However, as tall_man pointed out, File::MergeSort (which I have somehow managed to miss) does exactly the same thing - and is nicely encapsulated. So I'd use that instead if it were me :-)
Re: Log parsing by timestamp dilema
by tall_man (Parson) on Feb 01, 2003 at 15:36 UTC
What you are describing sounds like a merge sort, and there is a module File::MergeSort designed for that. It takes a list of files at once along with a key extraction subroutine.
I'm sure there's a straightforward way to implement the getFilename operation, if only by tinkering with the File::MergeSort code.
# Mostly extracted from the documentation. I have not tried it.
use File::MergeSort;

my $sort = new File::MergeSort(
    \@file_list,
    \&index_extract_function
);

my $line;
while (defined($line = $sort->next_line)) {
    my $fname = getFilename();
    print "$line/$fname\n";
}
I have taken a quick look through the File::MergeSort code. Here are my suggestions about how to get the file names.
In the constructor, there is a loop that opens the files and saves the file handles. I would add a line to save the file names too:
$self->{files}->[$n]->{fh} = $fh;
$self->{files}->[$n]->{name} = $file;
In next_line, capture the value of $self->{sorted}->[0]->{name} when you read from the handle, and return an array with both the line and the filename.
That should do it.
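For illustration, if next_line were patched that way, the caller from the snippet above might become something like this (assumed interface, untested):
while ( my ($line, $fname) = $sort->next_line ) {
    last unless defined $line;   # end of input
    chomp $line;
    print "$line/$fname\n";
}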
Re: Log parsing by timestamp dilema
by pg (Canon) on Feb 01, 2003 at 18:00 UTC
- As tall_man said, this is a typical merge sort situation. You can try the module tall_man pointed to.
- The other approach is to do the merge sort yourself. First, sort each file that needs to be merged - it sounds like you already have a program for this. Then have your own merge program do the following (say you are trying to merge n files):
  1. open all n files you want to merge
  2. open an output file for the merged result
  3. read one line from each of those n files
  4. compare those lines, and write out the smallest one
  5. if a line was written out in step 4, read the next line from that file; at eof, close that file
  6. go back to step 4 unless all files are closed, i.e. finished (there is a subtlety here: make sure that after all files are closed, you still write out any leftover lines that have already been read in)
  This algorithm holds at most n lines in memory at a time, so memory usage stays small.
- Another approach is to have a single script running as a log server, with all your monitors sending their log entries to that server through UDP; that way the log messages are in order from the beginning. I used this approach in a real telecommunication system, and speed was not a problem. You do have to spend some time on the log server, though, carefully laying out things like block writing and so on.
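For what it's worth, the receiving end of such a log server could be as small as this sketch (the port number and the merged log path are made up):
use strict;
use IO::Socket::INET;

# Listen for log lines sent as UDP datagrams by the monitor scripts.
my $server = IO::Socket::INET->new(
    LocalPort => 5140,
    Proto     => 'udp',
) or die "Cannot create UDP socket: $!\n";

open(MERGED, ">>/var/spool/wt400/log/merged.log")
    or die "Cannot open merged log: $!\n";
select((select(MERGED), $| = 1)[0]);    # flush every line immediately

while (1) {
    my $msg;
    next unless defined $server->recv($msg, 4096);
    $msg =~ s/\s+$//;
    print MERGED "$msg\n";    # arrival order approximates chronological order
}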
pg,
Thanks!
I am going to see if I can't get tall_man's suggestion to work. The only obvious problem is figuring out how to hack File::MergeSort to give me the file name/path. I will compare its speed to that of DaveH's integration of adrianh's approach, which added a lot of overhead compared to the original program - though that isn't really a big deal, since the files are being read after they are written.
Your option 2 looks identical to my reply to adrianh - I am also going to try my own version of this to see if I can make it any faster.
Your option 3 is interesting, but isn't viable; I am looking for a quick turnaround. I may consider it in a future revision (I really need to start working on other projects).
Cheers - L~R
Re: Log parsing by timestamp dilema
by waswas-fng (Curate) on Feb 01, 2003 at 17:49 UTC
Are the directories you are running the original script on static? If so, maybe you could generate an index number for each and include it in the log output. Also, have you benchmarked the difference between including the directory/direction in the log output and leaving it out? I would not think it would be that much less efficient than writing out the log without it.
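For example, a quick comparison with the core Benchmark module (hypothetical values, printing to /dev/null just to time the print itself):
use Benchmark qw(cmpthese);

open(NULL, ">/dev/null") or die $!;
my ($stamp, $file, $size, $name) = (1044007259, 'do15505x', 467, 'PaulRidge');
my ($conn,  $dir) = ('DIR1', 'out');

cmpthese(-2, {
    without_dir => sub { print NULL "$stamp $file $size $name\n" },
    with_dir    => sub { print NULL "$stamp $conn $dir $file $size $name\n" },
});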
-Waswas