Re: print log file
by BerntB (Deacon) on Nov 30, 2006 at 06:30 UTC
I assume that your real problem is that you want to traverse a large log file fast? There is certainly some CPAN module that does this, and someone will post about it and make this look primitive. :-)
You could try a binary search with seek() to look for the general area to start looking. Something like:
# Find a good place to start traversing a large
# file of sorted data. Assumes FILE is already open and
# that $filelen and $threshold are set.
sub find_start {
    return 0 if $filelen < 2000000; # A few MB? Please... :-)
    my $safe_offset = 0;
    my $jump = 0.5;
    my $step = 0.25;
    for (1..8) {
        # See if the log is still before the threshold at this offset
        my $testoff = int($filelen * $jump);
        seek(FILE, $testoff, 0);
        if ( do_test(*FILE, $threshold) ) {
            $safe_offset = $testoff; # Latest offset known to be safe
            $jump = $jump + $step;
        } else {
            $jump = $jump - $step;
        }
        $step = $step / 2.0;
    }
    # Go to selected place:
    seek(FILE, $safe_offset, 0);
    <FILE> if $safe_offset > 0; # Skip partial line
    return $safe_offset;
}
sub do_test {
    my($fh, $compare) = @_;
    # Has moved somewhere in file. Skip partial line of log:
    <$fh>;
    my $log = <$fh>;
    return 0 unless defined $log; # Ran off the end of the file
    # You write this (you know the format). Return true if the
    # line is still before the threshold:
    return date_test($log, $compare);
}
Just an idea, ignore if my assumption about your problem was wrong. Code is untested since I'm busy. I'll be back in about a work day and can write more then, if you need more details.
I hope this won't embarrass me when I get back. (-: On the other hand -- I'll make someone's day when they get to point out errors. :-)
Update: Added return stuff, so it sets up offset to file correctly.
Since there is always more than one way to do it, here would be my guess at traversing the file faster:
my $jmp_distance = 100;
my $threshold = $FOO;   # your threshold timestamp
my $pos = tell(FILE);   # byte offset we can safely rewind to
my $line = <FILE>;
my ($tstamp, $rest) = split(/\s+/, $line, 2);
# skipping through the file by jumping over $jmp_distance lines
# (assigning to $. only changes the line counter and does not move
# the file position, so the lines are read and discarded instead)
while (defined $line and $tstamp <= $threshold) {
    $pos = tell(FILE);
    for (1 .. $jmp_distance) {
        last unless defined(my $skipped = <FILE>);
    }
    $line = <FILE>;
    ($tstamp, $rest) = split(/\s+/, $line, 2) if defined $line;
}
# since we overshot with the last jump
seek(FILE, $pos, 0);
while (<FILE>) {
    ($tstamp, $rest) = split(/\s+/, $_, 2);
    print if $tstamp > $threshold;
}
Of course you can modify this further. For example, you could put the whole skipping loop in a sub and call it recursively with shorter jump distances. That way you could start with a distance of 10000 and then go to shorter distances like 1000, 100, 10 and 1 (1 being the condition that ends the recursion) by calling the sub with jmp_distance/10.
-- Terry Pratchett, "Reaper Man"
Wow, code of great wisdom! Although it isn't for my problem. When will you be back? I can't wait to read more!
Hrm, I'll take that for humorous irony and not sarcasm. :-)
Hmm, first: seek is byte-oriented, not line-oriented. As far as I understood it, he needs one line at a time. This isn't much of a problem when you have a file with fixed-length lines, but since we are talking about a log file this is doubtful (although possible). So while seek is probably the fastest way to move around a file, I'm afraid you can't use it here.
That is why the do_test routine skipped the first line it read. It said:
# Has moved somewhere in file. Skip partial line of log:
This just attempts to make the log traversal smaller, not perfect. (Making it exact wouldn't be so hard either, of course. But I'm out the door in a few minutes. :-)
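To make the "skip the partial line" point concrete, here is a minimal sketch. The sample data and the name first_full_line_after are made up for illustration, and an in-memory filehandle stands in for the real log:

```perl
use strict;
use warnings;

# Hypothetical sample data with variable-length lines, as in a real log.
my $log = join '', map { "$_\n" }
    '100 short',
    '200 a somewhat longer message',
    '300 mid',
    '400 the last line';

open my $fh, '<', \$log or die "open: $!";

# Seek to an arbitrary byte offset, throw away whatever is left of the
# line we landed in, and return the next complete line (undef at EOF).
sub first_full_line_after {
    my ($fh, $offset) = @_;
    seek($fh, $offset, 0) or die "seek: $!";
    if ($offset > 0) {
        my $partial = <$fh>;   # discard the (probably) partial line
    }
    return scalar <$fh>;
}

# Offset 15 lands inside the second line, so the third line comes back.
print first_full_line_after($fh, 15);   # prints "300 mid"
```

If the offset happens to fall exactly on a line start, one whole line gets discarded instead of a partial one. For narrowing down where to start reading that costs nothing, since the search only needs a line boundary somewhere near the offset.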
Re: print log file
by BUU (Prior) on Nov 30, 2006 at 04:12 UTC
print if( (timestamp > thresh) .. 0 );
As BUU said: use the flip-flop operator. It takes the form expression1 .. expression2 and evaluates as false until expression1 is true, then stays true until expression2 is true.
So basically what BUU does is set expression1 to timestamp > threshold, so that it is met by your criterion, and expression2 to 0. A constant operand of the flip-flop is compared against the current line number $., and no line is numbered 0, so expression2 is never met. So once expression1 is met for the first time, the if statement keeps returning true until the whole file is processed.
If you want to know more about the flip-flop operator (and why it's easy to confuse with the range operator), I recommend GrandFather's excellent Flipin good, or a total flop?
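A small sketch of that latching behaviour, using made-up sample lines (an integer timestamp followed by a message) and an in-memory filehandle. The out-of-order "straggler" line is only there to show that the condition stays true once it has fired:

```perl
use strict;
use warnings;

# Hypothetical sample data; line 4 has a timestamp below the threshold.
my $data = "1 llama\n2 alpaca\n3 camel\n1 straggler\n4 badger\n";
my $threshold = 2;

open my $fh, '<', \$data or die "open: $!";
my @kept;
while (my $line = <$fh>) {
    my ($stamp) = $line =~ /^(\d+)/;
    next unless defined $stamp;
    # False until the left side is true once; the constant 0 on the
    # right is compared with $. and never matches, so it stays true.
    push @kept, $line if ($stamp > $threshold) .. 0;
}
close $fh;

print @kept;   # prints the "3 camel", "1 straggler" and "4 badger" lines
```

One thing to keep in mind: each flip-flop keeps its own state for the life of the program, so if this loop sits in a sub that gets called a second time, the operator is still "on" from the first call.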
thank you!
Below is the benchmark (sub a is my code):
Benchmark: timing 200 iterations of a, b...
a: 26 wallclock secs (25.38 usr + 0.23 sys = 25.62 CPU) @ 7.81/s (n=200)
b: 24 wallclock secs (23.77 usr + 0.22 sys = 23.98 CPU) @ 8.34/s (n=200)
    Rate    a    b
a 7.81/s   --  -6%
b 8.34/s   7%   --
Re: print log file
by fenLisesi (Priest) on Nov 30, 2006 at 07:00 UTC
use strict;
use warnings;
my $THRESHOLD = 2.5;
# Scan until the first line past the threshold
while (my $line = <DATA>) {
    if ($line =~ /^(\d+)/) {
        my $stamp = $1;
        if ($stamp > $THRESHOLD) {
            process_line( $line );
            last;
        }
    }
    else {
        warn "Bad line: $line";
    }
}
# Everything after that needs no further checking
while (my $line = <DATA>) {
    process_line( $line );
}
sub process_line {
    print $_[0];
}
__DATA__
1 llama
2 alpaca
3 camel
4 badger
This prints:
3 camel
4 badger
HTH.
Re: print log file
by johngg (Canon) on Nov 30, 2006 at 10:56 UTC
Perhaps you could use Tie::File so that you can access the log file as if it were an array. You could then quickly find the index in the array where the timestamp passes the threshold by using a binary chop. Look half-way along the array and check the timestamp. If it is less than the threshold, look again half-way along the upper half, else half-way along the lower half. Repeat, each time chopping the number of elements to be searched in half. Once you have found the element you can print from there to the end of the array. Just a thought.
Cheers, JohnGG
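The chop described above could be sketched roughly like this on a plain array. The name first_index_at_or_after is made up, and the sketch assumes every line starts with a fixed-width timestamp that sorts correctly as a string. An array tied with Tie::File should be usable in place of @log:

```perl
use strict;
use warnings;

# Returns the index of the first line whose leading timestamp is not
# less than $threshold, or scalar(@$lines) if every line is earlier.
sub first_index_at_or_after {
    my ($lines, $threshold) = @_;
    my ($lo, $hi) = (0, scalar @$lines);   # the answer lies in $lo .. $hi
    while ($lo < $hi) {
        my $mid = $lo + int(($hi - $lo) / 2);
        my ($stamp) = $lines->[$mid] =~ /^(\S+)/;
        if ($stamp lt $threshold) { $lo = $mid + 1 }   # look in upper half
        else                      { $hi = $mid }       # look in lower half
    }
    return $lo;
}

# Sample lines in an assumed "YYYY-MM-DD.hh:mm:ss message" format.
my @log = (
    '2006-06-19.11:47:20 Chickens have got into the server',
    '2006-06-19.11:47:27 This message is intentionally blank',
    '2006-06-19.11:47:47 The disk drive just wants you to know it is fine',
);

my $idx = first_index_at_or_after(\@log, '2006-06-19.11:47:25');
print "$log[$_]\n" for $idx .. $#log;   # prints the last two lines
```

Each pass halves the range, so a file of about 47,000 lines needs around 16 timestamp checks.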
It must have been 25 years ago that I was told about this technique, and the name given then was binary chop, I guess because you successively chop the range to be searched in half. Perhaps the name depends on which side of the Atlantic you live. Strangely enough, until today, when I coded one to test if my suggestion worked (it did), I had never had occasion to use one.
Cheers, JohnGG
Re: print log file
by johngg (Canon) on Dec 01, 2006 at 11:49 UTC
I have written some code to test whether Tie::File and a binary (?:chop|search) as suggested here would actually work. I wrote a script to generate a fictional log file of 47,000+ lines (about 2.5MB). DateMunge::dateStr is something I wrote years ago before I had access to CPAN or knew about POSIX::strftime. I then wrote the code to test the solution. It seems to work fairly well, taking about 5 seconds to find the correct line in the log when run on a SPARC Ultra 30 with a 300MHz CPU. Currently the threshold is hard-coded in the script but it could easily be changed to a command-line argument. Here's the code:
use strict;
use warnings;
use Tie::File;
use Fcntl q{O_RDONLY};

my $logFile = q{spw586856.log};
tie my @logLines, q{Tie::File}, $logFile,
    mode => O_RDONLY,
    autochomp => 0,
    or die qq{tie: $logFile: $!\n};
my $threshhold = q{2006-06-19.11:47:25};
my $threshholdIdx = -1;
my $firstIdx = 0;
my $lastIdx = $#logLines;
if ($threshhold lt getDate($logLines[0])) {
    die qq{Threshhold date before range in $logFile\n};
}
elsif ($threshhold gt getDate($logLines[-1])) {
    die qq{Threshhold date after range in $logFile\n};
}

BIN_CHOP:
while (1) {
    if ($threshhold eq getDate($logLines[$firstIdx])) {
        $threshholdIdx = $firstIdx;
        last BIN_CHOP;
    }
    my $idxDiff = $lastIdx - $firstIdx;
    if ($idxDiff < 2) {
        # Range cannot be chopped further, first line past threshhold
        $threshholdIdx = $lastIdx;
        last BIN_CHOP;
    }
    my $midIdx = $firstIdx + int($idxDiff / 2);
    if ($threshhold eq getDate($logLines[$midIdx])) {
        # Step back to the first line carrying this timestamp
        $midIdx --
            while $midIdx > 0
                and $threshhold eq getDate($logLines[$midIdx - 1]);
        $threshholdIdx = $midIdx;
        last BIN_CHOP;
    }
    if ($threshhold lt getDate($logLines[$midIdx])) {
        $lastIdx = $midIdx;
        next BIN_CHOP;
    }
    if ($threshhold gt getDate($logLines[$midIdx])) {
        $firstIdx = $midIdx;
        next BIN_CHOP;
    }
    die qq{Internal error, how did we get here?\n};
}
die qq{Binary chop did not find threshhold\n}
    if $threshholdIdx == -1;

print
    qq{Threshhold : $threshhold\n},
    qq{Line No. : @{[$threshholdIdx + 1]}\n},
    qq{Log msg. : $logLines[$threshholdIdx]},
    qq{Prev. msg. : $logLines[$threshholdIdx - 1]\n};
print
    qq{Lines from threshhold onwards\n\n},
    @logLines[$threshholdIdx .. $#logLines];

# Return the leading timestamp (first whitespace-delimited field)
sub getDate {
    my $line = shift;
    my ($date) = $line =~ m{^(\S+)};
    return $date;
}
and here's some output
$ ls -l spw586856*
-rwxr-xr-x 1 jgillman og5a 1895 Dec 1 11:11 spw586856
-rw-r--r-- 1 jgillman og5a 2507626 Dec 1 10:58 spw586856.log
-rwxr-xr-x 1 jgillman og5a 696 Dec 1 10:56 spw586856makeDat
$ wc spw586856.log
47712 332470 2507626 spw586856.log
$ head -10 spw586856.log
2006-06-11.05:26:40 Random message
2006-06-11.05:27:13 This message is intentionally blank
2006-06-11.05:28:00 Chickens have got into the server
2006-06-11.05:28:33 The lunatics have taken over the asylum
2006-06-11.05:29:16 Chickens have got into the server
2006-06-11.05:29:53 The lunatics have taken over the asylum
2006-06-11.05:30:34 The disk drive just wants you to know it is fine
2006-06-11.05:30:52 The lunatics have taken over the asylum
2006-06-11.05:31:32 All your data is gone
2006-06-11.05:31:52 All your data is gone
$ tail -10 spw586856.log
2006-06-19.11:51:40 Chickens have got into the server
2006-06-19.11:51:56 Random message
2006-06-19.11:52:44 Random message
2006-06-19.11:52:50 All your data is gone
2006-06-19.11:52:58 This message is intentionally blank
2006-06-19.11:53:44 The lunatics have taken over the asylum
2006-06-19.11:54:07 This message is intentionally blank
2006-06-19.11:54:35 The disk drive just wants you to know it is fine
2006-06-19.11:54:51 Chickens have got into the server
2006-06-19.11:55:33 Chickens have got into the server
$ time spw586856
Threshhold : 2006-06-19.11:47:25
Line No. : 47694
Log msg. : 2006-06-19.11:47:27 This message is intentionally blank
Prev. msg. : 2006-06-19.11:47:20 Chickens have got into the server
Lines from threshhold onwards
2006-06-19.11:47:27 This message is intentionally blank
2006-06-19.11:47:47 The disk drive just wants you to know it is fine
2006-06-19.11:48:36 The lunatics have taken over the asylum
2006-06-19.11:49:12 All your data is gone
2006-06-19.11:49:28 This message is intentionally blank
2006-06-19.11:50:01 Chickens have got into the server
2006-06-19.11:50:10 This message is intentionally blank
2006-06-19.11:50:38 Chickens have got into the server
2006-06-19.11:51:24 Random message
2006-06-19.11:51:40 Chickens have got into the server
2006-06-19.11:51:56 Random message
2006-06-19.11:52:44 Random message
2006-06-19.11:52:50 All your data is gone
2006-06-19.11:52:58 This message is intentionally blank
2006-06-19.11:53:44 The lunatics have taken over the asylum
2006-06-19.11:54:07 This message is intentionally blank
2006-06-19.11:54:35 The disk drive just wants you to know it is fine
2006-06-19.11:54:51 Chickens have got into the server
2006-06-19.11:55:33 Chickens have got into the server
real 0m5.28s
user 0m4.86s
sys 0m0.34s
I chose a threshold near the end of the file to demonstrate the printing of all lines without scrolling to death, but the search seems to work on all thresholds that I tested. I hope this is of interest. Cheers, JohnGG