PerlMonks  

print log file

by xiaoyafeng (Deacon)
on Nov 30, 2006 at 04:08 UTC ( [id://586856]=perlquestion )

xiaoyafeng has asked for the wisdom of the Perl Monks concerning the following question:

hi gurus,

I suspect this is a simple question, but I can't find a better way. I have a log file like this:

    timestamp description
    timestamp description
    .........

Now I need to print every description after a certain timestamp. Here is my code:

    .....
    while (<>) {
        print if timestamp > threshold;
    }

My question: I don't want to compare the timestamp on every iteration, because the log file is sorted by timestamp. That is, once one timestamp is greater than the threshold, I want to print all the remaining lines with no further comparison. How can I do that?


Any reply would be appreciated!



UPDATE:
Thanks for helping me see my stupidity! In fact, my question is:
how can I traverse a large sorted (indexed) file faster?
According to your replies, I have three ways to do it:
1. flip-flop operator
2. binary seek
3. Tie::File (array)
Which one is best? I think only the Benchmark module can answer that. :-)
Thanks again!

Replies are listed 'Best First'.
Re: print log file
by BUU (Prior) on Nov 30, 2006 at 04:12 UTC
    print if( (timestamp > thresh) .. 0 );
      As BUU said: use the flip-flop operator. It takes the form expression1 .. expression2 and evaluates as false until expression1 becomes true; it then stays true until expression2 becomes true.

      So basically BUU sets expression1 to timestamp > threshold, which is your criterion, and expression2 to 0. A constant operand of the flip-flop is compared with the current input line number ($.), which is never 0 while a file is being read, so expression2 never becomes true. After expression1 is met for the first time, the if statement therefore keeps returning true until the whole file is processed.
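To make the flip-flop behaviour concrete, here is a minimal sketch. It assumes each log line starts with a numeric timestamp; the sample data, the in-memory filehandle and the past_threshold helper are made up for illustration and are not from the original post. The counter is there to show that the comparison stops being evaluated once the operator has flipped to true.

```perl
use strict;
use warnings;

# Assumed sample data: one "timestamp description" entry per line.
my $log       = "1 llama\n2 alpaca\n3 camel\n4 badger\n";
my $threshold = 2.5;

my $comparisons = 0;    # how often the timestamp test actually ran
my @printed;            # lines selected by the flip-flop

# Hypothetical helper: true if the line's timestamp is past the threshold.
sub past_threshold {
    my ($line) = @_;
    $comparisons++;
    my ($stamp) = $line =~ /^(\d+)/;
    return defined $stamp && $stamp > $threshold;
}

open my $fh, '<', \$log or die "Cannot open in-memory log: $!";
while (my $line = <$fh>) {
    # False until past_threshold() first returns true; after that the left
    # operand is no longer evaluated. The constant 0 is compared with $.,
    # which is never 0 here, so the operator never flips back.
    if (past_threshold($line) .. 0) {
        push @printed, $line;
        print $line;
    }
}
close $fh;

print "comparisons made: $comparisons\n";
```

With this data the comparison should run only for the first three lines. Each .. keeps its own state, even across calls to a sub that contains it, so code that scans more than one file would want a right-hand operand that eventually becomes true, such as eof.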

      If you want to know more about the flip-flop operator (and why it's easy to confuse with the range operator), I recommend GrandFather's excellent Flipin good, or a total flop?

      --
      "WHAT CAN THE HARVEST HOPE FOR IF NOT THE CARE OF THE REAPER MAN"
      -- Terry Pratchett, "Reaper Man"

      Thank you! Below is the benchmark (sub a is my code):

      Benchmark: timing 200 iterations of a, b...
               a: 26 wallclock secs (25.38 usr +  0.23 sys = 25.62 CPU) @  7.81/s (n=200)
               b: 24 wallclock secs (23.77 usr +  0.22 sys = 23.98 CPU) @  8.34/s (n=200)
          Rate    a    b
      a 7.81/s   --  -6%
      b 8.34/s   7%   --
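The benchmarked subs were not posted, so the following is only a guess at how such a comparison could be set up with the core Benchmark module. The data, the sub names and the stamp_of helper are assumptions. The flip-flop version uses eof as its right-hand operand so that the operator resets at the end of each pass; with a constant 0 it would stay true on every later call of the sub.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: 1000 lines, each starting with an integer timestamp.
my $log       = join '', map { "$_ message number $_\n" } 1 .. 1000;
my $threshold = 500;

# Hypothetical helper: extract the leading timestamp of a line.
sub stamp_of {
    my ($line) = @_;
    my ($stamp) = $line =~ /^(\d+)/;
    return $stamp;
}

# a: compare the timestamp on every line.
sub compare_every_line {
    open my $fh, '<', \$log or die "open: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if stamp_of($line) > $threshold;
    }
    return $count;
}

# b: flip-flop; the comparison is skipped once it has succeeded, and
# eof($fh) switches the operator off again on the last line.
sub flip_flop {
    open my $fh, '<', \$log or die "open: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if (stamp_of($line) > $threshold) .. eof($fh);
    }
    return $count;
}

# Run each sub 500 times and print a comparison table.
cmpthese(500, { a => \&compare_every_line, b => \&flip_flop });
```

The timings will depend on the machine and on where the threshold falls in the file, so the figures quoted above should not be expected to reproduce exactly.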
Re: print log file
by BerntB (Deacon) on Nov 30, 2006 at 06:30 UTC
    I assume that your real problem is that you want to traverse a large log file fast? There is certainly some CPAN module that does this -- and someone will post about it and make this look primitive. :-)

    You could try a binary search with seek() to look for the general area to start looking. Something like:

    # Find a good place to start traversing a large
    # file of sorted data. Assumes FILE is already open.
    sub find_start {
        my ($filelen, $threshold) = @_;

        return 0 if $filelen < 2000000;    # A few MB? Please... :-)

        my $safe_offset = 0;
        my $jump        = 0.5;
        my $step        = 0.25;

        for (1 .. 8) {
            # Test the log line found at this fraction of the file
            my $testoff = int($filelen * $jump);
            seek(FILE, $testoff, 0);
            if (do_test(\*FILE, $threshold)) {
                $safe_offset = $testoff;    # OK to start from here
                $jump += $step;
            }
            else {
                $jump -= $step;
            }
            $step /= 2;
        }

        # Go to selected place and skip the partial line there:
        seek(FILE, $safe_offset, 0);
        <FILE> if $safe_offset > 0;
        return $safe_offset;
    }

    sub do_test {
        my ($fh, $compare) = @_;

        # Has moved somewhere in file. Skip partial line of log:
        <$fh>;
        my $log = <$fh>;

        # You write this (you know the format). Return true if OK:
        return date_test($log, $compare);
    }

    Just an idea, ignore if my assumption about your problem was wrong. Code is untested since I'm busy. I'll be back in about a work day and can write more then, if you need more details.

    I hope this won't embarrass me when I get back. (-: On the other hand -- I'll make someone's day when they get to point out errors. :-)

    Update: Added return stuff, so it sets up offset to file correctly.

      Since there is always more than one way to do it, here is my guess at how to traverse the file faster:

      my $jmp_distance = 100;
      my $treshhold    = $FOO;
      my ($tstamp, @skipped);

      # Skip through the file $jmp_distance lines at a time, testing only
      # the last line of each jump. The lines still have to be read:
      # assigning to $. changes the line counter, not the file position.
      do {
          @skipped = ();
          while (@skipped < $jmp_distance and defined(my $line = <FILE>)) {
              push @skipped, $line;
          }
          ($tstamp) = @skipped ? split(/\s+/, $skipped[-1], 2) : ();
      } while (@skipped and $tstamp <= $treshhold);

      # Since we overshot with the last jump, check the lines we jumped over
      for (@skipped) {
          ($tstamp) = split(/\s+/, $_, 2);
          print if $tstamp > $treshhold;
      }

      # Everything after that is past the threshold
      print while <FILE>;

      Of course you can modify this further, for example by putting the whole do-while loop in a sub and calling it recursively with shorter jump distances. That way you could start with a distance of 10000 and then go to shorter distances like 1000, 100, 10 and 1 (1 being the condition that ends the recursion) if you call the sub with jmp_distance/10.

      --
      "WHAT CAN THE HARVEST HOPE FOR IF NOT THE CARE OF THE REAPER MAN"
      -- Terry Pratchett, "Reaper Man"

      Wow, code of great wisdom, although it isn't for my problem. When will you be back? I can't wait to read more!
        Hrm, I'll take that for humorous irony and not sarcasm. :-)
      Hmm, first: seek is byte-oriented, not line-oriented. As far as I understand it, he needs one line at a time. That isn't much of a problem when a file has fixed-length lines, but since we are talking about a log file that is doubtful (although possible). So while seek is probably the fastest way to move around a file, I'm afraid you can't use it here.

      --
      "WHAT CAN THE HARVEST HOPE FOR IF NOT THE CARE OF THE REAPER MAN"
      -- Terry Pratchett, "Reaper Man"

        That is why the do_test routine skipped the first line it read. It said:

        # Has moved somewhere in file. Skip partial line of log:
        <FILE>;
        This just attempts to make the log traversal smaller, not perfect. (Wouldn't be so hard either, of course. But I'm out the door in a few minutes. :-)
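For what an exact version might look like, here is a hedged sketch of a byte-offset binary search that uses the same "skip the partial line" trick. It is not BerntB's code: the sub names, the numeric timestamp format and the in-memory sample log are all assumptions made for illustration. It returns the byte offset of the first line whose timestamp is greater than the threshold.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);

# Hypothetical helper: the leading integer of a line is its timestamp.
sub stamp_of {
    my ($line) = @_;
    my ($stamp) = $line =~ /^(\d+)/;
    return $stamp;
}

# Return the first complete line that starts after byte position $pos
# (or the line at position 0 itself); undef at end of file.
sub line_after {
    my ($fh, $pos) = @_;
    seek($fh, $pos, SEEK_SET) or die "seek failed: $!";
    if ($pos > 0) { my $partial = <$fh>; }    # skip the rest of this line
    return scalar <$fh>;
}

# Byte offset of the first line whose timestamp is greater than
# $threshold, or the file size if there is no such line.
sub offset_after {
    my ($fh, $threshold) = @_;
    seek($fh, 0, SEEK_END) or die "seek failed: $!";
    my ($lo, $hi) = (0, tell($fh));
    while ($lo < $hi) {
        my $mid  = int(($lo + $hi) / 2);
        my $line = line_after($fh, $mid);
        if (defined $line and stamp_of($line) <= $threshold) {
            $lo = $mid + 1;    # still at or before the threshold
        }
        else {
            $hi = $mid;        # past the threshold, or at end of file
        }
    }
    # $lo is the smallest position whose following line is past the threshold.
    seek($fh, $lo, SEEK_SET) or die "seek failed: $!";
    if ($lo > 0) { my $partial = <$fh>; }
    return tell($fh);
}

# Assumed sample log, read through an in-memory filehandle.
my $log = "10 boot\n20 login\n30 error\n40 logout\n";
open my $fh, '<', \$log or die "Cannot open in-memory log: $!";

my $offset = offset_after($fh, 20);
seek($fh, $offset, SEEK_SET) or die "seek failed: $!";
my @rest = <$fh>;
print @rest;
```

The search needs roughly log2(file size) seeks and never requires fixed-length lines, only that the file is sorted and that every line starts with a timestamp.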
Re: print log file
by fenLisesi (Priest) on Nov 30, 2006 at 07:00 UTC
    use strict;
    use warnings;

    my $THRESHOLD = 2.5;

    LOG_ENTRY:
    while (my $line = <DATA>) {
        if ($line =~ /^(\d+)/) {
            my $stamp = $1;
            if ($stamp > $THRESHOLD) {
                process_line( $line );
                last LOG_ENTRY;
            }
        }
        else {
            warn "Bad line: $line";
        }
    }

    while (my $line = <DATA>) {
        process_line( $line );
    }

    exit(0);

    sub process_line {
        print $_[0];
    }

    __DATA__
    1 llama
    2 alpaca
    3 camel
    4 badger

    prints:

    3 camel
    4 badger

    HTH.

Re: print log file
by johngg (Canon) on Nov 30, 2006 at 10:56 UTC
    Perhaps you could use Tie::File so that you can access the log file as if it were an array. You could then quickly find the index in the array where the timestamp passes the threshold by using a binary chop. Look half-way along the array and check the timestamp. If it is less than the threshold, look again half-way along the upper half, else half-way along the lower half. Repeat, each time chopping the number of elements to be searched in half. Once you have found the element you can print from there to the end of the array.

    Just a thought.

    Cheers,

    JohnGG
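The binary chop described above can be sketched on a plain array; an array tied with Tie::File could be passed in the same way. This is only an illustration: the sub name is made up, and it assumes timestamps in a format such as 2006-06-19.11:47:25 that sorts correctly as a string.

```perl
use strict;
use warnings;

# Return the index of the first line whose timestamp is greater than
# $threshold, or the number of lines if there is no such line.
sub first_index_after {
    my ($lines, $threshold) = @_;
    my ($lo, $hi) = (0, scalar @$lines);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($stamp) = $lines->[$mid] =~ m{^(\S+)};
        if ($stamp le $threshold) {
            $lo = $mid + 1;    # answer lies in the upper half
        }
        else {
            $hi = $mid;        # answer is $mid or lies in the lower half
        }
    }
    return $lo;
}

# Assumed sample data, sorted by timestamp.
my @log = (
    "2006-06-19.11:47:20 Chickens have got into the server\n",
    "2006-06-19.11:47:27 This message is intentionally blank\n",
    "2006-06-19.11:47:47 The disk drive just wants you to know it is fine\n",
    "2006-06-19.11:48:36 The lunatics have taken over the asylum\n",
);

my $idx = first_index_after(\@log, "2006-06-19.11:47:25");
print @log[$idx .. $#log];
```

Each pass halves the range, so a log of about 47,000 lines would need at most around 16 timestamp checks.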

      Also known as a binary search (term must have changed since I went to college).

      --MidLifeXis

        It must have been 25 years ago that I was told about this technique and the name given then was binary chop, I guess because you successively chop the range to be searched in half. Perhaps the name depends on which side of the Atlantic you live. Strangely enough, until today, when I coded one to test if my suggestion worked (it did), I had never had occasion to use one.

        Cheers,

        JohnGG

Re: print log file
by johngg (Canon) on Dec 01, 2006 at 11:49 UTC
    I have written some code to test whether Tie::File and a binary (?:chop|search) as suggested here would actually work. I wrote a script to generate a fictional log file of 47,000+ lines (about 2.5MB). The DateMunge::dateStr is something I wrote years ago before I had access to CPAN or knew about POSIX::strftime. I then wrote the code to test the solution. It seems to work fairly well, taking about 5 seconds to find the correct line in the log running on a SPARC Ultra 30 with a 300MHz CPU. Currently the threshold is hard-coded in the script but it could easily be changed to a command-line argument. Here's the code:

    use strict;
    use warnings;

    use Tie::File;
    use Fcntl q{O_RDONLY};

    my $logFile = q{spw586856.log};
    tie my @logLines, q{Tie::File}, $logFile,
        mode => O_RDONLY, autochomp => 0
        or die qq{tie: $logFile: $!\n};

    my $threshhold    = q{2006-06-19.11:47:25};
    my $threshholdIdx = -1;
    my $firstIdx      = 0;
    my $lastIdx       = $#logLines;

    if ($threshhold lt getDate($logLines[0])) {
        die qq{Threshhold date before range in $logFile\n};
    }
    elsif ($threshhold gt getDate($logLines[-1])) {
        die qq{Threshhold date after range in $logFile\n};
    }

    BIN_CHOP: while (1) {
        if ($threshhold eq getDate($logLines[$firstIdx])) {
            $threshholdIdx = $firstIdx;
            last BIN_CHOP;
        }

        my $idxDiff = $lastIdx - $firstIdx;
        if ($idxDiff < 2) {
            $threshholdIdx = $lastIdx;
            last BIN_CHOP;
        }

        my $midIdx = $firstIdx + int($idxDiff / 2);
        if ($threshhold eq getDate($logLines[$midIdx])) {
            # Step left to the first line of a run of equal timestamps;
            # leave the loop as soon as the previous line differs.
            STEP_LEFT: while ($midIdx > 0) {
                last STEP_LEFT
                    unless $threshhold eq getDate($logLines[$midIdx - 1]);
                $midIdx --;
            }
            $threshholdIdx = $midIdx;
            last BIN_CHOP;
        }
        if ($threshhold lt getDate($logLines[$midIdx])) {
            $lastIdx = $midIdx;
            next BIN_CHOP;
        }
        if ($threshhold gt getDate($logLines[$midIdx])) {
            $firstIdx = $midIdx;
            next BIN_CHOP;
        }
        die qq{Internal error, how did we get here?\n};
    }

    die qq{Binary chop did not find threshhold\n}
        if $threshholdIdx == -1;

    print
        qq{Threshhold : $threshhold\n},
        qq{Line No. : @{[$threshholdIdx + 1]}\n},
        qq{Log msg. : $logLines[$threshholdIdx]},
        qq{Prev. msg. : $logLines[$threshholdIdx - 1]\n};
    print
        qq{Lines from threshhold onwards\n\n},
        @logLines[$threshholdIdx .. $#logLines];

    # Return the leading timestamp of a log line
    sub getDate {
        my $line = shift;
        my ($date) = $line =~ m{^(\S+)};
        return $date;
    }

    and here's some output

    $ ls -l spw586856*
    -rwxr-xr-x 1 jgillman og5a 1895 Dec 1 11:11 spw586856
    -rw-r--r-- 1 jgillman og5a 2507626 Dec 1 10:58 spw586856.log
    -rwxr-xr-x 1 jgillman og5a 696 Dec 1 10:56 spw586856makeData
    $ wc spw586856.log
    47712 332470 2507626 spw586856.log
    $ head -10 spw586856.log
    2006-06-11.05:26:40 Random message
    2006-06-11.05:27:13 This message is intentionally blank
    2006-06-11.05:28:00 Chickens have got into the server
    2006-06-11.05:28:33 The lunatics have taken over the asylum
    2006-06-11.05:29:16 Chickens have got into the server
    2006-06-11.05:29:53 The lunatics have taken over the asylum
    2006-06-11.05:30:34 The disk drive just wants you to know it is fine
    2006-06-11.05:30:52 The lunatics have taken over the asylum
    2006-06-11.05:31:32 All your data is gone
    2006-06-11.05:31:52 All your data is gone
    $ tail -10 spw586856.log
    2006-06-19.11:51:40 Chickens have got into the server
    2006-06-19.11:51:56 Random message
    2006-06-19.11:52:44 Random message
    2006-06-19.11:52:50 All your data is gone
    2006-06-19.11:52:58 This message is intentionally blank
    2006-06-19.11:53:44 The lunatics have taken over the asylum
    2006-06-19.11:54:07 This message is intentionally blank
    2006-06-19.11:54:35 The disk drive just wants you to know it is fine
    2006-06-19.11:54:51 Chickens have got into the server
    2006-06-19.11:55:33 Chickens have got into the server
    $ time spw586856
    Threshhold : 2006-06-19.11:47:25
    Line No. : 47694
    Log msg. : 2006-06-19.11:47:27 This message is intentionally blank
    Prev. msg. : 2006-06-19.11:47:20 Chickens have got into the server

    Lines from threshhold onwards

    2006-06-19.11:47:27 This message is intentionally blank
    2006-06-19.11:47:47 The disk drive just wants you to know it is fine
    2006-06-19.11:48:36 The lunatics have taken over the asylum
    2006-06-19.11:49:12 All your data is gone
    2006-06-19.11:49:28 This message is intentionally blank
    2006-06-19.11:50:01 Chickens have got into the server
    2006-06-19.11:50:10 This message is intentionally blank
    2006-06-19.11:50:38 Chickens have got into the server
    2006-06-19.11:51:24 Random message
    2006-06-19.11:51:40 Chickens have got into the server
    2006-06-19.11:51:56 Random message
    2006-06-19.11:52:44 Random message
    2006-06-19.11:52:50 All your data is gone
    2006-06-19.11:52:58 This message is intentionally blank
    2006-06-19.11:53:44 The lunatics have taken over the asylum
    2006-06-19.11:54:07 This message is intentionally blank
    2006-06-19.11:54:35 The disk drive just wants you to know it is fine
    2006-06-19.11:54:51 Chickens have got into the server
    2006-06-19.11:55:33 Chickens have got into the server

    real 0m5.28s
    user 0m4.86s
    sys 0m0.34s
    $

    I chose a threshold near the end of the file to demonstrate the printing of all lines without scrolling to death, but the search seems to work on all the thresholds that I tested. I hope this is of interest.

    Cheers,

    JohnGG
