Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Halve the difference

by tachyon (Chancellor)
on Aug 22, 2001 at 15:39 UTC ( #106920=note: print w/replies, xml ) Need Help??

in reply to Removing old records from log files

Further to my last post here is an implemetation of the halve the difference method. Uncomment the 3 lines at the start to generate an 8MB test.file. We find our desired reference point in 20 tries (worst case scenario) in a few milliseconds and then dump the rest of the file (3 lines). Assuming you are going to have to work with dates you will need to modify this of course so you can compare if you are before or after your desired start but the principle holds. The total run time should be only a fraction over the time it take to write your output file. Rename it and you are done. You *will* get an infinite loop if your $find_this is not in the file so we abort if $count > $max_tries. With $maxtries set to 100 you are ok for a file with up to 2**100 lines (10**30 in rough terms :-)

my $file = 'c:/test.file'; #open F, ">$file" or $!; #print F "$_\n" for 1..1000000; #exit; my $find_this = 999997; my $file_size = -s $file; my $top = 0; my $bot = $file_size; my $count = 0; my $max_tries = 100; open OLD, $file or die $!; while (++$count) { my $middle = int(($top+$bot)/2); seek OLD, $middle , 0; my $partial = <OLD>; my $full_line = <OLD>; chomp $full_line; if ($full_line eq $find_this) { print "Took $count tries\n"; print while <OLD>; exit; } if ($full_line < $find_this) { $top = $middle; } else { $bot = $middle; } die "Ark, baling out of infinite loop" if $count > $max_tries; }

Let us know how you get on.




Replies are listed 'Best First'.
by claree0 (Hermit) on Aug 22, 2001 at 17:00 UTC

    Well, I've made some mods to your sample code, and taken the trim time on my sample file from 2m26s to 0.4 seconds. Wow!

    In the code below, I haven't included the subroutine to calculate the epoch-second date of each line 'cos it's longer htan the rest of the file!

    Thank you, Tachyon!

    #!/usr/local/perl -w use strict; my $file ='current.txt'; my $daystokeep = $ARGV[0]; my $secs_to_keep = $daystokeep * 3600 * 24; my $now = time(); my $earliest = $now - $secs_to_keep; my $file_size = -s $file; my $top = 0; my $bottom = $file_size; my $count = 0; my $max_tries = 100; open (OLD, "$file") or die $!; open (NEW, ">new.txt") or die $!; while (++$count) { my $middle = int (($top + $bottom) / 2); seek OLD, $middle, 0; my $partial = <OLD>; my $full = <OLD>; my $next = <OLD>; if (((linesecs($full)) < $earliest) && ((linesecs($next)) > $e +arliest)) { print NEW $next; print NEW while <OLD>; exit; } if ((linesecs($full)) < $earliest) { $top = $middle; } else { $bottom = $middle; } } close OLD; close NEW;

      Wow, 36500% faster. That's a worthwhile saving. Glad it helped. It's always good to use a geometric search rather than a linear one when you have any form of sorted data that you can use the split the dif algoritm on. The number of tries to find the desired position is given by:

      print "Num items Geom avg Lin avg Lin:Geom\n"; for ( my $num_items = 2; $num_items < 2<<20; $num_items <<= 1 ) { # geometric my $geom_max = int(log($num_items)/log(2))+1; my $geom_avg = int(log(($num_items/2))/log(2))+1; # linear my $lin_max = $num_items; my $lin_avg = $num_items/2; printf "%8d %8d %8d %8d\n", $num_items, $geom_avg, $lin_avg, $lin_avg/$geom_avg; }
      Should wrap it in a module one day :-)




Re: Halve the difference
by VSarkiss (Monsignor) on Aug 22, 2001 at 18:49 UTC

    Very clever, tachyon ++. I first thought "How do you do binary search in a file with variable-length records?" Your answer is simple and effective.

    Please consider linking to this from Tutorials.

      I have wrapped the concept in a module with a variety of useful widgets. It's at File::Seek




Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://106920]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (8)
As of 2017-09-21 17:47 GMT
Find Nodes?
    Voting Booth?
    During the recent solar eclipse, I:

    Results (251 votes). Check out past polls.