Your indexing program is very neat. I never realized that it could be so simple and effective (in Perl). Thanks!
I tried it on some large files (50 to 500 MB, with average line length about 150 characters).
I quickly spotted a speed problem with

    $index .= pack 'd', tell FILE while <FILE>;

... it has to copy all previously packed data in every .= operation, so that the time grows with the square of the number of lines. A big oh, O(x^2) to be precise.
Here is my (almost) drop-in replacement which trades memory space for indexing time

    my @index = ( pack 'd', 0 );
    push @index, pack 'd', tell FILE while <FILE>;
    pop @index;
    my $index = join '', @index;

... and the timing that shows roughly O(x) times for mine, and O(x^2) for yours (you can see the parabola in the table, if you look at it sideways).

    Indexing mine : time   1 s, size  19949431, lines  136126
    Indexing mine : time   2 s, size  40308893, lines  258457
    Indexing mine : time   5 s, size  95227350, lines  634392
    Indexing mine : time  29 s, size 527423877, lines 3441911
    Indexing yours: time   2 s, size  19949431, lines  136127
    Indexing yours: time   6 s, size  40308893, lines  258458
    Indexing yours: time  31 s, size  95227350, lines  634393
    Indexing yours: time 809 s, size 527423877, lines 3441912

In the last test case (the 527 MB file) with my script version the process memory usage peaked at +270 MB for a final index size of 27.5 MB.
I also added pop @index;, to get rid of the last index - it points to the end of the file, after the last line.
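
For anyone who wants to try the array-and-join variant in isolation, here is a minimal self-contained sketch. It is not the original indexing program: the helper names build_index and fetch_line, the lexical file handles, and the ten-line sample file are my own assumptions for illustration. It assumes the index is a string of packed 'd' offsets, one per line, as in the snippets above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Fcntl qw(SEEK_SET);

# Throwaway sample file (hypothetical data): "line 0" .. "line 9".
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "line $_\n" for 0 .. 9;
close $tmp or die "close: $!";

# Collect one packed offset per line in an array, then join once,
# instead of appending to a growing string with .= on every line.
sub build_index {
    my ($file) = @_;
    open my $fh, '<', $file or die "open $file: $!";
    my @index = ( pack('d', 0) );            # first line starts at offset 0
    push @index, pack('d', tell($fh)) while <$fh>;
    pop @index;                              # last entry points at end of file
    close $fh;
    return join '', @index;
}

# Look up line $n (zero-based) through the packed index.
sub fetch_line {
    my ($fh, $index, $n) = @_;
    my $size   = length(pack('d', 0));       # bytes per index entry
    my $offset = unpack('d', substr($index, $n * $size, $size));
    seek($fh, $offset, SEEK_SET) or die "seek: $!";
    return scalar <$fh>;
}

my $index = build_index($path);
open my $fh, '<', $path or die "open $path: $!";
print fetch_line($fh, $index, 7);            # prints "line 7"
close $fh;
```

Assuming 'd' packs to 8 bytes, the 3441911 lines of the largest file give 3441911 * 8 = 27535288 bytes, which matches the 27.5 MB index size reported above.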