Your indexing program is very neat. I never realized that it could be so simple and effective (in Perl). Thanks!
I tried it on some large files (50 to 500 MB, with average line length about 150 characters).
I quickly spotted a speed problem with
$index .= pack 'd', tell FILE while <FILE>;
... it has to copy all previously packed data in every .= operation, so the time grows with the square of the number of lines. A big oh, O(x^2) to be precise.
Here is my (almost) drop-in replacement, which trades memory space for indexing time:
my @index = ( pack 'd', 0 );
push @index, pack 'd', tell FILE while <FILE>;
my $index = join '', @index;
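For anyone who wants to try this outside the original program, here is a minimal self-contained sketch of building the packed index and then using it to jump to a given line. The helper names build_index and fetch_line, the lexical filehandles, and the three-line temporary file are hypothetical, made up for illustration; only the pack/push/join idea comes from the snippet above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Size in bytes of one packed 'd' (double) entry on this platform.
my $ENTRY = length(pack('d', 0));

# Build a packed index of line-start offsets (hypothetical helper).
sub build_index {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @index = ( pack('d', 0) );                  # line 1 starts at offset 0
    push @index, pack('d', tell($fh)) while <$fh>; # offset after each line read
    close $fh;
    pop @index;                                    # last entry is end of file, not a line start
    return join '', @index;
}

# Return line $n (1-based) by seeking to its recorded offset (hypothetical helper).
sub fetch_line {
    my ($fh, $index, $n) = @_;
    my $count = length($index) / $ENTRY;
    return if $n < 1 || $n > $count;
    my $offset = unpack('d', substr($index, ($n - 1) * $ENTRY, $ENTRY));
    seek($fh, $offset, 0) or die "seek failed: $!";
    return scalar <$fh>;
}

# Demo on a small made-up file.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "alpha\n", "beta\n", "gamma\n";
close $tmp;

my $index = build_index($path);
open my $fh, '<', $path or die "Cannot open $path: $!";
print fetch_line($fh, $index, 2);   # prints "beta"
close $fh;
```

The array holds one small scalar per line until the final join, which is where the extra memory goes.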
... and the timing that shows roughly O(x) times for mine, and O(x^2) for yours (you can see the parabola in the table, if you look at it sideways).
In the last test case (the 527 MB file), process memory usage with my version of the script peaked at +270 MB for a final index size of 27.5 MB (3441911 entries of 8 bytes each).
Indexing mine : time 1 s, size 19949431, lines 136126
Indexing mine : time 2 s, size 40308893, lines 258457
Indexing mine : time 5 s, size 95227350, lines 634392
Indexing mine : time 29 s, size 527423877, lines 3441911
Indexing yours: time 2 s, size 19949431, lines 136127
Indexing yours: time 6 s, size 40308893, lines 258458
Indexing yours: time 31 s, size 95227350, lines 634393
Indexing yours: time 809 s, size 527423877, lines 3441912
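If you want to repeat the comparison on your own machine, something like the following rough harness might do. It is only a sketch: the sub names, the in-memory test data, and the 20,000-line size are made up for illustration, and how the append version scales will likely depend on the Perl build and the platform's memory allocator, so do not expect it to reproduce the numbers above.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Append each packed offset to one growing string (the original approach).
sub index_by_append {
    my ($fh) = @_;
    seek($fh, 0, 0);
    my $index = pack('d', 0);
    $index .= pack('d', tell($fh)) while <$fh>;
    return $index;
}

# Collect packed offsets in an array and join once at the end.
sub index_by_push {
    my ($fh) = @_;
    seek($fh, 0, 0);
    my @index = ( pack('d', 0) );
    push @index, pack('d', tell($fh)) while <$fh>;
    return join '', @index;
}

# Made-up test data: 20,000 lines of roughly 150 characters, read via an in-memory filehandle.
my $data = join '', map { "line $_ " . ('x' x 140) . "\n" } 1 .. 20_000;
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

for my $case ( [ 'append', \&index_by_append ], [ 'push', \&index_by_push ] ) {
    my ($name, $code) = @$case;
    my $start = time();
    my $index = $code->($fh);
    printf "%-6s: %.3f s, index bytes %d\n", $name, time() - $start, length($index);
}
```

To see how each version scales, raise the line count in steps and compare how the two timings grow.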
I also added
pop @index;
to get rid of the last index entry, which points to the end of the file, after the last line.