in reply to Displaying/buffering huge text files
This builds an index to a million line file in around 20 seconds, and accesses and prints 10,000 lines at random in under 1 second.
If the 20 seconds is too long for startup, then you could always read enough to allow you to populate your listbox, and then push the rest of the index building off into a background thread to finish, while you populate the listbox with the first 1000 lines or so.
The indexing could be sped up by a more intelligent indexer, that reads larger chunks and searched for the newlines, rather than reading one line at a time.
The index requires just 8 MB of ram. That could be halved by using 'N' or 'V' as your pack format, rather than 'd', if your files will never go over 4GB. By using 'd' you are good for files up to around 8,500 Terabytes, which should see you through the next couple of machine changes or so:)
#! perl -slw
use strict;
$| = 1;
open FILE, '<', $ARGV[ 0 ] or die $!;
print 'Before indexing: ', time;
my $index = pack 'd', 0;
$index .= pack 'd', tell FILE while <FILE>;
print 'After indexing: ', time;
print 'Size of index: ', length $index;
for my $i ( map{ int rand( length( $index )/8 ) } 1 .. 10_000 ) {
my $line = unpack( 'd', substr $index, $i*8, 8 );
seek FILE, $line, 0;
chomp( $_ = <FILE> );
printf "\r$line : '%s'", $_;
}
print "\nAfter reading 10,000 random lines: ", time;
__END__
P:\test>433953 data\1millionlines.dat
Before indexing: 1109146372
After indexing: 1109146392
Size of index: 8000008
1865840 : '00186585'
After reading 10,000 random lines: 1109146393
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
Re^2: Displaying/buffering huge text files
by spurperl (Priest) on Feb 23, 2005 at 08:35 UTC
|
First of all, thanks from the detailed answer. One can always expect that from you :-)
Now, indexing may indeed be my way to go. In fact, I may be needing to do it in C++. I also ran indexing, it took 14 seconds on a 3e6 line file (the file itself is ~60MB) using STL streams and 8 seconds using C file access (fgets(), ftell()). The memory consumption is pretty good, only 12 MB for these 3e6 files (in a vector). Accessing random lines with fseek() is, as expected, almost immediate.
This closes in on acceptable, and I'm starting to believe that it is indeed possible to keep it in memory. (I don't need to handle files > 4 GB).
Another problem that I forgot to mention is that the file is being updated "live" and I should keep up with it. I can get notifications, and probably just add new indexes (the file always grows, never shrinks). | [reply] |
|
Yes, appending to the index should cause no problems at all. As your limiting yourself to under 4 GB, then using pack 'J' (perl native unsigned 32-bit which saves a little conversion) rather 'd' effects a speedup of the indexing of around x4 giving under 5 seconds for the 1e6 file.
#! perl -slw
use strict;
$| = 1;
open FILE, '<', $ARGV[ 0 ] or die $!;
print 'Before indexing: ', time;
my $index = pack 'J', 0;
$index .= pack 'J', tell FILE while <FILE>;
print 'After indexing: ', time;
print 'Size of index: ', length $index;
for my $i ( map{ int rand( length( $index )/4 ) } 1 .. 10_000 ) {
my $line = unpack( 'J', substr $index, $i*4, 4 );
seek FILE, $line, 0;
chomp( $_ = <FILE> );
printf "\r$line : '%s'", $_;
}
print "\nAfter reading 10,000 random lines: ", time;
__END__
P:\test>433953 data\1millionlines.dat
Before indexing: 1109148435
After indexing: 1109148440
Size of index: 4000004
1087640 : '00108765'
After reading 10,000 random lines: 1109148441
Almost quick enough that you could avoid dropping to C++ :)
Win32 also supports Memory Mapped Files natively, complete with Copy-on-Write where applicable. I also think that tye's Win32API::File may give you access to some, if not all the apis required to use them. I'm not sure that it would work out any quicker than indexing though and the fact that the file is being written to may cause problems.
Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
| [reply] [d/l] |
|
| [reply] |
Re^2: Displaying/buffering huge text files
by Rudif (Hermit) on Mar 28, 2005 at 19:39 UTC
|
Hi BrowserUk
Your indexing program is very neat. I never realized that it can be so simple and effective (in perl). Thanks!
I tried it on some large files (50 to 500 MB, with average line length about 150 characters).
I quickly spotted speed a problem with
$index .= pack 'd', tell FILE while <FILE>;
... it has to copy all previously packed data in every .= operation, so that the time grows with the square of number of lines. A big oh, O(x^2) to be precise.
Here is my (almost) drop-in replacement which trades memory space for indexing time
my @index = ( pack 'd', 0 );
push @index, pack 'd', tell FILE while <FILE>;
pop @index;
my $index = join '', @index;
... and the timing that shows roughly O(x) times for mine, and O(x^2) for yours (you can see the parabola in the table, if you look at it sideways).
In the last test case (the 527 MB file) with my script version the process memory usage peaked at +270 MB for a final index size of 27.5 MB.
Indexing mine : time 1 s, size 19949431, lines 136126
Indexing mine : time 2 s, size 40308893, lines 258457
Indexing mine : time 5 s, size 95227350, lines 634392
Indexing mine : time 29 s, size 527423877, lines 3441911
Indexing yours: time 2 s, size 19949431, lines 136127
Indexing yours: time 6 s, size 40308893, lines 258458
Indexing yours: time 31 s, size 95227350, lines 634393
Indexing yours: time 809 s, size 527423877, lines 3441912
I also added pop @index;, to get rid of the last index - it points to the end of the file, after the last line.
Rudif
| [reply] [d/l] [select] |
Re^2: Displaying/buffering huge text files
by Anonymous Monk on Dec 22, 2011 at 17:28 UTC
|
Hey! I was looking for a way to handle big files and this looks awesome. I already tried and runs so cool. However, I am kind newbie on Perl. I was trying to modify in order to get secuencial access, not random. I haven't good luck on this. I just tried to remove the int rand( length( $index )/8) and let this with length of $index. Didn't work, the seek doesn't move the pointer in the file. Could someone can give me a hint? Thank you in advance! | [reply] |
|
I was trying to modify in order to get secuencial access, not random.
Sorry, but that makes no sense at all. You have to read the entire file sequentially once in order to build an index.
If you want to process the file sequentially, just read it sequentially. There is nothing to be gained by using an index unless you need random access.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] |
|
Us N instead of non-portable J.
my $index = pack 'N', 0;
$index .= pack 'N', tell FILE while <FILE>;
for( my $line_idx = $s; $line_idx < $fileLastLine; $line_idx++ ) {
my $line = unpack( 'N', substr($index, $line_idx*4, 4));
seek FILE, $line, 0;
chomp( $_ = <FILE> );
my $line_str = $_;
| [reply] [d/l] |
|
Us N instead of non-portable J.
'J' may be non-portable, but why oh why would you build an index on one machine and the export it to another? (Hint:You wouldn't unless you are naive!)
The reason for using 'J' in preference to 'N', is that it is 'native'. That is, whatever architecture of machine you use it on it uses the format that is natural, and therefore most efficient, for that machine.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] |
|
In case it isn't clear, $s is for start, the line to start on. For $fileLastLine I used:
my $fileLastLine = `wc -l ${filename}`;
chomp $fileLastLine;
$fileLastLine = int ($fileLastLine);
| [reply] [d/l] |
|
|