Re: About text file parsing

by davido (Cardinal)
in reply to About text file parsing

I did the following:

perl -e 'open my $outfh, ">", "sample.txt"; while ($i++ < 50_000_000) {print $outfh "abcdefghijklmnopqrstuvwxyz0123456789\n";}'

On my laptop with an SSD that took about fifteen seconds to run. Then I did this:

perl -E 'open my $infh, "<", "sample.txt"; while(<$infh>) {$i++} say $i;'

And that took about eight seconds to run. In your code, the body of the while() {...} loop invokes the regex engine, does a capture, and pushes onto two arrays. If, say, 50% of the lines in a file this size are "hits", you'll be pushing 25 million captures into the arrays. Depending on the size of your captures, you could have one to several gigabytes stored in the arrays.
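To put a rough number on that, here is a back-of-the-envelope sketch. All three inputs are assumptions for illustration: the average capture length and the per-scalar overhead in particular are guesses, and the real overhead depends on your Perl version and build.

```perl
use strict;
use warnings;

# All three inputs are assumptions for illustration, not measurements.
my $captures            = 25_000_000;  # 50% hits on a 50-million-line file
my $avg_capture_bytes   = 20;          # hypothetical average capture length
my $per_scalar_overhead = 60;          # rough guess at Perl's bookkeeping per string

# Total is (payload + overhead) for every stored capture.
my $bytes = $captures * ($avg_capture_bytes + $per_scalar_overhead);
printf "approx. %.1f GB\n", $bytes / 1e9;
```

With these particular guesses the estimate comes to about 2 GB; longer captures push it up quickly.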

If the run time for the code segment you showed is under 30 to 45 seconds, you're probably doing about as well as can be expected for a single process working with a file. If it takes more than a couple of minutes, you're probably swamping memory and causing a lot of paging behind the scenes. If that's the case, consider writing entries to a couple of output files instead of pushing them onto the @good and @sample arrays. This adds I/O overhead to the process, but it removes the memory pressure, which is probably generating even more I/O overhead behind the scenes at a much lower layer.
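A minimal sketch of the output-file approach might look like this. The subroutine name split_to_files and its file-name arguments are made up for illustration, and the regex is only my guess at the shape of your lines; substitute your own.

```perl
use strict;
use warnings;

# Stream each capture to one of two output files instead of growing two arrays.
# split_to_files and its arguments are hypothetical names; the regex is a guess.
sub split_to_files {
    my ($in_path, $good_path, $sample_path) = @_;

    open my $in,        '<', $in_path     or die "Cannot read $in_path: $!";
    open my $good_fh,   '>', $good_path   or die "Cannot write $good_path: $!";
    open my $sample_fh, '>', $sample_path or die "Cannot write $sample_path: $!";

    # Map the matched keyword to the handle that should receive the capture.
    my %out = (good => $good_fh, sample => $sample_fh);

    while (my $line = <$in>) {
        if ($line =~ /^(sample|good)\s+(\S+)/) {
            print { $out{$1} } "$2\n";
        }
    }

    for my $fh ($in, $good_fh, $sample_fh) {
        close $fh or die "close failed: $!";
    }
    return;
}
```

You would call it as split_to_files('input.txt', 'good.txt', 'sample.txt'). Only one line is held in memory at a time, so the footprint stays flat no matter how many lines match.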

Once the 'sample' and 'good' files are written, you can process them line by line to do whatever you would have done with the arrays. Another alternative: instead of pushing onto @sample and @good, do the processing that would later happen on @sample and @good just in time, as each line of the input file is read. For example:

my %dispatch = (
    sample => sub {
        my $capture = shift;
        # do something with $capture
    },
    good => sub {
        my $capture = shift;
        # do something with $capture
    },
);

while (<FILE>) {
    if (/^(sample|good)\s+(\S+)/) {
        $dispatch{$1}->($2);
    }
}

As long as # do something with $capture does not include storing the entire capture into an array, this should pretty much wipe out the large memory footprint.
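As one illustration of handlers that keep memory flat, here is a sketch where "do something" is just a running tally. The %count hash, $longest_good, and process_fh are hypothetical names I made up for the example, and the sample data is invented; your real processing will differ.

```perl
use strict;
use warnings;

# Running totals only; nothing here grows with the number of input lines.
my %count        = (sample => 0, good => 0);
my $longest_good = '';   # hypothetical example of a per-line reduction

my %dispatch = (
    sample => sub {
        my $capture = shift;
        $count{sample}++;
    },
    good => sub {
        my $capture = shift;
        $count{good}++;
        # Keep only the single longest capture seen so far.
        $longest_good = $capture
            if length($capture) > length($longest_good);
    },
);

# Read a handle line by line and hand each capture to its handler.
sub process_fh {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        if ($line =~ /^(sample|good)\s+(\S+)/) {
            $dispatch{$1}->($2);
        }
    }
    return;
}

# Demo on an in-memory filehandle so the sketch runs without a real file.
my $data = "good alpha\nsample s1\nnoise here\ngood be\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
process_fh($fh);
close $fh;

print "good=$count{good} sample=$count{sample} longest_good=$longest_good\n";
```

The handlers touch a fixed number of scalars regardless of file size, which is the property that matters here.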


