
Re: About text file parsing

by davido (Cardinal)
on Aug 29, 2018 at 18:45 UTC (#1221322)

in reply to About text file parsing

I did the following:

perl -e 'open my $outfh, ">", "sample.txt"; while ($i++ < 50_000_000) {print $outfh "abcdefghijklmnopqrstuvwxyz0123456789\n";}'

On my laptop with an SSD, that took about fifteen seconds to run and produced 50 million lines of 37 bytes each (about 1.85 GB). Then I did this:

perl -E 'open my $infh, "<", "sample.txt"; while(<$infh>) {$i++} say $i;'

And that took about eight seconds to run. In your code, within the while() {...} loop, you're invoking the regex engine, doing a capture, and pushing onto two arrays. If, say, 50% of the lines in a file this size are "hits", you'll be pushing 25 million captures into the arrays. Depending on the size of your captures, you could have one to several gigabytes stored in the arrays.

If your run times for the code segment you demonstrated are under 30-45 seconds, you're probably doing about as well as can be expected for a single process working with a file. If the time is over a couple of minutes, you're probably swamping memory and doing a lot of paging out behind the scenes. If that's the case, consider writing entries to a couple of output files instead of pushing onto the @good and @sample arrays. This will add IO overhead to the process, but will remove the memory impact, which is probably generating even more IO overhead behind the scenes at a much lower layer.
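Here is a minimal sketch of the write-to-files idea. I haven't seen your real pattern, so the regex, the helper name split_to_files, and the file names are all assumptions for illustration; swap in your own.

```perl
use strict;
use warnings;

# Hypothetical helper: stream $in_path once and write each capture to a
# per-label output file instead of pushing it onto @sample or @good.
sub split_to_files {
    my ($in_path, $sample_path, $good_path) = @_;

    open my $in,        '<', $in_path     or die "Cannot read $in_path: $!";
    open my $sample_fh, '>', $sample_path or die "Cannot write $sample_path: $!";
    open my $good_fh,   '>', $good_path   or die "Cannot write $good_path: $!";

    my %out   = (sample => $sample_fh, good => $good_fh);
    my %count = (sample => 0,          good => 0);

    while (my $line = <$in>) {
        # Assumed line format: a label, whitespace, then the value to keep.
        if ($line =~ /^(sample|good)\s+(\S+)/) {
            print { $out{$1} } "$2\n";
            $count{$1}++;
        }
    }

    for my $fh ($in, $sample_fh, $good_fh) {
        close $fh or die "Close failed: $!";
    }
    return \%count;    # only two integers stay in memory
}
```

Calling split_to_files("sample.txt", "sample.out", "good.out") would leave one capture per line in each output file, so memory use stays flat no matter how many lines match.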

Once the 'sample' and 'good' files are written, you can process them line by line to do with them what you would have done with the arrays. Another alternative is to skip the arrays entirely: instead of pushing onto @sample and @good, do the processing you would later have done on them just in time, as each line of the input file is read. For example:

my %dispatch = (
    sample => sub {
        my $capture = shift;
        # do something with $capture
    },
    good => sub {
        my $capture = shift;
        # do something with $capture
    },
);

while (<FILE>) {
    if (/^(sample|good)\s+(\S+)/) {
        $dispatch{$1}->($2);
    }
}

As long as "# do something with $capture" does not include storing the entire capture in an array, this should pretty much wipe out the large memory footprint.
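To make that concrete, here is one possible shape for the handlers, where each keeps only a small running summary. I don't know what you actually need from the captures, so the count and longest-length statistics and the summarize name are hypothetical stand-ins.

```perl
use strict;
use warnings;

# Hypothetical example: read from an already-open filehandle and keep a
# running summary per label rather than storing every capture.
sub summarize {
    my ($fh) = @_;

    my %stats = (
        sample => { count => 0, longest => 0 },
        good   => { count => 0, longest => 0 },
    );

    # Shared bookkeeping used by both handlers.
    my $track = sub {
        my ($label, $capture) = @_;
        my $s = $stats{$label};
        $s->{count}++;
        $s->{longest} = length($capture) if length($capture) > $s->{longest};
    };

    my %dispatch = (
        sample => sub { $track->('sample', shift) },
        good   => sub { $track->('good',   shift) },
    );

    while (my $line = <$fh>) {
        # Assumed line format, same as in the dispatch example above.
        if ($line =~ /^(sample|good)\s+(\S+)/) {
            $dispatch{$1}->($2);
        }
    }
    return \%stats;
}
```

Memory use here depends on the number of labels, not on the number of matching lines, which is the point of processing each capture as it arrives.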

