PerlMonks  


I did the following:

perl -e 'open my $outfh, ">", "sample.txt"; while ($i++ < 50_000_000) {print $outfh "abcdefghijklmnopqrstuvwxyz0123456789\n";}'

On my laptop with an SSD that took about fifteen seconds to run. Then I did this:

perl -E 'open my $infh, "<", "sample.txt"; while(<$infh>) {$i++} say $i;'

And that took about eight seconds to run. In your code, the while() {...} loop also invokes the regex engine, does a capture, and pushes onto two arrays. If, say, 50% of the lines in a file this size are "hits", you'll be pushing 25 million captures into the arrays. Depending on the size of your captures, you could have one to several gigabytes stored in the arrays.

If the run time for the code segment you showed is under 30-45 seconds, you're probably doing about as well as can be expected for a single process working with a file. If it takes more than a couple of minutes, you're probably swamping memory and doing a lot of paging behind the scenes. In that case, instead of pushing into the @good and @sample arrays, consider writing entries to a couple of output files. This adds IO overhead to the process, but it removes the memory impact, which is probably generating even more IO overhead behind the scenes at a much lower layer.
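Here is a minimal sketch of the write-to-files idea. The sub name split_captures and the file paths are hypothetical, and the line format (a sample or good keyword, whitespace, then the value) is an assumption borrowed from the regex further down; adjust it to match your real data.

```perl
use strict;
use warnings;

# Hypothetical helper: stream the input once and write each capture to the
# matching output file instead of pushing it onto @sample or @good.
sub split_captures {
    my ($in_path, $sample_path, $good_path) = @_;

    open my $in,        '<', $in_path     or die "Cannot read $in_path: $!";
    open my $sample_fh, '>', $sample_path or die "Cannot write $sample_path: $!";
    open my $good_fh,   '>', $good_path   or die "Cannot write $good_path: $!";

    my %out   = (sample => $sample_fh, good => $good_fh);
    my %count = (sample => 0, good => 0);

    while (my $line = <$in>) {
        # Assumed format: keyword, whitespace, then the value to capture.
        next unless $line =~ /^(sample|good)\s+(\S+)/;
        print { $out{$1} } "$2\n";   # one capture per output line
        $count{$1}++;
    }

    for my $fh ($in, $sample_fh, $good_fh) {
        close $fh or die "close failed: $!";
    }
    return \%count;                  # only two small counters stay in memory
}

# Run as: perl split.pl input.txt sample.out good.out
split_captures(@ARGV) if @ARGV == 3;
```

Memory use stays flat no matter how big the input is, because each capture goes straight to disk as soon as it is matched.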

Once the 'sample' and 'good' files are written, you can process them line by line to do with them what you would have done with the arrays. Another alternative: instead of pushing onto @sample and @good, do the processing that would later happen on @sample and @good just in time, for each line of the input file. For example:

my %dispatch = (
    sample => sub {
        my $capture = shift;
        # do something with $capture
    },
    good => sub {
        my $capture = shift;
        # do something with $capture
    },
);

while (<FILE>) {
    if (/^(sample|good)\s+(\S+)/) {
        $dispatch{$1}->($2);
    }
}

As long as # do something with $capture does not include storing the entire capture into an array, this should pretty much wipe out the large memory footprint.
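To make that concrete, here is one possible shape for the handlers, as a sketch only. The sub name summarize_handle and the choice of statistics (a count and a total character length per kind) are my own assumptions for illustration; the point is just that each handler keeps a running summary and lets the capture go.

```perl
use strict;
use warnings;

# Hypothetical example: handlers that keep only running totals,
# never the captures themselves.
sub summarize_handle {
    my ($fh) = @_;

    my %stats = (
        sample => { count => 0, chars => 0 },
        good   => { count => 0, chars => 0 },
    );

    # Shared bookkeeping used by both handlers.
    my $tally = sub {
        my ($kind, $capture) = @_;
        $stats{$kind}{count}++;
        $stats{$kind}{chars} += length $capture;
    };

    my %dispatch = (
        sample => sub { $tally->('sample', shift) },
        good   => sub { $tally->('good',   shift) },
    );

    while (my $line = <$fh>) {
        if ($line =~ /^(sample|good)\s+(\S+)/) {
            $dispatch{$1}->($2);
        }
    }
    return \%stats;
}
```

You would call it with an already opened read handle, for example summarize_handle($infh), and get back a small hash reference whose size does not depend on the number of lines in the file.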


Dave


In reply to Re: About text file parsing by davido
in thread About text file parsing by dideod.yang
