
This is an update on my investigations regarding using ':raw' and sysread(). It is also an apology to most of the monks. The question I have been pursuing for 3 days is:

Why did I consistently see such dramatic performance improvements with each of the steps outlined in my root node when (apparently) one or two others who tried this out failed to see similar improvements?

I have pursued various avenues, including memory allocation/deallocation and mismatched C-runtime libraries/.DLLs. I have also completely blown away various non-standard compilers and other tools, and finally re-installed perl 5.8.2.

Despite all this, I still see the same dramatic performance increases on my system from each of the steps outlined.

Finally, I know why!

The reason is newlines--or rather, the absence of them.

When I generated my test data, the 3GB file, I did it using a one-liner that generated random-length strings (max. 10,000, as mentioned in the original post upon which I based the idea) of random sequences of A, C, G & T, and kept printing new random lines until the accumulated lengths totalled 3GB+.

Hardly efficient, but it was a one-off (all smaller test files were simply head -c 10485760 3GB.dat > 10MB.dat etc.), easy to type and I was going to watch a movie while it ran anyway.

However, it appears that I omitted one thing: my customary -l. That meant that none of my test files contained a single newline. And that explains everything.
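To illustrate the effect of the missing -l, here is a minimal sketch; it is not the original one-liner, which is not shown in this post. The helpers random_lines, write_lines and count_records are hypothetical names, and the data is written to an in-memory file purely for demonstration. The -l switch sets the output record separator $\ to a newline, so print appends one to every string; without it, print appends nothing.

```perl
use strict;
use warnings;

# Hypothetical helper: build $count random strings over A, C, G, T,
# each between 1 and $max_len characters long.
sub random_lines {
    my ($count, $max_len) = @_;
    my @bases = qw(A C G T);
    return map {
        my $len = 1 + int rand $max_len;
        join '', map { $bases[ rand @bases ] } 1 .. $len;
    } 1 .. $count;
}

# Write the strings to an in-memory file. $ors stands in for $\ :
# -l sets it to "\n"; without -l it is undefined and print adds nothing.
sub write_lines {
    my ($ors, @lines) = @_;
    my $buffer = '';
    open my $out, '>', \$buffer or die "Cannot open in-memory file: $!";
    {
        local $\ = $ors;
        print {$out} $_ for @lines;
    }
    close $out;
    return $buffer;
}

# Count how many records a plain line-by-line read sees in the data.
sub count_records {
    my ($data) = @_;
    open my $in, '<', \$data or die "Cannot open in-memory file: $!";
    my $n = 0;
    while (defined(my $line = <$in>)) {
        $n++;
    }
    close $in;
    return $n;
}

my @lines = random_lines(5, 20);
printf "with -l:    %d records\n", count_records(write_lines("\n",  @lines));
printf "without -l: %d records\n", count_records(write_lines(undef, @lines));
```

With the separator set, the five strings are read back as five records; without it, they run together into a single record with no newline anywhere in it.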

And so, whilst using ':raw' and sysread() does indeed provide some fairly beneficial performance improvements (used correctly), those improvements are far less dramatic than my original post showed if the file does contain newlines.
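For reference, a minimal sketch of the kind of ':raw' plus sysread() read loop under discussion. This is not the code from the root node; count_bases_chunked, the chunk size and the temporary test file are all assumptions made for illustration. Because it reads fixed-size blocks, its memory use does not depend on whether the data contains newlines, whereas a line-by-line read of a file with no newlines has to treat the whole file as one record.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: count A, C, G and T in a file by reading it in
# fixed-size blocks with sysread, bypassing line-oriented buffered reads.
sub count_bases_chunked {
    my ($path, $chunk_size) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my %count = (A => 0, C => 0, G => 0, T => 0);
    while (1) {
        my $read = sysread $fh, my $buf, $chunk_size;
        die "sysread failed on $path: $!" unless defined $read;
        last if $read == 0;    # end of file
        # tr/// with an empty replacement list counts without modifying.
        $count{A} += ( $buf =~ tr/A// );
        $count{C} += ( $buf =~ tr/C// );
        $count{G} += ( $buf =~ tr/G// );
        $count{T} += ( $buf =~ tr/T// );
    }
    close $fh;
    return \%count;
}

# Demonstration on a small temporary file that contains no newlines.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} 'ACGT' x 1000;
close $tmp or die "Cannot close $path: $!";

my $count = count_bases_chunked($path, 64);
print "$_: $count->{$_}\n" for qw(A C G T);
```

The chunk size of 64 bytes is only there to force many reads on a tiny file; for real data a much larger block would be the obvious choice, and the best value is something to measure rather than assume.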

And so, I apologise to the community of PerlMonks for this misinformation.

My sincerest apologies, BrowserUk.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

In reply to Re: Optimising processing for large data files. (Apology and explaination) by BrowserUk
in thread Optimising processing for large data files. by BrowserUk
