However, it seems that the multithread one ran even much slower than the single thread script.

The problem here is not the threading per se -- although it may be compounded by it. A big part of the slowdown is due to asking your hard disk to read from multiple files simultaneously, forcing the read head to dance all over the disk to fetch one block from one file; then one block from another; then one block from another ...; and then back to the first; and so on.

Regardless of the speed of your disk -- the same would be true for SSDs, though less so -- and regardless of whether you use threads or separate processes, reading 12 large files concurrently will be far slower than reading those same files sequentially.

A good analogy would be trying to read 12 books by reading one page of one, then one page of the next, and so on. Even ignoring the effect that will have on your brain trying to keep all the stories straight, just the simple need to constantly switch from one book to another to another will seriously slow down your throughput rate.

This can be somewhat mitigated by putting the files on different physical drives -- either multiple physical drives configured as multiple logical drives, or multiple physical drives raided as a single logical drive -- because one drive can actually be transferring data whilst the other(s) are moving their read heads. But even this will usually result in lower throughput than sequential reading, because of the extra context switches (thread or process), system bus and device bus conflicts, etc. It also creates extra load on the physical/virtual memory mapping; and on the L1/L2/L3 caches and the system file cache.

In the past, I have had some success speeding up the reading of many files by serially slurping them into scalars and then handing those scalars off to threads to process line by line. I do that by opening each slurped scalar as a RAM file, and then using the familiar while( <$FH> ){ ... } loop on the scalar. But even this requires considerable care to ensure that the (huge) slurped scalars don't get unnecessarily duplicated in the process of handing them off to the threads for processing.
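A minimal sketch of that approach, assuming a Thread::Queue hand-off. The file list, worker count, and per-line work are placeholders; note that enqueuing the scalar still makes one copy into the queue, which is the kind of duplication that needs watching:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my @files = glob '*.log';       # hypothetical file set
my $Q = Thread::Queue->new;

my @workers = map {
    threads->create( sub {
        while( defined( my $data = $Q->dequeue ) ) {
            # Open the slurped scalar as a RAM file ...
            open my $FH, '<', \$data or die $!;
            # ... and process it with the familiar line-by-line loop
            while( my $line = <$FH> ) {
                # per-line processing goes here
            }
        }
    } )
} 1 .. 4;

# Slurp each file serially -- one fast sequential read per file
for my $file ( @files ) {
    open my $in, '<', $file or die "$file: $!";
    my $slurp = do { local $/; <$in> };
    $Q->enqueue( $slurp );      # one copy into the shared queue
}

$Q->enqueue( (undef) x @workers );      # one terminator per worker
$_->join for @workers;
```

The key point is that the disk only ever sees sequential reads; the concurrency is confined to the in-memory processing.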

Swapping doesn't help

I also note that you are building a shared hash containing (from what you said) 12 shared sub-hashes, each containing 21,000,000 key/value pairs.

On the basis of a simple experiment -- a shared hash containing 5,000,000 key/value pairs shared by just 2 threads requires 2.0GB on my 64-bit perl -- I therefore estimate that your shared hash will require at least 12 * (21/5) * 2.0GB = 100.8GB, and possibly much more if you have long keys or values. So, unless you have a very large amount of memory, you will be moving your system into swapping by loading that much data. Indeed, trying to load that amount into a non-shared hash is going to move you into swapping even if you read the files sequentially in a single-threaded process, unless you have circa 64GB of RAM in your system.
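The back-of-envelope arithmetic, spelled out:

```perl
use strict;
use warnings;

# Measured: a shared hash of 5,000,000 pairs costs 2.0GB
my $gb_per_pair = 2.0 / 5_000_000;          # ~429 bytes per key/value pair
my $pairs       = 12 * 21_000_000;          # 12 sub-hashes of 21M pairs each
printf "%.1f GB\n", $pairs * $gb_per_pair;  # prints "100.8 GB"
```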

Basically, I think you need to re-think the way you are tackling this problem. Is it really necessary to have all that data in memory concurrently?
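For instance, if whatever calculation you are doing reduces each file to something small -- counts, sums, min/max -- then you never need the raw data in memory at all. A sketch, with entirely hypothetical record format and tallying logic:

```perl
use strict;
use warnings;

my %tally;      # small aggregate result; the raw lines are never retained

for my $file ( glob '*.dat' ) {                 # hypothetical file set
    open my $in, '<', $file or die "$file: $!";
    while ( my $line = <$in> ) {
        chomp $line;
        my( $key ) = split /\t/, $line;         # hypothetical tab-separated records
        $tally{ $key }++;                       # keep only the reduced value
    }
}
```

Each file is read sequentially and its memory is released before the next one is opened, so the footprint stays at one aggregate hash rather than 100GB of raw data.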

What is the data? What calculations are you performing?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: problem of my multithreading perl script by BrowserUk
in thread problem of my multithreading perl script by qingfengzealot
