The idea is to ensure that the CPU-bound part of the process never has to wait for data, and so uses as many of the timeslices available to it as possible.

The water marks allow you to easily tailor the threading to maximise throughput.

Using queues makes it easy to have more than one thread handling the slow part(s) of the processing. Each thread is identical; you just start more of them. They all read their input from the same queue. You don't get this easy flexibility using pipes.

If the processing of the data is the bottleneck you start two threads for that. If outputting to the DB is the bottleneck, have two threads doing that.
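A minimal sketch of that arrangement, assuming a threads-enabled perl with Thread::Queue. The queue names, `$N_WORKERS`, and the `split` standing in for the slow step are all made up for illustration; they are not from the OP's code.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $work    = Thread::Queue->new;   # lines waiting to be processed
my $results = Thread::Queue->new;   # what the workers produce

my $N_WORKERS = 2;                  # hypothetical; raise it if this stage is the bottleneck

# Every worker runs the same code and pulls from the same queue.
sub worker {
    while ( defined( my $line = $work->dequeue ) ) {   # undef means "no more work"
        my @fields = split /,/, $line;                 # stand-in for the slow step
        $results->enqueue( scalar @fields );
    }
}

my @workers = map { threads->create( \&worker ) } 1 .. $N_WORKERS;

$work->enqueue("a,b,c") for 1 .. 100;    # stand-in for reading the input file
$work->enqueue(undef) for @workers;      # one terminator per worker
$_->join for @workers;

my $done = $results->pending;
print "processed $done lines\n";
```

Changing the number of threads on a stage is a one-line change to `$N_WORKERS`; nothing else needs to know how many workers there are.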

If the DB is running in the same box (with 2 CPUs) then it will likely dominate one of them and all the threads will basically share the other. If the DB is on a different box, then the CPU-bound thread may dominate one processor and the IO/DB threads share the other.

The yielding should rarely come into play once you get the right watermark levels established, but it acts as a safeguard for the situations where either the IO or DB slows up: someone does a grep on the disk or hits the DB with a heavy query. It prevents the queue from filling memory whilst the processing at the other end is blocked.
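Roughly, the watermark check could look like the sketch below. It assumes a threads-enabled perl with Thread::Queue; the levels in `$LOW` and `$HIGH` are hypothetical and would need tuning by measurement, and the counter stands in for the real DB insert.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my ( $LOW, $HIGH ) = ( 50, 200 );    # hypothetical levels; tune by measurement
my $q = Thread::Queue->new;

# Consumer: stand-in for the thread that writes to the DB.
my $consumer = threads->create( sub {
    my $count = 0;
    while ( defined( my $item = $q->dequeue ) ) {
        $count++;                    # a real version would do the insert here
    }
    return $count;
} );

my $max_seen = 0;
for my $n ( 1 .. 5_000 ) {
    # At the high-water mark: stop producing and give up the timeslice
    # until the consumer has drained the queue down to the low mark.
    if ( $q->pending >= $HIGH ) {
        threads->yield while $q->pending > $LOW;
    }
    $q->enqueue("record $n");
    my $depth = $q->pending;
    $max_seen = $depth if $depth > $max_seen;
}
$q->enqueue(undef);                  # tell the consumer we are finished
my $consumed = $consumer->join;
print "consumed $consumed, deepest queue $max_seen\n";
```

With a single producer the queue can never grow past `$HIGH`, so memory use stays bounded however long the consumer is blocked.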

The reasons I would try threads are:

I'm more familiar with the threading model (forking is only threading under the covers, and without the control, where I live).
I think that IPC through shared memory is more convenient and easier to program than through the flat stream of a pipe.
You can share structured data using threads. I'm not yet certain whether threading is up to large-scale production use, but it is much improved in 5.8.3.

This final point is quite important for the OP's application. Basically he is reading lines, splitting them into chunks, and then throwing them into a DB. The DB IO is quite likely to be the slowest part of the overall processing.

If, having split the lines into chunks, he then has to serialise those chunks to pass them through a pipe to the DB process, he hasn't gained anything by splitting out the DB process.

He would then have to deserialise it and the serialisation/deserialisation is likely to take much the same amount of time as the splitting, which negates the reason for having a separate process for the DB IO.
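For comparison, a sketch of handing the already-split fields to a DB thread as an array reference. This assumes Thread::Queue 2.0 or later, which copies nested data into shared memory for you; that copy is not free, but there is no hand-written serialise/deserialise step. The sample lines and the `join` standing in for the DB insert are invented for illustration.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new;

# Stand-in for the DB thread: it receives the fields already split.
my $db = threads->create( sub {
    my @rows;
    while ( defined( my $fields = $q->dequeue ) ) {
        push @rows, join '|', @$fields;   # a real version would bind these to an INSERT
    }
    return join ';', @rows;
} );

for my $line ( "1,alice,admin", "2,bob,user" ) {
    $q->enqueue( [ split /,/, $line ] );  # pass the array ref, no join/split round trip
}
$q->enqueue(undef);                       # no more rows

my $seen = $db->join;
print "$seen\n";
```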

I can't honestly say whether my thoughts would result in faster overall processing. There are too many factors involved. I don't have a dual processor machine to test on. There are many details that the OP hasn't supplied: where is the DB? How much indexing is on the DB? Is the DB shared with other applications? etc.

Until someone actually tries some of this stuff using threads, nobody knows how it will stand up. Until recently, memory leaks prevented any worthwhile testing. With 5.8.3, that seems to be getting much better to the point where it is now worth trying stuff out again.

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

In reply to Re^2: faster with threads? by BrowserUk
in thread faster with threads? by js1
