Re^3: TCP socket and fork

by afoken (Chancellor)
on Jul 04, 2009 at 12:04 UTC ( [id://777217] )


in reply to Re^2: TCP socket and fork
in thread TCP socket and fork

You have various options, as I explained above.

  • Use a single socket and the HTTP/1.1 keepalive feature, then play request-response ping-pong over that socket. No threads required, just use a single LWP::UserAgent instance for this. This avoids that little bit of TCP handshake, but serialises all your requests.
  • Fork as many threads or processes as you like, and let each process fetch one resource, nearly as you do now, with 20 independent instances of LWP::UserAgent behind the scenes. This costs many TCP handshakes, but allows you to saturate your network connection (or that of the server).
  • Mix both approaches. Create a controlling thread/process that forks several slaves (let's just say four), then gives each slave a new URL to fetch as soon as the slave is idle. Use keepalive in each of the slaves. This uses most of your bandwidth and avoids some TCP handshakes. Note that the number of requests processed by each slave depends entirely on how fast it can handle its job. A slave that has to fetch a gigabyte of data will probably process only one request, while other slaves that get tiny responses will process lots of requests.
  • Simplified mix: Create just a bunch of slaves (again, let's assume four slaves), each with a constant fraction of the URL list to be processed (five entries, in this example). This does not balance as well, but requires less code (see the sketch right after this list). If one unlucky slave has to process five gigabyte-sized responses, while the other slaves got away with a few kilobytes, you will wait a long time for the last slave.
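
To make the simplified mix concrete, here is a minimal sketch, assuming plain fork plus LWP::UserAgent's keep_alive option; the URL list, the number of slaves, and the example.com addresses are made up for illustration:

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Hypothetical URL list and slave count, just for illustration.
    my @urls   = map { "http://www.example.com/page$_" } 1 .. 20;
    my $slaves = 4;

    my @kids;
    for my $n ( 0 .. $slaves - 1 ) {
        # every $slaves-th URL goes to slave $n (a fixed, unbalanced split)
        my @mine = @urls[ grep { $_ % $slaves == $n } 0 .. $#urls ];

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {    # child: one keep-alive UA, one fixed chunk of URLs
            my $ua = LWP::UserAgent->new( keep_alive => 1 );
            for my $url (@mine) {
                my $response = $ua->get($url);
                warn "$url: ", $response->status_line, "\n"
                    unless $response->is_success;
                # ... do something with $response->decoded_content here ...
            }
            exit 0;
        }
        push @kids, $pid;
    }
    waitpid $_, 0 for @kids;    # parent waits for all slaves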

Why are you so worried about TCP handshakes? A TCP handshake requires three TCP packets. A simple GET request adds one more packet, and the response uses roughly one packet for the HTTP headers and then two packets for every three KBytes of data. (Assuming we are talking about Ethernet, PPP or PPPoE.) As soon as your response is larger than a few KBytes, the TCP handshake does not really matter. If you (ab)use HTTP as a way to transport tons of tiny messages in some RPC protocol, the TCP handshake really matters.
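
For a back-of-the-envelope feel for those numbers, this little script (my own illustration, using the rough packet counts above; the response sizes are arbitrary) prints the handshake's share of all packets:

    use strict;
    use warnings;

    # 3 handshake packets, 1 for the GET, ~1 for the response headers,
    # ~2 packets per 3 KBytes of response body.
    for my $kbytes ( 3, 30, 3000 ) {
        my $body  = int( 2 * $kbytes / 3 );
        my $total = 3 + 1 + 1 + $body;
        printf "%5d KByte response: %5d packets, handshake share %.1f%%\n",
            $kbytes, $total, 100 * 3 / $total;
    }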

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^4: TCP socket and fork
by adismaug (Acolyte) on Jul 04, 2009 at 12:26 UTC
    Dear Alexander,
    My concern is not about the TCP handshake but more on the server resources.
    Let's say I need to get 10 pages from the server, and each page takes 1 second to retrieve. So in theory, if I use 10 threads, it should take one second to get 1 page or 10 pages.
    The problem starts when the server is busy with a lot of requests: it needs to allocate resources for each request, and therefore some of them take longer than 1 second.
    If my client used the same resource allocated by the server (the same TCP socket) for all my threads, then I would not suffer from the delay, because the server has already allocated the resources for me.
    On the other hand, if each thread opens a new socket, then for some threads the server will delay the reply.
    This is a big problem and I cannot find a way to solve it.
    Any ideas?
    Best Regards,
    Adi.
      My concern is not about the TCP handshake but more on the server resources.

      So you want to be a friendly net citizen. Good. Open one connection, with keep-alive, and ask for one resource after the other. This is the solution with the least load on the server.
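
      A minimal sketch of that single-connection loop, assuming LWP::UserAgent with its keep_alive option; the URLs and the timeout value are only placeholders:

          use strict;
          use warnings;
          use LWP::UserAgent;

          # One connection, keep-alive, strictly one request after the other.
          my $ua = LWP::UserAgent->new( keep_alive => 1, timeout => 30 );

          for my $url ( map { "http://www.example.com/page$_" } 1 .. 10 ) {
              my $response = $ua->get($url);
              if ( $response->is_success ) {
                  # ... process $response->decoded_content here ...
              }
              else {
                  warn "$url: ", $response->status_line, "\n";
              }
          }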

      The problem starts when the server is busy with a lot of requests: it needs to allocate resources for each request, and therefore some of them take longer than 1 second.

      Why do you insist on that one second timeout? There is no guaranteed response time in Ethernet, IP, TCP or HTTP. Even your own OS does not guarantee that your program will run again one second after issuing a system call. Sure, most times, it will run again much earlier. And like with your own OS, most times, an HTTP server somewhere on this planet answers within one or a few seconds. If you have realtime requirements, use realtime operating systems, realtime networks, and realtime protocols. Standard Ethernet, most Linux, *BSD, and Windows distributions, Apache, Perl and HTTP all can't cope with realtime requirements. (Note that realtime does not mean fast response times. It means guaranteed response times under all conditions, even under maximum load. Defining what is guaranteed is a completely different story and may end with a guaranteed response time of five seconds.) If you don't have realtime requirements, forget that funny timeout.

      Your server seems to be too small or too loaded. Think about the algorithms used on the server. Can you use an O(n) or an O(n log n) algorithm where you currently use an O(n²) algorithm? Think about pre-calculating data and caching on the server, for example with memcached. Can the server deliver you an all-in-one document containing everything you really want, probably unstyled and packed, instead of having you spider tons of pages?

      Throw some money at the problem: If we talk about an all-in-one server with email, web, database, and loads of other jobs, spread the jobs across several dedicated servers. Use one machine for web, one machine for the database, one machine for email, one machine for each of the other jobs. Separating web server and database usually gives you the "biggest bang". Get a server with faster CPUs, perhaps multi-core CPUs, a faster hard disk subsystem, more RAM, and a faster network connection. If off-the-shelf hardware doesn't help, ask experts for high-performance computing. Don't be scared by the prices. You can easily spend more money on a single high-performance computer than on a new car.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
