Re^3: Advice on Efficient Large-scale Web Crawling



by salva (Canon)

on Dec 19, 2005 at 14:22 UTC

in reply to Re^2: Advice on Efficient Large-scale Web Crawling
in thread Advice on Efficient Large-scale Web Crawling

To achieve this with the current architecture I'm limited to about 12-15 concurrent processes

That limit seems too low for the task you want to accomplish, especially if you have a good internet connection. Have you actually tried increasing it to 30 or even 50? Forking is not that expensive on modern Unix/Linux systems that support copy-on-write (COW).

Update: actually, much of the overhead generated by the forked processes can be caused by Perl cleaning everything up when each process exits. On Unix this cleanup is mostly useless, since the operating system reclaims the process's memory anyway, and you can skip it by calling

exec $ok ? '/bin/true' : '/bin/false';

instead of calling exit to terminate child processes. Just remember to close any files you have written to first.
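
As a rough sketch of where that line would sit in a forking worker: the `run_child` helper and the code-ref job below are hypothetical, not part of the original crawler, and the `/bin/true` and `/bin/false` paths are the ones from the post and may differ on your system. The child replaces itself with a trivial program, so Perl's own cleanup never runs in it.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $job in a forked child and report whether it succeeded.
sub run_child {
    my ($job) = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: do the work, trapping errors so we always reach the exec.
        my $ok = eval { $job->() } ? 1 : 0;

        # Close any files the job wrote to before this point.
        # Replace the process image instead of calling exit, so Perl
        # never gets to run its cleanup in the child.
        my $prog = $ok ? '/bin/true' : '/bin/false';
        exec { $prog } $prog
            or POSIX::_exit(127);    # only reached if exec itself failed
    }

    # Parent: wait for the child and map exit status 0 to "success".
    waitpid($pid, 0);
    return $? == 0;
}
```

The parent then sees exit status 0 when the job returned a true value, and a non-zero status otherwise, for example `run_child(sub { 1 })` versus `run_child(sub { 0 })`.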


Replies are listed 'Best First'.
Re^4: Advice on Efficient Large-scale Web Crawling by Celada (Monk) on Dec 19, 2005 at 15:28 UTC
That's what `POSIX::_exit` is for. Exit the process without giving Perl (or anything else) a chance to clean up.
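
A minimal sketch of the same worker using `POSIX::_exit`, assuming a hypothetical `run_child_fast_exit` helper and a code-ref job (both are illustrative names, not from the thread). Since `_exit` skips END blocks, destructors and buffer flushing, anything the child wrote should be closed or flushed before the call.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: fork, run $job, and leave the child via POSIX::_exit.
sub run_child_fast_exit {
    my ($job) = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: trap errors so a dying job still ends in _exit.
        my $ok = eval { $job->() } ? 1 : 0;

        # Close or flush any output files here; _exit will not do it.
        # Exit status follows the shell convention: 0 means success.
        POSIX::_exit($ok ? 0 : 1);
    }

    # Parent: wait for the child and map exit status 0 to "success".
    waitpid($pid, 0);
    return $? == 0;
}
```

Compared with the `exec` trick, this avoids depending on external programs and does not load a new process image.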
