Benign Web Miner
by perlmonkey2 (Beadle)
on Sep 30, 2006 at 17:00 UTC
perlmonkey2 has asked for the
wisdom of the Perl Monks concerning the following question:
For a while now, I've been working on a problem for my academic research center. It goes like this: we start with a list of URLs, mine the text from those pages, and then follow the links found on them. Because we mine a lot of text, disk fragmentation is a huge issue. Previously we used Wget, but 50 Wget instances writing to disk at the same time quickly brought the mean file fragment size on our NTFS volume down to 4KB. So we planned a replacement for Wget that would write only one file to disk at any given moment.
The problem with the existing Perl solutions is that they lack the functionality we need. They need to be able to block while other instances are writing to disk, and they should have options to span hosts, accept a huge list of domains to avoid, ignore certain file extensions, and set a recursion depth, i.e. most of the main features of Wget.
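To make those requirements concrete, here is a minimal sketch of the filtering rules as a single predicate. Everything in it is hypothetical: the `%config` keys, `host_of`, and `should_fetch` are names invented for illustration, not part of Wget or any CPAN module, and the regex-based host extraction is a simplification (the URI module would be more robust).

```perl
use strict;
use warnings;

# Hypothetical settings; none of these names come from an existing module.
my %config = (
    max_depth  => 3,                              # recursion limit
    span_hosts => 0,                              # 0 = stay on the seed hosts
    seed_hosts => { 'example.org' => 1 },         # hosts of the starting URLs
    deny_hosts => { 'ads.example.net' => 1 },     # the "don't go there" list
    skip_ext   => { map { $_ => 1 } qw(jpg gif png zip exe pdf) },
);

# Pull the lowercased host out of an http(s) URL, or return undef.
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:?#]+)}i ? lc $1 : undef;
}

# Return 1 if the URL should be fetched at the given link depth, else 0.
sub should_fetch {
    my ($url, $depth, $cfg) = @_;
    return 0 if $depth > $cfg->{max_depth};

    my $host = host_of($url);
    return 0 unless defined $host;
    return 0 if $cfg->{deny_hosts}{$host};
    return 0 if !$cfg->{span_hosts} && !$cfg->{seed_hosts}{$host};

    # Check the extension on the path only, ignoring query and fragment.
    (my $path = $url) =~ s{[?#].*}{}s;
    if ($path =~ m{^https?://[^/]+/.*\.([A-Za-z0-9]+)$}i) {
        return 0 if $cfg->{skip_ext}{ lc $1 };
    }
    return 1;
}
```

With a deny list held in a hash, each lookup is a single key test, so even a very large list should not slow the check noticeably.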
So far I've made a hash of it using Win32 threads and LWP::UserAgent, as the number of threads had to stay low or Perl.exe would die. To be nice to the servers, very few hits per minute are allowed per server. This means that even the highest stable number of Win32 threads (50) goes VERY slowly. The solution was to have the threads share the list of URLs to fetch and make sure no single domain was hit very often; with a thousand domains, 50 threads could then move very quickly. This led to problems with threads::shared on Win32 not working well with extremely large data structures (I had 50 miners writing to a queue from which a single writer wrote to disk).
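The per-domain politeness idea can be sketched without threads at all: keep one queue per host plus the earliest time that host may be hit again, and always hand out a URL from a host that is ready. This is only an illustration under assumptions, not the code I was running; `new_scheduler`, `add_url`, and `next_url` are hypothetical names, and the caller passes in the current time (for example from `time()`).

```perl
use strict;
use warnings;

# Hypothetical scheduler: one FIFO queue per host, plus the earliest
# time at which each host may be contacted again.
sub new_scheduler {
    my ($min_delay) = @_;    # seconds required between hits on one host
    return { delay => $min_delay, queue => {}, ready_at => {} };
}

# Queue a URL under its host; URLs without an http(s) host are ignored.
sub add_url {
    my ($s, $url) = @_;
    my ($host) = $url =~ m{^https?://([^/:?#]+)}i or return;
    push @{ $s->{queue}{ lc $host } }, $url;
    return 1;
}

# Return the next URL whose host may be hit at time $now, or undef if
# every queued host is still cooling down. Hosts are scanned in sorted
# order for predictability; a round-robin order would be fairer.
sub next_url {
    my ($s, $now) = @_;
    for my $host (sort keys %{ $s->{queue} }) {
        next if ($s->{ready_at}{$host} // 0) > $now;
        my $url = shift @{ $s->{queue}{$host} };
        delete $s->{queue}{$host} unless @{ $s->{queue}{$host} };
        $s->{ready_at}{$host} = $now + $s->{delay};
        return $url;
    }
    return;
}
```

With a thousand hosts queued and a 30 second delay, some host is almost always ready, so the fetchers stay busy while each individual server still sees only a couple of hits per minute.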
So now I'm thinking that LWP::Parallel::UserAgent will resolve the issue, as it is single-threaded yet able to fetch from many sites at the same time.
If anyone has any thoughts, ideas, or recommendations, I would appreciate it.