http://www.perlmonks.org?node_id=714915


in reply to (OT) question about clustering

I've done quite a bit of programming on Linux clusters. Some of that programming has even been with Perl. Extracting and digesting documents is a trivially parallelizable problem--just distribute the documents among the different machines and let each one chug away.
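To make "distribute the documents among the machines" concrete, here is a minimal sketch in Python of dealing a document list out round-robin. The `partition` function, the file names, and the node count are all made up for illustration; a real cluster would more likely hand out work through its job scheduler or a shared queue.

```python
def partition(documents, n_nodes):
    """Deal documents round-robin into n_nodes work lists (hypothetical helper)."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    # Slice i takes every n_nodes-th document starting at offset i.
    return [documents[i::n_nodes] for i in range(n_nodes)]


# Hypothetical input: seven document names spread over three nodes.
docs = [f"doc{i}.pdf" for i in range(7)]
shards = partition(docs, 3)
```

Each node then works through its own list independently, which is why the problem parallelizes so easily: no node needs to know what the others are doing.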

The bottlenecks are (probably) in transferring the documents into the cluster nodes and in pushing the resulting data out to an SQL Server backend. For the first bottleneck, speed depends on the network into the cluster, if you are accessing it remotely, and on the network speed within the cluster itself. If you have slow disks, they can be a bottleneck, too. For the second bottleneck, talk to an SQL Server DBA about the effective bandwidth of your server and/or clustering solutions for the database itself.

This sort of thing is done by Google all the time on huge clusters, so it may pay to look at their map-reduce paradigm for distributed document processing. Here, the map part is the text extraction and the reduce part is getting all that information organized in an SQL Server database.
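A rough single-machine sketch of that map/reduce split, in Python, might look like the following. Everything here is an assumption for illustration: `extract_text` is a placeholder for the real extraction step, the documents are invented, and an in-memory SQLite database stands in for the SQL Server backend. On a real cluster each map call would run on a different node.

```python
import sqlite3
from functools import reduce


def extract_text(doc):
    """Map step: turn one (name, raw_text) pair into a row (placeholder logic)."""
    name, raw = doc
    return (name, len(raw.split()))


def collect(rows, record):
    """Reduce step: accumulate rows so they can go to the database in one batch."""
    rows.append(record)
    return rows


# Hypothetical documents; real ones would be read from disk or the network.
documents = [("a.txt", "hello cluster world"), ("b.txt", "map reduce")]

rows = reduce(collect, map(extract_text, documents), [])

# SQLite stands in for the SQL Server backend in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (name TEXT, word_count INTEGER)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", rows)
conn.commit()
```

Batching the inserts, rather than sending one row per document, is one way to be gentler on the database end, which is where the second bottleneck lives.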

-Mark