
Re: (OT) question about clustering

by kvale (Monsignor)
on Oct 01, 2008 at 22:43 UTC ( #714915=note )

in reply to (OT) question about clustering

I've done quite a bit of programming on Linux clusters. Some of that programming has even been with perl. Extracting and digesting documents is a trivially parallelizable problem--just distribute documents among the different machines and let each one chug away.
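To make the "distribute and chug" idea concrete, here is a minimal core-Perl sketch of the same pattern on a single machine: split a document list into chunks and fork one worker per chunk. The file names and worker count are made up for illustration; on a real cluster you would launch one such worker per node (via ssh or a batch scheduler), but the chunking logic is identical.

```perl
#!/usr/bin/perl
# Embarrassingly parallel document processing: round-robin the
# documents across workers, fork the workers, wait for them all.
use strict;
use warnings;

my @documents = map { "doc$_.txt" } 1 .. 10;   # stand-in for real file paths
my $n_workers = 3;

# Round-robin the documents across workers.
my @chunks;
push @{ $chunks[ $_ % $n_workers ] }, $documents[$_] for 0 .. $#documents;

my @pids;
for my $chunk (@chunks) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {                  # child: chug through its share
        process_doc($_) for @$chunk;
        exit 0;
    }
    push @pids, $pid;
}
waitpid( $_, 0 ) for @pids;             # parent: wait for all workers

sub process_doc {
    my ($doc) = @_;
    # Extraction/digestion would go here; this stub just reports.
    print "processed $doc\n";
}
```

For forking across many jobs with a bounded pool, Parallel::ForkManager on CPAN wraps this same fork/waitpid dance more conveniently.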

The bottlenecks are (probably) in transferring the documents onto the cluster nodes and in pushing the resulting data out to an SQL Server backend. For the first bottleneck, speed depends on the network into the cluster if you are accessing it remotely, and on the network speed of the cluster itself; slow disks can be a bottleneck here, too. For the second, talk to an SQL Server DBA about the effective bandwidth of your server and/or clustering solutions for the database itself.

This sort of thing is done by Google all the time on huge clusters, so it may pay to look at their map-reduce paradigm for distributed document processing. Here, the map part is the text extraction and the reduce part is getting all that information organized in an SQL Server database.
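A toy illustration of that split, assuming a word-count style job: "map" turns each document into key/value pairs, "reduce" merges them. On a real cluster the map calls run on different nodes and the pairs are shuffled to the reducers; here both phases run in one process just to show the data flow. The sample documents are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %docs = (
    'a.txt' => 'the quick brown fox',
    'b.txt' => 'the lazy dog',
);

# Map phase: one list of [word, 1] pairs per document.
sub map_doc {
    my ($text) = @_;
    return map { [ lc($_), 1 ] } split /\W+/, $text;
}

# Reduce phase: sum the counts for each word.
my %counts;
for my $text ( values %docs ) {
    $counts{ $_->[0] } += $_->[1] for map_doc($text);
}

print "$_: $counts{$_}\n" for sort keys %counts;
```

In your setting, map_doc would be the per-document text extraction and the reduce phase would be the batched inserts into the database.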


Replies are listed 'Best First'.
Re^2: (OT) question about clustering
by chuckd (Scribe) on Oct 02, 2008 at 01:45 UTC
    Hi Mark, it seems like you have experience with text extraction, based on your reply to my post. I posted another thread on PerlMonks; can you give me any advice on the post below?
    I'm looking for someone who might have advice on building a file extraction tool. My company currently uses LAW PreDiscovery to extract text and metadata from files like .msg, .doc, .pdf, .jpg, .gif, etc. This out-of-the-box tool has many limitations that cause problems for me and other engineers in our group, so we have been thinking about writing our own. I've looked at different modules on CPAN and found many things that I think might help, but don't know if they are any good. Does anyone have experience building or writing tools for extracting text from files? If so, what did you use, how did you do it, how big was your project, did you use modules, and did you use any APIs or .dlls from Microsoft for the Microsoft formats (.doc, .ppt, .xls, etc.)?

    We are looking to build an in house tool to do all our extraction and need advice on where to start.
      Most of my experience with text extraction is in the context of parsing and extracting data and metadata from files produced in scientific experiments. With the exception of .xls documents, these are custom formats for which I would create compilers to extract and transform what text I wanted.

      File formats are little computer languages in disguise, so the general approach of creating a compiler from the format you start with to the format you want will always work. In practice, writing a compiler for each format can be an arduous process, made difficult by incomplete file format specifications (the .doc format, for example).

      In your case, you are ETL'ing standard, albeit very different, formats. If I were you, I would take advantage of the programs that create these formats to do the extraction. Use Microsoft Word to convert .doc files into plain text. Use Excel to convert .xls files to CSV. These conversions can be scripted easily enough using VBA/Visual Basic/Visual C# and will extract all the text there is to extract in the documents. From there, it is easy to write perl programs to transform the resulting text to your custom needs.
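      A sketch of that last step, assuming Excel has already dumped a sheet to CSV. The column names and sample rows are made up for illustration, and the split-on-comma parsing only handles simple CSV; for fields with quotes or embedded commas, use Text::CSV from CPAN instead.

```perl
#!/usr/bin/perl
# Reshape an Excel-exported CSV into per-row hash records.
use strict;
use warnings;

my $csv = <<'END';          # stand-in for the Excel-exported file
id,title,pages
1,Quarterly report,14
2,Meeting notes,3
END

my @lines  = split /\n/, $csv;
my @fields = split /,/, shift @lines;    # header row gives field names

my @records;
for my $line (@lines) {
    my @values = split /,/, $line;
    my %rec;
    @rec{@fields} = @values;             # hash slice: name => value
    push @records, \%rec;
}

printf "%s (%d pages)\n", $_->{title}, $_->{pages} for @records;
```

Once the rows are hashes, transforming them to whatever your SQL loader expects is straightforward.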

