I would like to collect words from a PDF or Word document.
So far Perl Power Tools does a very good job. Thanks | [reply] |
For collecting words from PDF documents, you can use
the ps2ascii utility, which comes with
Ghostscript. It executes the document with Ghostscript,
using a special device that outputs only ASCII text.
As Ghostscript can handle PDFs too, ps2ascii works fine on
them (although I did have
compatibility problems with some PDFs, depending
on the generating program and the version of Ghostscript).
This doesn't work for Word documents, of course.
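
In case it helps, here is a rough Python sketch of the ps2ascii approach described above. It assumes ps2ascii is installed and on your PATH, and that calling it with only an input file writes the text to standard output. The helper names (pdf_to_text, collect_words) and the simple definition of a "word" are my own assumptions, not anything from Ghostscript.

```python
import re
import subprocess
import sys
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally with one
# internal apostrophe (e.g. "don't"). Adjust to taste.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def collect_words(text):
    """Return a Counter of the lowercased words found in text."""
    return Counter(w.lower() for w in WORD_RE.findall(text))


def pdf_to_text(pdf_path):
    """Run ps2ascii on pdf_path and return what it prints to stdout."""
    result = subprocess.run(
        ["ps2ascii", pdf_path],
        capture_output=True,
        text=True,
        errors="replace",  # tolerate odd bytes in the extracted text
        check=True,        # raise if ps2ascii exits with an error
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    counts = collect_words(pdf_to_text(sys.argv[1]))
    for word, count in counts.most_common(20):
        print(f"{count:6d}  {word}")
```

Run it with the PDF path as the only argument to print the 20 most frequent words. If a PDF trips up ps2ascii, check=True makes the script fail loudly rather than count words in empty output.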
| [reply] |
OP, you may have some luck loading the MS Word document into (Star|Open)Office, printing to PDF, then chucking it at ps2ascii. Formatting is exactly what *Office finds hardest to get right, and ASCII keeps little remnant of it, so I guess you could have a lot of luck.
update
As ambrus points out below, if you can read the Word doc into *Office then of course you can just export ASCII from there. Sorry, it has been a rather long day.
You may also want to trawl through a list of filters; I found this one, which looks like it may have some tools that could help.
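
For what it's worth, the export-to-text step can be scripted rather than done through the GUI. This is only a sketch: it assumes an office build that provides the soffice binary with headless conversion (--headless, --convert-to, --outdir), which LibreOffice does; older StarOffice or OpenOffice versions may not. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def office_export_command(doc_path, out_dir, office_binary="soffice"):
    """Build the command line that converts doc_path to plain text."""
    return [
        office_binary,
        "--headless",              # no GUI
        "--convert-to", "txt:Text",  # plain-text export filter
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def word_to_text(doc_path, out_dir="."):
    """Convert a Word document to text; return the expected output path."""
    subprocess.run(office_export_command(doc_path, out_dir), check=True)
    # The converter names the output after the input file's stem.
    return Path(out_dir) / (Path(doc_path).stem + ".txt")
```

The resulting .txt file can then be fed to whatever word-collecting tool you already use.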
Cheers, R.
Pereant, qui ante nos nostra dixerunt!
| [reply] |