Re^5: Mechanize Firefox text Method
by afoken (Monsignor) on May 05, 2013 at 18:27 UTC
Let me clarify a bit. Since I can read the documents in the browser I know they contain only text so OCR is not an issue.
I think we have a little communication problem: Sure, you can read text displayed in Firefox, because it was rendered from something like <html><body><h1>Hello</h1>. But you can also read text displayed in Firefox that was rendered from something like <html><body><img src="http://www.example.com/pics/hello.gif" alt="">. Your computer can't, at least not as easily as you can. To extract the text from the latter, you need OCR.
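To make the difference concrete, here is a rough sketch of what a program "sees" in the two cases. It is written in Python with only the standard library rather than with the Mechanize tooling discussed in this thread, and the names TextExtractor and extract are made up for illustration. The two HTML strings are the fragments from above, with closing tags added.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and remember image URLs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.text = []    # machine-readable text found in the markup
        self.images = []  # image sources; their pixels are opaque to us

    def handle_data(self, data):
        # Called for every text node; skip whitespace-only nodes.
        if data.strip():
            self.text.append(data.strip())

    def handle_starttag(self, tag, attrs):
        # An <img> contributes no text, only a URL (and maybe an alt text).
        if tag == "img":
            self.images.append(dict(attrs).get("src"))


def extract(html):
    """Return (text nodes, image URLs) found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text, parser.images


as_text = '<html><body><h1>Hello</h1></body></html>'
as_image = ('<html><body><img src="http://www.example.com/pics/hello.gif"'
            ' alt=""></body></html>')

print(extract(as_text))   # (['Hello'], [])
print(extract(as_image))  # ([], ['http://www.example.com/pics/hello.gif'])
```

In the second case the parser finds no text at all, just a URL pointing at pixels. Whatever "Hello" a human reads there exists only inside the image, and getting it out takes OCR.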
All the documents follow a similar set of templates but the content changes for each.
Any chance to get access to the data before the template engine creates the PDF? Perhaps as XML, JSON, CSV or even HTML?
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)