Let me clarify a bit. Since I can read the documents in the browser I know they contain only text so OCR is not an issue.

I think we have a little communication problem: Sure, you can read text displayed in Firefox, because it was rendered from something like <html><body><h1>Hello</h1>. But you can also read text displayed in Firefox that was rendered from something like <html><body><img src="http://www.example.com/pics/hello.gif" alt="">. Your computer can't, at least not as easily as you can. To extract the text from the latter, you need OCR.
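
To make the difference concrete, here is a minimal sketch using only the Python standard library (chosen for brevity; the class name TextAndImages and the helper inspect are made up for this illustration and are not part of any Mechanize API). It lists what a program can pull out of the two example documents above without OCR: real text nodes on one hand, and mere image URLs on the other.

```python
from html.parser import HTMLParser


class TextAndImages(HTMLParser):
    """Collect text nodes and image URLs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.text = []    # text a program can read directly
        self.images = []  # image sources; any words in them would need OCR

    def handle_data(self, data):
        # Keep only non-blank text nodes.
        if data.strip():
            self.text.append(data.strip())

    def handle_starttag(self, tag, attrs):
        # Record where each image comes from; its pixels are opaque to us.
        if tag == "img":
            self.images.append(dict(attrs).get("src"))


def inspect(html):
    """Return (text_nodes, image_urls) found in the given HTML string."""
    parser = TextAndImages()
    parser.feed(html)
    parser.close()
    return parser.text, parser.images


as_text = "<html><body><h1>Hello</h1></body></html>"
as_image = ('<html><body><img src="http://www.example.com/pics/hello.gif"'
            ' alt=""></body></html>')

print(inspect(as_text))   # the word "Hello" is available as text
print(inspect(as_image))  # no text at all, only the URL of a picture
```

If your documents behave like the second case, a text method has nothing to return, no matter how readable the page looks to you.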

All the documents follow a similar set of templates but the content changes for each.

Any chance of getting access to the data before the template engine creates the PDF? Perhaps as XML, JSON, CSV or even HTML?

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re^5: Mechanize Firefox text Method by afoken
in thread Mechanize Firefox text Method by halweitz
