http://www.perlmonks.org?node_id=1040309


in reply to Re: exporting PDF::API2
in thread exporting PDF::API2

In some PDFs, especially ones created by going from, say, Illustrator to Acrobat through Adobe Distiller, each letter of text gets flattened into color-filled polygons/Bézier curves. There is no text in the COS tree of the PDF, just PostScript-style path drawing operators. I think OCR is the only way to get back the machine-readable meaning of the text. A WAG: since it all came from one font in a vector graphics program, you could try to programmatically checksum each polygon and compare it against a known checksum for the polygon of each letter, which a human has IDed beforehand. I would look for a library that does this already; implementing it on your own is insane.
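
To make the checksum idea concrete, here is a rough sketch in Python (the same approach would carry over to Perl with Digest::SHA). It assumes you have already pulled each glyph's outline out of the content stream as a list of (operator, points) pairs; that input format and the names outline_checksum and identify are made up for illustration and are not part of PDF::API2 or any other library. The outline is shifted so its lower-left corner sits at the origin, which lets the same letter match wherever it lands on the page.

```python
import hashlib


def outline_checksum(path, precision=2):
    """Hash a glyph outline independent of its position on the page.

    path is a hypothetical pre-extracted form: a list of
    (operator, [(x, y), ...]) pairs, e.g. ("m", [(10.0, 20.0)]).
    """
    points = [pt for _, pts in path for pt in pts]
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    tokens = []
    for op, pts in path:
        # Translate to the origin and fix the number of decimals so tiny
        # float noise does not change the hash.
        coords = " ".join(
            f"{x - min_x:.{precision}f},{y - min_y:.{precision}f}"
            for x, y in pts
        )
        tokens.append(f"{op}:{coords}")
    return hashlib.sha1("|".join(tokens).encode("ascii")).hexdigest()


def identify(path, known):
    """Look up an outline in a {checksum: letter} table; None if unknown."""
    return known.get(outline_checksum(path))


# A crude "L" outline that a human has labelled once.
letter_L = [
    ("m", [(0.0, 0.0)]), ("l", [(0.0, 10.0)]), ("l", [(2.0, 10.0)]),
    ("l", [(2.0, 2.0)]), ("l", [(6.0, 2.0)]), ("l", [(6.0, 0.0)]),
    ("h", []),
]
known = {outline_checksum(letter_L): "L"}

# The same shape drawn elsewhere on the page still matches.
shifted = [(op, [(x + 100.0, y + 50.0) for x, y in pts]) for op, pts in letter_L]
print(identify(shifted, known))
```

An exact-match hash like this is brittle: it only works if every occurrence of a letter was emitted with the same operators at the same scale. Different point sizes, rotation, or a different curve decomposition would all need extra normalization, which is one more reason to look for an existing library or fall back to OCR.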