Build a PDF book index
by markong (Pilgrim)
on Mar 17, 2018 at 11:37 UTC
markong has asked for the wisdom of the Perl Monks concerning the following question:
I need to build a book index (the book is a PDF, format version 1.6) and I'm facing some problems related to the way encodings are "specified" in a PDF file. I've found CAM::PDF to be very useful for extracting the PDF content in textual form, so the plan is to extract each page's text and build the index from that.
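For context, that extraction step currently looks roughly like this (a minimal sketch; "book.pdf" is a placeholder for the real file):

    use strict;
    use warnings;
    use CAM::PDF;

    # "book.pdf" stands in for the actual file (PDF version 1.6)
    my $pdf = CAM::PDF->new('book.pdf') or die "$CAM::PDF::errstr\n";

    for my $page (1 .. $pdf->numPages()) {
        # render the page's content stream as plain text
        my $text = $pdf->getPageText($page);
        print "--- page $page ---\n$text\n";
    }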
I have practically zero knowledge of the inner workings of the PDF format, and a tight schedule at the moment, so I have no time to wade through the 700+ pages of the PDF spec looking for how PDFs store "plain" text. Still, knowing that PDFs are binary files, they "should" not encode text in any particular form, but probably pack all the information into some sort of "structure". From what I've read about PDF files, they usually embed fonts and then map single glyphs to "bytes", usually resulting in some sort of custom encoding. That would explain why I see some characters (e.g. apostrophes and prolonged dashes) come out as gibberish. It would seem that the PDF at hand maps letters to ASCII while the remaining characters are somehow mapped to custom bytes.

For instance, in a hexdump of an extract of the file containing the sentence "The developer, on the other hand, feels like he’s interrupted several times a day for meetings, ", the apostrophe extracted by CAM::PDF shows up at offset 00000020 as the byte 0x80, which if I recall correctly is the euro sign in Windows-1252 (plain ASCII stops at 0x7F, so it can't be an ASCII character at all).
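In case it's useful, the same inspection can be done from Perl (a quick sketch, same placeholder file name as before):

    use strict;
    use warnings;
    use CAM::PDF;

    my $pdf  = CAM::PDF->new('book.pdf') or die "$CAM::PDF::errstr\n";
    my $text = $pdf->getPageText(1);    # the page number here is arbitrary

    # print each byte as hex so oddballs like 0x80 stand out
    printf '%02x ', ord($_) for split //, substr($text, 0, 64);
    print "\n";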
My question, then, is: how can I solve this encoding problem? The keywords to index usually contain only letters, but some could have dashes, and in any case it feels a little dirty to match against text encoded in a custom/unknown format.
Do you know whether PDFs carry the encoding information somewhere, and of any GNU/Linux tool that can inspect the PDF and extract it (assuming the encoding is not custom)? I've seen people suggest opening the file in Acrobat Reader (on Windows), but usually all it shows is "Custom encoding".
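(A note in case it helps anyone answering: if poppler-utils is installed, running "pdffonts book.pdf" should list the embedded fonts with an "encoding" column which, if I understand its output correctly, tells you whether a font uses a standard encoding like WinAnsi or a custom one, no Acrobat needed.)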
Given the situation, I'm thinking of scanning the extracted text for any byte that is *not* an ASCII letter and, if it falls outside the ASCII range, replacing it with the proper ASCII byte value.
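Something like the following is what I have in mind (just a sketch: the mapping table below is guesswork based on the hexdump and would need to be verified byte by byte against the actual file):

    use strict;
    use warnings;

    # hypothetical byte-to-ASCII mapping; 0x80 is the apostrophe seen
    # in the hexdump, the other entry is a pure guess
    my %remap = (
        "\x80" => "'",    # right single quote rendered as byte 0x80
        "\x83" => "-",    # prolonged dash (guess)
    );

    sub sanitize {
        my ($text) = @_;
        # swap known custom bytes for ASCII, flag anything unexpected
        $text =~ s/([\x80-\xFF])/exists $remap{$1} ? $remap{$1} : '?'/ge;
        return $text;
    }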
As a side question: Text-Index is very helpful for the indexing phase, but it lacks a feature to weight the matches on each page (e.g. a given keyword matches 10 times on page 1 but only 2 times on page 10). Does anybody know if there's something on CPAN to help with this? I feel very lazy :).
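Failing that, I suppose counting matches per page by hand isn't too hard; a sketch of what I mean, with placeholder keywords and file name:

    use strict;
    use warnings;
    use CAM::PDF;

    my @keywords = ('interrupt', 'meeting');    # placeholders
    my $pdf = CAM::PDF->new('book.pdf') or die "$CAM::PDF::errstr\n";

    my %index;    # keyword => { page number => match count }
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        for my $kw (@keywords) {
            # count the keyword's occurrences on this page
            my $count = () = $text =~ /\b\Q$kw\E\b/gi;
            $index{$kw}{$page} = $count if $count;
        }
    }

    # for each keyword, list pages ordered by weight (match count)
    for my $kw (sort keys %index) {
        my @pages = sort { $index{$kw}{$b} <=> $index{$kw}{$a} }
                    keys %{ $index{$kw} };
        print "$kw: ", join(', ', map { "$_ ($index{$kw}{$_})" } @pages), "\n";
    }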