PerlMonks  

Parsing Arabic PDF using in perl

by fattahsafa (Novice)
on Mar 07, 2014 at 17:56 UTC
fattahsafa has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need to parse an Arabic PDF file in Perl. Which Perl PDF module supports Unicode? Thanks, Abed

Re: Parsing Arabic PDF using in perl
by runrig (Abbot) on Mar 07, 2014 at 18:57 UTC
      Thank you for your response. Actually, it didn't work. The problem is in supporting Arabic, not in parsing. --Abed.
        Not sure what your problem is then. Can you describe it?
Re: Parsing Arabic PDF using in perl
by LanX (Canon) on Mar 07, 2014 at 23:00 UTC
    Please provide a link to such an unreadable PDF.

    Then try pdftohtml -xml on it.

    And please keep in mind that PDFs which don't use a standard font are not necessarily parsable, because they might embed their own font with arbitrary code points for the glyphs.

    In this case deciphering is only possible with character and word recognition, either automatic (OCR) or human (by populating a hash $glyph{codepoint} for each unknown font).
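    A minimal sketch of the hand-built glyph table idea, assuming the glyph codes have already been pulled out of the PDF as plain integers. The codes 0x21 to 0x24 and the helper name decode_glyphs are invented for illustration; a real table has to be worked out separately for each embedded font.

```perl
use strict;
use warnings;

# Hypothetical table for ONE embedded font: glyph code as found in the PDF
# mapped to the Unicode character it actually draws. These codes are made up.
my %glyph = (
    0x21 => "\x{0633}",    # ARABIC LETTER SEEN
    0x22 => "\x{0644}",    # ARABIC LETTER LAM
    0x23 => "\x{0627}",    # ARABIC LETTER ALEF
    0x24 => "\x{0645}",    # ARABIC LETTER MEEM
);

# Translate a list of glyph codes into a Unicode string.
# Codes missing from the table become U+FFFD (REPLACEMENT CHARACTER),
# so gaps in the table stay visible instead of being silently dropped.
sub decode_glyphs {
    my ($map, @codes) = @_;
    return join '', map { exists $map->{$_} ? $map->{$_} : "\x{FFFD}" } @codes;
}

my $text = decode_glyphs(\%glyph, 0x21, 0x22, 0x23, 0x24);

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

    Keeping one such hash per font matters, since the same code can stand for a different glyph in each embedded font.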

    HTH! (Inshallah =)

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: Parsing Arabic PDF using in perl
by graff (Chancellor) on Mar 08, 2014 at 02:31 UTC
    The use of PDF to present Arabic text can follow at least a few different strategies, none of which bodes well for the extraction of Unicode Arabic text from a PDF file. Some or all of the text may actually be stored as image data rather than as character data, and to the extent that there are portions of text composed of discrete characters, those characters may use numeric assignments that bear no discernible relation to Unicode Arabic code points.
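    One rough way to tell which case a given file falls into is to run whatever extractor is at hand and check how much of its output is actually in the Arabic script. The sketch below assumes the extracted text is already a decoded Perl string; arabic_ratio is a hypothetical helper name, and the two sample strings are made up.

```perl
use strict;
use warnings;

# Fraction of non-whitespace characters that belong to the Arabic script.
# A value near 0 suggests the PDF uses font-specific codes (or images)
# rather than Unicode Arabic code points.
sub arabic_ratio {
    my ($text) = @_;
    my $total = () = $text =~ /\S/g;          # count non-whitespace characters
    return 0 unless $total;
    my $arabic = () = $text =~ /\p{Arabic}/g; # count Arabic-script characters
    return $arabic / $total;
}

my $good = "\x{0633}\x{0644}\x{0627}\x{0645}";  # already Unicode Arabic
my $bad  = '!"#$';                              # font-specific codes read as ASCII

printf "%.2f %.2f\n", arabic_ratio($good), arabic_ratio($bad);
```

    This says nothing about whether the Arabic text is correct or in logical order; it only flags output that is clearly not Arabic at all.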

    I remember spending a few hours one time (a couple years back) trying to find web references that would explain the PDF character encoding scheme for Arabic, but I never succeeded. Of course, I'm ignorant enough about PDF details in general that I can't even assess how inadequate that attempt was.
