
PDF::API2 printing non ascii characters

by Anonymous Monk
on Mar 13, 2018 at 10:13 UTC ( #1210792=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone,

I'm using the module PDF::API2 to try and print some non-ascii characters from a web page. I have a web form where I copy and paste non-ascii characters into an input field. When I look at the submitted input using Dumper, I see numeric codes like these:

$VAR1 = '&#969; &#8734;';

And my pdf generation code:

sub create {
    # $sometext is the submitted input from a webpage
    my $sometext = shift;

    my $pdf   = PDF::API2->new();
    my $fonts = {
        Helvetica => {
            Bold   => $pdf->corefont('Helvetica-Bold',    -encoding => 'latin1'),
            Roman  => $pdf->corefont('Helvetica',         -encoding => 'latin1'),
            Italic => $pdf->corefont('Helvetica-Oblique', -encoding => 'latin1'),
        },
    };

    my $page = $pdf->page();
    my $text = $page->text();
    $text->font($fonts->{'Helvetica'}->{Roman}, 20);
    $text->translate(50, 700);
    $text->text($sometext);

    $pdf->saveas('test.pdf');
}

How do I print these as actual non-ascii characters (ω ∞) in the pdf output? Do I need to convert them first? What about "use utf8"? Do I need that line?

Please help :)))

Replies are listed 'Best First'.
Re: PDF::API2 printing non ascii characters
by vr (Hermit) on Mar 13, 2018 at 12:38 UTC

    I started writing this before thanos1983's update (which also uses 'DejaVuSans.ttf' as an OK modern font :) ), so, FWIW:

    The "core" Helvetica font uses a single-byte built-in encoding, which doesn't include Greek characters.
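To make the point about the single-byte encoding concrete, here is a minimal sketch that checks whether a string could be stored in latin1 at all, which is the encoding the question requests via `-encoding=>'latin1'`. The helper name `fits_latin1` is made up for illustration and is not part of PDF::API2; it relies only on the core Encode module.

```perl
use strict;
use warnings;
use utf8;                      # assumes this source file is saved as UTF-8
use Encode qw(encode);

# Hypothetical helper (not part of PDF::API2): returns 1 if every
# character of $chars can be represented in ISO-8859-1 (latin1), else 0.
sub fits_latin1 {
    my ($chars) = @_;
    my $copy = $chars;         # encode() with a CHECK flag may modify its input
    # FB_CROAK makes encode() die on the first unmappable character.
    return eval { encode('iso-8859-1', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

print "e-acute:         ", (fits_latin1('é')   ? 'fits' : 'does not fit'), "\n";
print "omega, infinity: ", (fits_latin1('ω ∞') ? 'fits' : 'does not fit'), "\n";
```

The second string fails the check, so no amount of converting the input will help while the font is limited to a latin1 encoding.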

    In fact, these days it is not advisable to use any of the 14 Adobe "core" (not-to-be-embedded) fonts. They belong to the era of 20+ years ago, when storage space was at a premium. That holds even if you think you produce (and consume) PDFs in a very controlled, ASCII-only environment.

    That said, the "core" font which contains Greek and other math characters is called 'Symbol'. You give normal, utf8 Perl strings as arguments to PDF::API2 methods, everything will be encoded for you automatically.

    use strict;
    use warnings;
    use utf8;
    use PDF::API2;
    
    my $pdf  = PDF::API2-> new;
    my $page = $pdf-> page;
    my $text = $page-> text;
    
    my $core_font = $pdf-> corefont( 'Symbol' );
    
    $text-> font( $core_font, 20 );
    $text-> translate( 50, 700 );
    $text-> text( 'ω ∞' );
    
    $pdf-> saveas( 'test.pdf' );
    

    The output is a 5 KB file which, in addition to the necessary overhead, contains a lot of bloat. PDF::API2 doesn't handle "core" fonts quite optimally. Let's insert this before the last line:

    delete @$core_font{ qw/
        Encoding
        FirstChar
        LastChar
        Name
        Widths
    /};
    
    

    The output is now 986 bytes. The bad part, however, is that while the PDF looks OK on-screen, text extraction (e.g. copy-paste to Notepad) is broken in both cases above when I check with Adobe Reader DC (i.e. the latest): garbage is copied. Maybe Adobe doesn't care about "core" Symbol any more. However, both Firefox and Edge extract the Greek symbols correctly.

    The right way is to use embeddable, modern TrueType fonts with good Unicode support, i.e. a large repertoire of code points. Again, give "utf8 Perl strings as arguments to PDF::API2 methods, everything will be encoded for you automatically".

    use strict;
    use warnings;
    use feature 'say';
    use utf8;
    use PDF::API2;
    
    my $pdf  = PDF::API2-> new;
    my $page = $pdf-> page;
    my $text = $page-> text;
    
    my $ttf_font = $pdf-> ttfont( 'DejaVuSans.ttf' );
    
    $text-> font( $ttf_font, 20 );
    $text-> translate( 50, 700 );
    $text-> text( 'ω ∞ latin אב езя юя' );
    
    $pdf-> saveas( 'test.pdf' );
    

    Here the text string sports Greek, (extended-)Latin, Hebrew and Cyrillic characters. It displays OK on-screen, and text can be extracted even with backward Reader DC. File size is 55 KB, however.
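
On the question's "Do I need use utf8?": as far as I understand it, the pragma only tells Perl that the source file itself is UTF-8, so it matters for the literal strings in the two scripts above but not for text that arrives at run time from a web form. A small sketch of the difference, assuming the file is saved as UTF-8 (the variable names are only for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same literal compiled with and without the pragma.
# "use utf8" is lexically scoped, so each do-block gets its own setting.
my $chars = do { use utf8; 'ω ∞' };   # parsed as 3 characters
my $bytes = do { no utf8;  'ω ∞' };   # parsed as 6 UTF-8 bytes

print length($chars), "\n";   # prints 3
print length($bytes), "\n";   # prints 6

# Decoding the bytes by hand yields the same character string.
my $decoded = decode('UTF-8', $bytes);
print $decoded eq $chars ? "same\n" : "different\n";   # prints same
```

Input read from a form is in the position of `$bytes` here, so it needs an explicit decode step regardless of whether the script says `use utf8`.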

Re: PDF::API2 printing non ascii characters
by thanos1983 (Vicar) on Mar 13, 2018 at 10:29 UTC

    Hello Anonymous Monk,

    One possible way could be with HTML::Entities.

    Sample of code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Entities;
    use open ':std', ':encoding(UTF-8)';
    
    my $html = "Character one: &#969; character two: &#8734;";
    print decode_entities($html), "\n";
    
    __END__
    
    $ perl test.pl
    Character one: ω character two: ∞
    

    Update: adding a complete answer. The sample code is from PDF::API2 / unicode characters. The solution to your problem is to use the appropriate font method. From the documentation, PDF::API2/FONT_METHODS:

    FONT METHODS

    @directories = PDF::API2::addFontDirs($dir1, $dir2, ...)
        Adds one or more directories to the search path for finding font files. Returns the list of searched directories.

    $font = $pdf->corefont($fontname, [%options])
        Returns a new Adobe core font object.

    In my sample of code I only use one, but if you follow the documentation you can add more. I downloaded the fonts from DejaVu Fonts.

    Sample of working code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use PDF::API2;
    use HTML::Entities;

    # Create a blank PDF file
    my $pdf = PDF::API2->new();

    # Add a blank page
    my $page = $pdf->page();

    my $font = $pdf->ttfont('DejaVuSans.ttf');

    # Add some text to the page
    my $text = $page->text();
    $text->font($font, 20);
    $text->translate(80, 710);

    my $html = "Character one: &#969; character two: &#8734;";
    my $decoded_string = decode_entities($html);
    $text->text($decoded_string);

    # Save the PDF
    $pdf->saveas('test.pdf');

    Let us know if this works for you. BR / Thanos.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      What if the submitted html input is "%CF%89%20%E2%88%9E" (ω ∞) instead of the numeric codes below?

      &#969; &#8734;

      How do I decode that before handing over to the pdf text method?

        Hello again Anonymous Monk,

        In this case you can use URI::Escape. See the sample below:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use URI::Escape;
        use feature 'say';
        
        my $str = "Character one: ω character two: ∞";
        my $hex_code = uri_escape( $str );
        say $hex_code;
        
        my $string = uri_unescape( $hex_code );
        say $string;
        
        __END__
        
        $ perl test.pl
        Character%20one%3A%20%CF%89%20character%20two%3A%20%E2%88%9E
        Character one: ω character two: ∞
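
One thing to watch when the result is meant for PDF::API2: uri_unescape returns bytes, so the UTF-8 sequence still has to be decoded into characters before it matches the "utf8 Perl strings" the other reply talks about. A minimal sketch, assuming the form input is percent-encoded UTF-8; the helper name `form_value_to_chars` is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);
use URI::Escape qw(uri_unescape);

# Hypothetical helper: turn percent-encoded UTF-8 form input into a
# Perl character string.
sub form_value_to_chars {
    my ($escaped) = @_;
    my $bytes = uri_unescape($escaped);   # "%CF%89" becomes the two bytes CF 89
    return decode('UTF-8', $bytes);       # those two bytes become one character, U+03C9
}

my $chars = form_value_to_chars('%CF%89%20%E2%88%9E');

# Show the code points rather than the characters, to keep the output ASCII.
printf "%d characters: %s\n",
    length($chars),
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
# prints: 3 characters: U+03C9 U+0020 U+221E
```

The resulting `$chars` is what would be handed to `$text->text(...)`, together with a TrueType font such as the DejaVuSans.ttf used elsewhere in this thread.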
        

        Hope this helps, BR.


      It works (bouncing!!!)

      Thank you so much!!!

Re: PDF::API2 printing non ascii characters
by ablanke (Curate) on Mar 13, 2018 at 12:40 UTC
