PerlMonks  

Perl Encoding/Decoding Doubt: From a Novice

by ppremkumar (Novice)
on Jul 04, 2013 at 11:00 UTC ( [id://1042435]=perlquestion )

ppremkumar has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Team

I am at a loss to understand the encoding step that Perl requires when writing results to the command prompt or to a file.

In the code below, the first portion prints just fine. In portion 2, however, once I add an em dash or the set of characters "ĀǎỠĨǒAder", the output is junk. (Yes, I want to print ĀǎỠĨǒAder as is.)

    use warnings;
    use strict;
    use Encode qw(encode decode);

    # portion 1
    my $str = 'Çirçös';
    $str = decode('utf-8', $str);
    print "$str\n";

    # portion 2
    my $str1 = 'Çirçös—'; # HTML entity (decimal) for em dash: &#8212;
    $str1 = decode('utf-8', $str1);
    print "$str1\n";

    # output from my Eclipse editor
    # Çirçös
    # Çirçös—
    # Wide character in print at D:/EPIC_workspace/PERL/Bibliography/test.pl line 10.

Please help me understand what I am doing wrong.

What I am really trying to do is read a Microsoft Word file that has special characters and store that data into a text file.

Thanks,

Prem

UPDATE: I have found a solution to my problem: http://www.lemoda.net/perl/win32-ole-utf8/cp-utf8-ole.html

I had to import the CP_UTF8 constant from Win32::OLE and set the code page of Win32::OLE to CP_UTF8.

Now, even if my Microsoft Word files have special characters such as "Aderñŋšžľŀīửừứ", I can read each line of the Word file and save it in a text file without losing characters. I thank each of you for your help and time. Greatly appreciated.

    # Get the constant.
    use Win32::OLE 'CP_UTF8';

    # Set the code page of Win32::OLE.
    $Win32::OLE::CP = CP_UTF8;
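For the "save it in a text file" half of the task, a minimal sketch follows. It assumes the lines have already been decoded into Perl character strings; the helper name write_utf8_lines and the file name word_text.txt are hypothetical and do not come from the linked solution. The :encoding(UTF-8) layer on the output handle does the encoding, so no explicit encode call is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: write already-decoded lines to a UTF-8 text file.
sub write_utf8_lines {
    my ($path, @lines) = @_;

    # The :encoding(UTF-8) layer turns characters into UTF-8 bytes on output.
    open(my $out, '>:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    print {$out} "$_\n" for @lines;
    close($out) or die "Cannot close $path: $!";
    return;
}

# Characters are written as escapes so the sketch does not depend on the
# encoding of this source file: "Ader" plus n-tilde, then an em dash.
write_utf8_lines('word_text.txt', "Ader\x{F1}", "\x{2014}");
```

Because the layer is attached to the handle, printing a character above 255 (such as the em dash) to it should not raise the "Wide character" warning.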

Replies are listed 'Best First'.
Re: Perl Encoding/Decoding Doubt: From a Novice
by hippo (Bishop) on Jul 04, 2013 at 11:13 UTC

    You are decoding from your source OK, but the problem is that you are not encoding to the target. Specifically, you are trying to print decoded data (characters) rather than encoded data (bytes), hence the "Wide character" message.
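A minimal sketch of that decode-on-input, encode-on-output pattern, assuming the source bytes are UTF-8 and the target also expects UTF-8. The variable names are illustrative, and the input is written as byte escapes so the example does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "Çirçös" followed by an em dash.
my $bytes = "\xC3\x87ir\xC3\xA7\xC3\xB6s\xE2\x80\x94";

# Decode on input: bytes become a string of characters.
my $chars = decode('UTF-8', $bytes);

# Work with $chars here (length, regexes, and so on count characters).

# Encode on output: characters become bytes again before printing.
my $out = encode('UTF-8', $chars);

print length($chars), " characters, ", length($out), " bytes\n";
print $out, "\n";
```

Printing $out instead of $chars hands print a byte string, which is why the "Wide character" warning should not appear here.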

    I note from the path "D:/EPIC_workspace/..." that you are probably on a Windows box, so there may be other OS-specific things going on there too, of which we non-Windows users are blissfully unaware.

    HTH, hippo

      Thank you, hippo

Re: Perl Encoding/Decoding Doubt: From a Novice
by Loops (Curate) on Jul 04, 2013 at 11:43 UTC

    To avoid some Unicode bugs, the unicode_strings feature is recommended. For UTF-8 input and output, use the pragma use open ":encoding(UTF-8)". However, it is probably wise to respect the local configuration, which is what I do instead in the code below. Lastly, you need to tell the Perl interpreter that the literal strings in your source code are UTF-8, via "use utf8;".

    So with any luck, the code below should work:

    use strict;
    use warnings;
    use feature 'unicode_strings';
    use open ":locale";
    use utf8;

    print "Çirçös\n";
    print "Çirçös—\n";
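Where the :locale layer is not available, the explicit-encoding variant mentioned above can be sketched as follows. This assumes the script file is saved as UTF-8 and that the terminal or console displays UTF-8 (on Windows that may require switching the console code page, for example with chcp 65001, which is an assumption about the setup rather than something confirmed in this thread).

```perl
use strict;
use warnings;
use utf8;                              # string literals in this file are UTF-8
use open qw(:std :encoding(UTF-8));    # STDIN, STDOUT, STDERR and new handles use UTF-8

# With "use utf8", each literal is a string of characters, not bytes.
my @lines = ("Çirçös", "Çirçös—");

# The UTF-8 layer on STDOUT encodes the characters as they are printed.
print "$_\n" for @lines;
```

The :std keyword is what applies the layer to the standard handles; without it, the pragma only affects handles opened later in the same lexical scope.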

      I should have told you this earlier: I am using Windows 7, and the use open ":locale"; pragma does not work for me here.

Re: Perl Encoding/Decoding Doubt: From a Novice
by Khen1950fx (Canon) on Jul 04, 2013 at 14:01 UTC
    Since you are reading a Word file, you want to use the reverse: encode_utf8 and encode instead of decode. I use binmode to print the set of characters 'as is'.
    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use Encode qw(encode encode_utf8);
    $| = 1;

    my $str = "Çirçös";
    $str = encode_utf8($str);
    print $str;

    my $str1 = "Çirçös—";
    $str1 = encode('UTF-8', $str1);
    binmode STDOUT, ":encoding(UTF-8)";
    print $str1;

      I tried the coding you provided, but I get the below output.

      Çirçös Çirçös—

      If it helps, I am using Windows: Windows 7 Professional 32-bit (6.1, Build 7601) Service Pack 1 (7601.win7sp1_gdr.130318-1533)

        My bad. I was thinking of Linux. How about this:
        #!/usr/bin/perl -l
        use warnings;
        use strict;
        use Encode qw(encode_utf8);
        $| = 1;

        binmode STDIN, ":encoding(UTF-8)";
        my $str = 'Çirçös';
        $str = encode_utf8($str);
        print "$str";

        my $str1 = 'Çirçös—';
        $str1 = encode_utf8($str1);
        binmode STDOUT, ":encoding(UTF-8)";
        print "$str1";
        I'm on Fedora, so I'm flying blind. Does that help?
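One thing worth checking with this approach is double encoding: calling encode_utf8 on a string and then printing it through an :encoding(UTF-8) layer encodes the data twice. The sketch below is an illustration only, not part of the reply above. It models the layer's second pass with a second encode_utf8 call and compares lengths; the characters are written as escapes so the result does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# "Çirçös" followed by an em dash, as a string of characters.
my $chars = "\x{C7}ir\x{E7}\x{F6}s\x{2014}";

# Encoded once: the bytes a UTF-8 terminal or file expects.
my $once = encode_utf8($chars);

# Encoded twice: each byte of $once is treated as a character and encoded
# again, which is what an :encoding(UTF-8) layer does to already-encoded data.
my $twice = encode_utf8($once);

printf "characters: %d, encoded once: %d bytes, encoded twice: %d bytes\n",
    length($chars), length($once), length($twice);
```

If the byte count grows on the second pass, the output contains extra bytes that a UTF-8 viewer shows as junk. The usual remedy is to encode exactly once: either call encode_utf8 and print to a handle without an encoding layer, or keep the string as characters and let the layer do the encoding.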

Node Type: perlquestion [id://1042435]
Approved by marto