files saved in unicode are not being read correctly

by abhishes (Friar)
on Jun 30, 2002 at 11:09 UTC

abhishes has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

I created a simple text file and saved it twice: once in ASCII format and once in Unicode format.

Now, when I run this simple program:

#!/usr/bin/perl

open MYFILE1, "c:\\junk\\file-ansi.txt" or die;
open MYFILE2, "c:\\junk\\file-unicode.txt" or die;

print "going to print the asci file\n";
print "------------------------------------------- \n";
while (<MYFILE1>) {
    print "$_";
}

print "\n\ngoing to print the unicode file\n";
print "------------------------------------------- \n";
while (<MYFILE2>) {
    print "$_";
}

close MYFILE1;
close MYFILE2;

I see the following output:

going to print the asci file
-------------------------------------------
my name is abhishek

going to print the unicode file
-------------------------------------------
■m y n a m e i s a b h i s h e k

Why is the Unicode file not being read correctly (is it multi-byte?), and how can I make its output appear just like the ASCII output?

regards, Abhishek.

Replies are listed 'Best First'.
Re: files saved in unicode are not being read correctly
by amphiplex (Monk) on Jun 30, 2002 at 12:16 UTC
    Try Unicode::Map:
    NAME
        Unicode::Map V0.112 - maps charsets from and to utf16 unicode

    SYNOPSIS
        use Unicode::Map();

        $Map = new Unicode::Map("ISO-8859-1");

        $utf16 = $Map -> to_unicode ("Hello world!");
          => $utf16 == "\0H\0e\0l\0l\0o\0 \0w\0o\0r\0l\0d\0!"

        $locale = $Map -> from_unicode ($utf16);
          => $locale == "Hello world!"

    You could then write something like this to read your Unicode file (if it is UTF-16):

    use strict;
    use Unicode::Map;

    my $Map = new Unicode::Map({ ID => "ISO-8859-1" });

    while (<>) {
        print $Map->from_unicode($_);
    }
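
    One caveat, offered as a sketch rather than a fix to the code above: the synopsis shows big-endian UTF-16 ("\0H\0e..."), while files saved as "Unicode" on Windows are typically little-endian and start with the bytes FF FE. Reading line by line also splits on the single byte 0x0A, which can leave half of a two-byte character on each side of the split. The following module-free approach reads the whole file in binary mode and decodes it with unpack. The helper name utf16_to_latin1 is hypothetical, characters above 0xFF are replaced with '?', and surrogate pairs are not handled.

```perl
#!/usr/bin/perl
use strict;

# Hypothetical helper: turn a UTF-16 byte string into a Latin-1 string.
# Without a BOM it assumes little-endian (an assumption, not a rule).
sub utf16_to_latin1 {
    my ($bytes) = @_;
    my $template = 'v*';                                  # 16-bit little-endian
    if    ($bytes =~ s/^\xFF\xFE//) { $template = 'v*' }  # UTF-16LE BOM
    elsif ($bytes =~ s/^\xFE\xFF//) { $template = 'n*' }  # UTF-16BE BOM
    my @code_points = unpack $template, $bytes;
    # Keep what fits in one byte, mark everything else with '?'
    return join '', map { $_ < 256 ? chr($_) : '?' } @code_points;
}

# Usage: perl script.pl c:\junk\file-unicode.txt
if (@ARGV) {
    my $file = shift @ARGV;
    open(IN, "< $file") or die "Cannot open $file: $!";
    binmode IN;                      # no newline translation on raw bytes
    my $raw = do { local $/; <IN> }; # slurp the whole file
    close IN;
    my $text = utf16_to_latin1($raw);
    $text =~ s/\r\n/\n/g;            # normalise line endings after decoding
    print $text;
}
```

    Decoding the whole buffer first and splitting into lines afterwards avoids cutting a character in half.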

    ---- kurt

      Thanks for your help, kurt. I need help with one more thing.

      In my application I get the list of files to open from the glob function.

      So before opening a file I need to know whether to use Unicode::Map or the normal mode (depending on whether the file was saved in ASCII or Unicode).

      Is it possible to find that out? And what would the impact be if I used Unicode::Map for all files returned by the glob function?

      thanks for your help.

        There's no 100% solution to your question, because even a Unicode file may be treated as binary data, or even as ASCII.

        AFAIK a Unicode (UTF-16) file usually starts with a byte order mark, "\xFE\xFF" (big-endian) or "\xFF\xFE" (little-endian), but this is not always true.
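
        The check described above can be sketched as follows. This is only a heuristic under the stated caveat: a file without a byte order mark tells you nothing, so "no BOM" is not proof of ASCII. The helper name sniff_bom and the glob pattern are illustrative assumptions, and the UTF-8 signature is included as an extra case not mentioned above.

```perl
#!/usr/bin/perl
use strict;

# Hypothetical helper: classify a file by its leading byte order mark.
# Returns 'UTF-16BE', 'UTF-16LE', 'UTF-8', or '' when no BOM is present.
sub sniff_bom {
    my ($path) = @_;
    local *FH;
    open(FH, "< $path") or die "Cannot open $path: $!";
    binmode FH;                 # read raw bytes, no newline translation
    my $head = '';
    read(FH, $head, 3);         # a shorter file simply yields fewer bytes
    close FH;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return '';
}

# Decide per file for a list coming from glob (pattern is an example only)
for my $file (glob '*.txt') {
    next unless -f $file;
    my $bom = sniff_bom($file);
    print "$file: ", ($bom ? "BOM says $bom" : "no BOM, reading as plain bytes"), "\n";
}
```

        Only the files that report a UTF-16 mark would then go through Unicode::Map; the rest can be read in the normal way.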

        If the files are created by your own program, you could use a naming convention: for example, give Unicode text files the extension ".utxt" and ASCII files just ".txt".

        Courage, the Cowardly Dog.

        sorry, I cannot answer this, anyone else ?

        ---- kurt
Re: files saved in unicode are not being read correctly
by Courage (Parson) on Jun 30, 2002 at 12:30 UTC
    If you want to use the new features of perl-5.8.0 and read Unicode data directly, you should tell perl that your file is Unicode:

    open(my $fh,'<:utf8', 'anything');
    my $line_of_unicode = <$fh>;

    open(my $fh,'<:encoding(Big5)', 'anything');
    my $line_of_unicode = <$fh>;

    I took these code samples from perluniintro.pod, which is a great place to start reading.
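
    For the UTF-16 file from the question, a minimal sketch along the same lines might look like this. It assumes perl 5.8 or later and a file that starts with a byte order mark, which the UTF-16 encoding uses to pick the byte order. The helper name read_utf16_file is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole UTF-16 text file (with BOM) into a
# Perl character string.
sub read_utf16_file {
    my ($path) = @_;
    # :raw keeps CRLF translation away from the undecoded bytes;
    # :encoding(UTF-16) decodes them using the BOM for the byte order.
    open(my $fh, '<:raw:encoding(UTF-16)', $path)
        or die "Cannot open $path: $!";
    my $text = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    $text =~ s/\r\n/\n/g;                # normalise line endings after decoding
    return $text;
}

# Usage: perl script.pl c:\junk\file-unicode.txt
if (@ARGV) {
    binmode STDOUT, ':utf8';             # choose whatever your console expects
    print read_utf16_file($ARGV[0]);
}
```

    Converting line endings after decoding, rather than with a :crlf layer underneath the encoding layer, keeps the translation from touching individual bytes of two-byte characters.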

    And yes, if you're not ready to move to perl-5.8.0 yet, then use Unicode::Map module to solve your task.

    Courage, the Cowardly Dog.

      I am using Perl 5.6 release 631 from ActiveState. Where can I get ActiveState Perl 5.8? I searched Google, but 5.8 RC2 is available only on does that one have the Win32::OLE modules as well?

        In that case you had better wait until 5.8.0 is available and ActiveState prepares an "official" build for the Win32 platform. For now, just use Unicode::Map!

        Courage, the Cowardly Dog.

Re: files saved in unicode are not being read correctly
by amphiplex (Monk) on Jul 01, 2002 at 09:07 UTC

      Thank you so much kurt.