Tie::File failing with Unicode/UTF-8 encoding?

by HelenCr (Monk) on Nov 03, 2012 at 13:32 UTC
HelenCr has asked for the wisdom of the Perl Monks concerning the following question:

Dear highly esteemed PerlMonks

Update: how do I make the PerlMonks web site show the foreign fonts, instead of the HEX?

I am working on a project which deals with data in foreign languages. My Perl scripts were running fine.

I then wanted to use Tie::File, since this is a neat concept (and saves time and coding).

It seems that Tie::File is failing under Unicode/UTF-8 (unless I am missing something).

Here is a program that demonstrates the problem (the data is a mix of English, Greek and Hebrew):

    use strict;
    use warnings;
    use 5.014;
    use Win32::Console;
    use autodie;
    use warnings qw< FATAL utf8 >;
    use Carp;
    use Carp::Always;
    use utf8;
    use feature qw< unicode_strings>;
    use charnames qw< :full>;
    use Tie::File;

    my ($i);
    my ( $FileName);
    my (@Tied);

    binmode STDOUT, ':unix:utf8';
    binmode STDERR, ':unix:utf8';
    binmode $DB::OUT, ':unix:utf8' if $DB::OUT;   # for the debugger
    Win32::Console::OutputCP(65001);              # Set the console code page to UTF8

    $FileName = 'E:\\My Documents\\Technical\\Perl\\Eclipse workspace\\FIBI OCR\\Work\\'.
        'Tie File test res.txt';

    tie @Tied, 'Tie::File', $FileName, recsep => "\x0D\x0A", discipline => ':encoding(utf8)'
        or confess 'tie @Tied failed';

    $i =0;
    while (<DATA>) {
        chomp;
        $Tied[$i] = $_;
        ++$i;
    } # end while (<DATA>)

    $i =0;
    foreach (@Tied) {
        say "$i $Tied[$i]";
        ++$i;
    } # end foreach (@Tied)

    untie $FileName;

    __DATA__
    &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
    &#960;&#940;&#961;&#964;&#949; &#964;&#959; &#942; &#945;&#966;&#942;&#963;&#964;&#949; &#964;&#959;
    &#1513;&#1500;&#1493;&#1501; &#1495;&#1489;&#1512;&#1497;&#1501;
    abc &#1500;&#1488; &#1499;&#1503;&#1499;&#1503; efg
    &#1502;&#1514;&#1497; &#1493;&#1500;&#1488;&#1503; This is it
    &#1502;&#1506;&#1499;&#1513;&#1497;&#1493; &#1500;&#1506;&#1499;&#1513;&#1497;&#1493;
    &#931;&#942;&#956;&#949;&#961;&#945; &#949;&#943;&#957;&#945;&#953; &#932;&#961;&#943;&#964;&#951;
    &#920;&#941;&#955;&#969; &#957;&#945; &#966;&#940;&#969;
    &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
    &#1513;&#1493;&#1512;&#1492; &#1502;&#1505;' 5

This produces a huge cascade of warnings; here are some of them:

    utf8 "\xCE" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;') called at tie file test.pl line 31
    utf8 "\xCF" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;') called at tie file test.pl line 31
    utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;') called at tie file test.pl line 31
    utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;') called at tie file test.pl line 31

Then it prints this on STDOUT:

    0 &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
    1 &#960;&#940;&#961;&#964;&#949; &#964;&#959; &#942; &#945;&#966;&#942;&#963;&#964;&#949; &#964;&#959;
    2 &#1513;&#1500;&#1493;&#1501; &#1495;&#1489;&#1512;&#1497;&#1501;
    3 abc &#1500;&#1488; &#1499;&#1503;&#1499;&#1503; efg
    4 &#1502;&#1514;&#1497; &#1493;&#1500;&#1488;&#1503; This is it
    5 &#1502;&#1506;&#1499;&#1513;&#1497;&#1493; &#1500;&#1506;&#1499;&#1513;&#1497;&#1493;
    6 &#931;&#942;&#956;&#949;&#961;&#945; &#949;&#943;&#957;&#945;&#953; &#932;&#961;&#943;&#964;&#951;
    7 &#920;&#941;&#955;&#969; &#957;&#945; &#966;&#940;&#969;
    8 &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
    9 &#1513;&#1493;&#1512;&#1492; &#1502;&#1505;' 5
    10
    11
    12
    13
    14 \xA4\x&#920;&#941;&#955;&#969;\xA8\x
    15
    16
    17
    18
    19

Note that the first 10 lines (0 through 9) are OK, but lines 10 through 19 came from nowhere!?

In addition, the output file contains corrupted data:

    &#964;&#953; &#954;&#940;&#957;&#975;N&#847;&#334;&#1376;&#964;&#942;&#963;&#964;&#949; &#1513;&#1500; &#1495;&#1489;&#1512;&#1569;bc &#1500;&#1559;&#1815;&#2071;&#1815;&#2016;e&#1502;&#1514;&#1493;&#1500;&#1488;&#1503; This is &#1502;&#1506;&#1497;&#1493; &#1500;&#1506;&#1499;&#1550;&#270;&#974;&#1870;&#1423;&#957;&#945;&#953; &#932;&#961;&#920;&#941;&#974;&#1934;&#1120;&#966;&#975;&#334;&#1632;&#954;&#964;&#949;;&#1513;&#1512;&#1492; &#1502;&#1505;' \xA4\x&#920;&#941;&#955;&#969;\xA8\x

Something is very wrong here. Either I am missing something, or Tie::File can't cope with Unicode/UTF-8?

I am running Strawberry Perl 5.14 on a Windows 7 system.

Many TIA - Helen

Note: cross-posted on http://stackoverflow.com/questions/13209474/

Re: Tie::File failing with Unicode/UTF-8 encoding?
by Kenosis (Priest) on Nov 03, 2012 at 15:34 UTC

    Cross-posted on StackOverflow.

    "It is considered polite to inform about crossposting so people not attending both sites do not waste their time solving a problem already closed at the other end of the internet."

    My apologies. I failed to see your notice at the bottom of your posting...

      The OP did note that she cross-posted.

        Indeed. I failed to note her note. Corrected. Thanks...

Re: Tie::File failing with Unicode/UTF-8 encoding?
by Kenosis (Priest) on Nov 03, 2012 at 16:38 UTC

    Running your code produces the following console output on my machine (Windows 7, ActivePerl):

        0 &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
        1 &#960;&#940;&#961;&#964;&#949; &#964;&#959; &#942; &#945;&#966;&#942;&#963;&#964;&#949; &#964;&#959;
        2 &#1513;&#1500;&#1493;&#1501; &#1495;&#1489;&#1512;&#1497;&#1501;
        3 abc &#1500;&#1488; &#1499;&#1503;&#1499;&#1503; efg
        4 &#1502;&#1514;&#1497; &#1493;&#1500;&#1488;&#1503; This is it
        5 &#1502;&#1506;&#1499;&#1513;&#1497;&#1493; &#1500;&#1506;&#1499;&#1513;&#1497;&#1493;
        6 &#931;&#942;&#956;&#949;&#961;&#945; &#949;&#943;&#957;&#945;&#953; &#932;&#961;&#943;&#964;&#951;
        7 &#920;&#941;&#955;&#969; &#957;&#945; &#966;&#940;&#969;
        8 &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
        9 &#1513;&#1493;&#1512;&#1492; &#1502;&#1505;' 5

    The tied-file's contents:

        &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
        &#960;&#940;&#961;&#964;&#949; &#964;&#959; &#942; &#945;&#966;&#942;&#963;&#964;&#949; &#964;&#959;
        &#1513;&#1500;&#1493;&#1501; &#1495;&#1489;&#1512;&#1497;&#1501;
        abc &#1500;&#1488; &#1499;&#1503;&#1499;&#1503; efg
        &#1502;&#1514;&#1497; &#1493;&#1500;&#1488;&#1503; This is it
        &#1502;&#1506;&#1499;&#1513;&#1497;&#1493; &#1500;&#1506;&#1499;&#1513;&#1497;&#1493;
        &#931;&#942;&#956;&#949;&#961;&#945; &#949;&#943;&#957;&#945;&#953; &#932;&#961;&#943;&#964;&#951;
        &#920;&#941;&#955;&#969; &#957;&#945; &#966;&#940;&#969;
        &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
        &#1513;&#1493;&#1512;&#1492; &#1502;&#1505;' 5

    I didn't get the errors you're reporting.

    On a separate topic, Tie::File works well with smaller files, but performance significantly slows with files > 10M, in case you may be working with such.

      Kenosis: this is very interesting.

      1. Is there a quick way to convert back the PerlMonks output, so that Unicode/UTF8 editors can show it in the original encoding?

      2. It seems that in your setup (apparently, the difference is ActivePerl vs. StrawberryPerl/DWIMPerl that I am using), the problem does not occur. Could the cause be the Tie::File code? StrawberryPerl uses $VERSION = "0.97_02"; Which version is ActivePerl using?

        1. I don't know about these things... Yet...
        2. ActivePerl uses $VERSION = "0.98" for Tie::File. Perhaps upgrading the Module is worth a try (or tie).
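On question 1 above (turning the PerlMonks rendering back into real characters): the site shows non-Latin text in code blocks as numeric character references such as `&#964;`, and those can be converted back with a substitution. The following is only a minimal sketch. The helper name `decode_numeric_entities` is made up for illustration, and it does not try to undo the "+" line-wrap markers that PerlMonks inserts into long code lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not from this thread): replace hexadecimal and
# decimal numeric character references with the characters they name.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;   # hex form, e.g. &#x3c4;
    $text =~ s/&#([0-9]+);/chr($1)/ge;               # decimal form, e.g. &#964;
    return $text;
}

# Encode characters as UTF-8 when printing them.
binmode STDOUT, ':encoding(UTF-8)';

# First record of the sample data, as shown on the page.
my $line = '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;';
print decode_numeric_entities($line), "\n";
```

Run over a saved copy of the data, this gives text that a UTF-8 aware editor can display directly.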
Re: Tie::File failing with Unicode/UTF-8 encoding?
by Anonymous Monk on Nov 03, 2012 at 16:55 UTC

    Update: how do I make the PerlMonks web site show the foreign fonts, instead of the HEX?

    You can't, sorry

    The good news is you don't really need to :) simply post proper Dumps ;)

    perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " input.txt > input_txt_as_perl.pl

Re: Tie::File failing with Unicode/UTF-8 encoding?
by Anonymous Monk on Nov 04, 2012 at 00:51 UTC

    I don't think it is supported. The option you used is called discipline, which is Perl 5.6 terminology; that feature is now called PerlIO layers.

    Also, that option is not documented in Tie::File. And in 5.6 it only supported :raw and :crlf, as :encoding did not exist. You should file a bug report (did I say that?)
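For readers who have not met PerlIO layers: a layer is named in the mode argument of `open` (or passed to `binmode`) and translates between characters in the program and bytes in the file. The sketch below uses a scratch file from File::Temp and a made-up five-character string; neither comes from the thread. It also shows that the character count and the byte count differ once non-ASCII text is involved. Whether that gap is what trips up Tie::File here is an assumption, not something this thread establishes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Five characters: two Greek letters, a space, two Hebrew letters.
my $text = "\x{3c4}\x{3b9} \x{5e9}\x{5dc}";

# Scratch file used only for this illustration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# The layer in the open mode encodes characters to UTF-8 on output.
open my $out, '>:encoding(UTF-8)', $path or die "open $path: $!";
print {$out} $text;
close $out or die "close $path: $!";

# Reading through the same layer decodes the bytes back to characters.
open my $in, '<:encoding(UTF-8)', $path or die "open $path: $!";
my $round_trip = do { local $/; <$in> };
close $in;

my $chars = length $text;   # length() counts characters
my $bytes = -s $path;       # -s reports the file size in bytes

printf "characters: %d, bytes on disk: %d\n", $chars, $bytes;
```

Each Greek and Hebrew letter takes two bytes in UTF-8, so the file is larger than the string is long.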

    But you can work around this limitation by doing your own encoding/decoding. This seems to work

        #!/usr/bin/perl --
        use Path::Class;
        use constant THISFILE => file( __FILE__ )->absolute->stringify;
        use constant THISDIR  => file( THISFILE )->dir->stringify;
        use strict;
        use warnings;
        use Tie::File;
        use Fcntl 'O_RDWR', 'O_CREAT';
        use Encode;

        chdir THISDIR or die Fudge( 'chdir', THISDIR );
        Main( @ARGV );
        exit( 0 );

        sub Main {
            binmode STDOUT;
            my $FileName = "".file( THISFILE . '.tf.utf8.txt' );
            tie my(@Tied), 'Tie::File', $FileName,
                recsep => "\x0D\x0A",
                mode   => O_RDWR | O_CREAT,
              or die Fudge( tie => $FileName );
            if( @Tied > 2 ){
                print 'size ', int @Tied, "\n";
                print map { decode('UTF-8', $_ ); "($_)" } @Tied;
            } else {
                push @Tied, encode('UTF-8', chr $_) for 272 .. 30000;
            }
            untie @Tied;
        }

        sub Fudge {
            use Errno();
            join qq/\n/,
              "Error @_",
              map { " $_" }
                int( $! ) . q/ / . $!,
                int( $^E ) . q/ / . $^E,
                grep( { $!{$_} } keys %! ),
                q/ /;
        }
        __END__

    So you might subclass it like this (untested)

        package Tie::File::UTF8;
        use strict;
        use warnings;
        use parent qw[ Tie::File ];
        use Encode qw[ encode decode ];

        sub TIEARRAY {
            my $class = shift;
            # Tie::File's constructor is TIEARRAY; force raw bytes on the handle
            $class->SUPER::TIEARRAY(
                @_,
                recsep     => "\x0D\x0A",
                discipline => ':raw',
            );
        }

        sub FETCH {
            my( $self, $n ) = @_;
            my $rec = $self->SUPER::FETCH( $n );
            $rec = decode( 'UTF-8', $rec );    # bytes from the file -> characters
            $rec;
        }

        sub STORE {
            my( $self, $n, $rec ) = @_;
            $rec = encode( 'UTF-8', $rec );    # characters -> bytes for the file
            $self->SUPER::STORE( $n, $rec );
        }

        1;

    I don't know about more exotic encodings, but it seems UTF-8 could be supported easily

      Hello.

      I thought getter and setter methods would avoid this trouble. Below is my attempt, MyTieFile.pm.

      And the test script.
      Then I saw your post. Overriding STORE and FETCH seems more elegant. I tried your untested code, but it seems to produce the same warning messages for UTF-8 characters.

      Why???

      This seems to be a known problem, because there is a Stack Overflow thread about it and I see the name "ikegami" there.

      I hope "ikegami" or someone else can explain this problem a little more...

        Then I saw your post. Overriding STORE and FETCH seems more elegant. I tried your untested code, but it seems to produce the same warning messages for UTF-8 characters.

        Why???

        I don't know, I didn't test it :)

        remiah: I studied your module and test script: you've done a very good job - it's working. Thank you for that.

        But, what this effectively does (as UNK noted in his answer here: http://stackoverflow.com/questions/13209474/ ), is re-encoding the data before inserting it into the tied array and the tied file; so the array does not contain Unicode data in internal Perl representation, but instead simply contains the imported UTF-8 strings.

        Now in my project, I am doing regex comparisons and substitutions against the tied array; so if I go this route, I'll have to re-decode the array element before any processing, and re-encode it again.
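The decode, process, re-encode cycle described above can be wrapped in one small helper so it does not have to be repeated at every call site. This is only a sketch under stated assumptions: `edit_record` is a hypothetical name, the file is a scratch file from File::Temp, and Tie::File is opened without any encoding layer so that it only ever sees bytes.

```perl
use strict;
use warnings;
use Tie::File;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Scratch file used only for this illustration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No encoding layer here: the tied array holds UTF-8 encoded bytes.
tie my @tied, 'Tie::File', $path, recsep => "\x0D\x0A"
    or die "tie $path failed: $!";

# Hypothetical helper: decode one record, let the callback edit it as a
# character string through $_, then store the re-encoded bytes.
sub edit_record {
    my ($array, $index, $callback) = @_;
    local $_ = decode('UTF-8', $array->[$index]);
    $callback->();
    $array->[$index] = encode('UTF-8', $_);
    return;
}

# Store "abc", two Hebrew letters, "efg" as UTF-8 bytes.
$tied[0] = encode('UTF-8', "abc \x{5dc}\x{5d0} efg");

# Inside the callback the regex sees characters, so \p{Hebrew} matches.
edit_record(\@tied, 0, sub { s/\p{Hebrew}+/[he]/ });

my $result = decode('UTF-8', $tied[0]);
print "$result\n";   # prints: abc [he] efg

untie @tied;
```

The cost is one decode and one encode per edited record, which is the overhead the post above is weighing.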

        What do you think?

        Many thanks for your well-thought-out answer.

        Helen
