Tie::File failing with Unicode/UTF-8 encoding?

HelenCr has asked for the wisdom of the Perl Monks concerning the following question:

Dear highly esteemed PerlMonks

Update: how do I make the PerlMonks web site show the foreign fonts, instead of the HEX?

I am working on a project which deals with data in foreign languages. My Perl scripts were running fine.

I then wanted to use Tie::File, since this is a neat concept (and saves time and coding).

It seems that Tie::File is failing under Unicode/UTF-8 (unless I am missing something).

Here is a program which depicts the problem: (The data is a mix of English, Greek and Hebrew).

use strict;
use warnings;
use 5.014;
use Win32::Console;
use autodie;
use warnings qw< FATAL utf8 >;
use Carp;
use Carp::Always;
use utf8;
use feature   qw< unicode_strings >;
use charnames qw< :full >;
use Tie::File;

my ($i);
my ($FileName);
my (@Tied);
binmode STDOUT, ':unix:utf8';
binmode STDERR, ':unix:utf8';
binmode $DB::OUT, ':unix:utf8' if $DB::OUT; # for the debugger
Win32::Console::OutputCP(65001);            # Set the console code page to UTF8

$FileName = 'E:\\My Documents\\Technical\\Perl\\Eclipse workspace\\FIBI OCR\\Work\\'.
        'Tie File test res.txt';
tie @Tied, 'Tie::File', $FileName, recsep => "\x0D\x0A", discipline => ':encoding(utf8)'
            or confess 'tie @Tied failed';
$i = 0;
while (<DATA>) {
    chomp;
    $Tied[$i] = $_;
    ++$i;
} # end while (<DATA>)
$i = 0;
foreach (@Tied) {
    say "$i $Tied[$i]";
    ++$i;
} # end foreach (@Tied)
untie @Tied;
__DATA__
τι κάνετε;
πάρτε το ή αφήστε το
שלום חברים
abc לא כןכן efg
מתי ולאן This is it
מעכשיו לעכשיו
Σήμερα είναι Τρίτη
Θέλω να φάω
τι κάνετε;
שורה מס' 5

This produces a huge cascade of warnings; here are some of them:

utf8 "\xCE" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31
utf8 "\xCF" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31

Then it prints this on STDOUT:

0 τι κάνετε;
1 πάρτε το ή αφήστε το
2 שלום חברים
3 abc לא כןכן efg
4 מתי ולאן This is it
5 מעכשיו לעכשיו
6 Σήμερα είναι Τρίτη
7 Θέλω να φάω
8 τι κάνετε;
9 שורה מס' 5
10
11
12
13
14 \xA4\xΘέλω\xA8\x

15
16
17
18

19

Note that the first ten lines (0 through 9) are OK, but lines 10 through 19 came from nowhere!

In addition, the output file contains corrupted data:

 &#964;&#953; &#954;&#940;&#957;&#975;N&#847;Ź&#334;&#1376;&#964;&#942;&#963;&#964;&#949; &#1513;&#1500; &#1495;&#1489;&#1512;&#1569;bc &#1500;&#1559;&#1815;&#2071;&#1815;&#2016;e&#1502;&#1514;&#1493;&#1500;&#1488;&#1503; This is &#1502;&#1506;&#1497;&#1493; &#1500;&#1506;&#1499;&#1550;&#270;&#974;&#1870;&#1423;&#957;&#945;&#953; &#932;&#961;&#920;&#941;&#974;&#1934;&#1120;&#966;&#975;&#334;&#1632;&#954;&#964;&#949;;&#1513;&#1512;&#1492; &#1502;&#1505;'



\xA4\xΘέλω\xA8\x

Something is very wrong here. Either I am missing something, or Tie::File can't cope with Unicode/UTF-8?

I am running Strawberry Perl 5.14 on a Windows 7 system.

Many TIA - Helen

Note: cross-posted on http://stackoverflow.com/questions/13209474/


Replies are listed 'Best First'.
Re: Tie::File failing with Unicode/UTF-8 encoding? by Kenosis (Priest) on Nov 03, 2012 at 16:38 UTC
Running your code produces the following console output on my machine (Windows 7, ActivePerl):

0 τι κάνετε;
1 πάρτε το ή αφήστε το
2 שלום חברים
3 abc לא כןכן efg
4 מתי ולאן This is it
5 מעכשיו לעכשיו
6 Σήμερα είναι Τρίτη
7 Θέλω να φάω
8 τι κάνετε;
9 שורה מס' 5

The tied-file's contents:

τι κάνετε;
πάρτε το ή αφήστε το
שלום חברים
abc לא כןכן efg
מתי ולאן This is it
מעכשיו לעכשיו
Σήμερα είναι Τρίτη
Θέλω να φάω
τι κάνετε;
שורה מס' 5

I didn't get the errors you're reporting.

On a separate topic, Tie::File works well with smaller files, but performance significantly slows with files > 10M, in case you may be working with such.
Re^2: Tie::File failing with Unicode/UTF-8 encoding? by HelenCr (Monk) on Nov 03, 2012 at 18:46 UTC
Kenosis: this is very interesting.

1. Is there a quick way to convert back the PerlMonks output, so that Unicode/UTF-8 editors can show it in the original encoding?

2. It seems that in your setup (apparently, the difference is ActivePerl vs. the StrawberryPerl/DWIMPerl that I am using), the problem does not occur. Could the cause be the Tie::File code? StrawberryPerl uses $VERSION = "0.97_02"; which version is ActivePerl using?
Re^3: Tie::File failing with Unicode/UTF-8 encoding? by Kenosis (Priest) on Nov 03, 2012 at 19:03 UTC
I don't know about these things... Yet... ActivePerl uses $VERSION = "0.98" for Tie::File. Perhaps upgrading the module is worth a try (or `tie`).
Re^4: Tie::File failing with Unicode/UTF-8 encoding? by HelenCr (Monk) on Nov 03, 2012 at 19:27 UTC
Re: Tie::File failing with Unicode/UTF-8 encoding? by Anonymous Monk on Nov 04, 2012 at 00:51 UTC
I don't think it is supported. The option you used is called discipline, and that is Perl 5.6 talk; that feature is now called PerlIO layers. Also, that feature is not documented in Tie::File. And in 5.6 that feature only supported :raw and :crlf, as :encoding did not exist. You should file a bug report (did I say that?)

But you can work around this limitation by doing your own encoding/decoding. This seems to work:

#!/usr/bin/perl --
use Path::Class;
use constant THISFILE => file( __FILE__ )->absolute->stringify;
use constant THISDIR  => file( THISFILE )->dir->stringify;
use strict;
use warnings;
use Tie::File;
use Fcntl 'O_RDWR', 'O_CREAT';
use Encode;
chdir THISDIR or die Fudge( 'chdir', THISDIR );
Main( @ARGV );
exit( 0 );

sub Main {
    binmode STDOUT;
    my $FileName = "".file( THISFILE . '.tf.utf8.txt' );
    tie my(@Tied), 'Tie::File', $FileName,
        recsep => "\x0D\x0A",
        mode   => O_RDWR | O_CREAT,
      or die Fudge( tie => $FileName );
    if( @Tied > 2 ){
        print 'size ', int @Tied, "\n";
        print map { decode('UTF-8', $_ ); "($_)" } @Tied;
    } else {
        push @Tied, encode('UTF-8', chr $_) for 272 .. 30000;
    }
    untie @Tied;
}

sub Fudge {
    use Errno();
    join qq/\n/,
      "Error @_",
      map { " $_" }
      int( $! ) . q/ / . $!,
      int( $^E ) . q/ / . $^E,
      grep( { $!{$_} } keys %! ),
      q/ /;
}
__END__

So you might subclass it like this (untested):

package Tie::File::UTF8;
use parent qw[ Tie::File ];
use Encode qw[ encode decode ];

sub TIEARRAY {
    my $class = shift;
    $class->SUPER::TIEARRAY(
        @_,
        recsep     => "\x0D\x0A",
        discipline => ':raw',
    );
}

sub FETCH {
    my( $self, $n ) = @_;
    my $rec = $self->SUPER::FETCH( $n );
    $rec = decode( 'UTF-8', $rec );
    $rec;
}

sub STORE {
    my( $self, $n, $rec ) = @_;
    $rec = encode( 'UTF-8', $rec );
    $self->SUPER::STORE( $n, $rec );
}

1;

I don't know about more exotic encodings, but it seems UTF-8 could be supported easily.
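As a small illustration of why byte-level handling matters here, the sketch below compares the character count and the UTF-8 byte count of the first test record, and shows its first encoded byte. It rests on an assumption the thread does not confirm: that the trouble comes from record sizes being counted in characters in one place and in bytes in another. The helper name utf8_profile is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: character count, UTF-8 byte count and first
# encoded byte (formatted like the warnings, e.g. \xCF) of a string.
sub utf8_profile {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    return (length $text, length $octets, sprintf('\x%02X', ord $octets));
}

# The first DATA record, written with escapes so that the sketch does
# not depend on the encoding of the source file.
my $greek = "\x{3C4}\x{3B9} \x{3BA}\x{3AC}\x{3BD}\x{3B5}\x{3C4}\x{3B5};";

my ($chars, $bytes, $lead) = utf8_profile($greek);
print "$chars characters, $bytes bytes, first byte $lead\n";
```

The bytes named in the warnings (\xCE, \xCF, \xD7) are the lead bytes of two-byte UTF-8 sequences for Greek and Hebrew letters, which would be consistent with a read that starts or stops in the middle of an encoded character.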
Re^2: Tie::File failing with Unicode/UTF-8 encoding? by remiah (Hermit) on Nov 04, 2012 at 02:24 UTC
Hello. I thought getter and setter methods would avoid this trouble. Below was my trial.

MyTieFile.pm: Read more... (1243 Bytes)

And the test script: Read more... (929 Bytes)

Then I saw your post; overriding STORE and FETCH seems more elegant. I tried your untested code, but it seems to produce the same warning messages for UTF-8 characters. Why???

This seems to be a known problem, because there is a StackOverflow thread and I see the name "ikegami" there. I hope "ikegami" or someone else explains this problem a little more...
Re^3: Tie::File failing with Unicode/UTF-8 encoding? by HelenCr (Monk) on Nov 05, 2012 at 17:20 UTC
remiah: I studied your module and test script: you've done a very good job - it's working. Thank you for that.

But what this effectively does (as UNK noted in his answer here: http://stackoverflow.com/questions/13209474/) is re-encode the data before inserting it into the tied array and the tied file; so the array does not contain Unicode data in Perl's internal representation, but instead simply contains the imported UTF-8 byte strings.

Now, in my project, I am doing regex comparisons and substitutions against the tied array; so if I go this route, I'll have to decode each array element before any processing, and encode it again afterwards.

What do you think? Many thanks for your well-thought-out answer.

Helen
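The decode, process, encode round trip described in this reply can be sketched as follows. This is only an illustration under stated assumptions: the tied array holds raw UTF-8 bytes (no discipline option is passed), and the helper name edit_record and the temporary file are hypothetical. It is not taken from remiah's module.

```perl
use strict;
use warnings;
use Encode     qw(encode decode);
use File::Temp qw(tempfile);
use Tie::File;

# Hypothetical helper: decode one record, apply a character-level edit,
# and store the re-encoded result back into the byte-oriented tied array.
sub edit_record {
    my ($tied, $index, $edit) = @_;
    my $chars = decode('UTF-8', $tied->[$index]);        # bytes -> characters
    $tied->[$index] = encode('UTF-8', $edit->($chars));  # characters -> bytes
    return;
}

# Throwaway file standing in for the real data file.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;

tie my @rec, 'Tie::File', $path, recsep => "\x0D\x0A"
    or die "tie failed: $!";

# First DATA record ("ti kanete;" in Greek), written with escapes.
push @rec, encode('UTF-8', "\x{3C4}\x{3B9} \x{3BA}\x{3AC}\x{3BD}\x{3B5}\x{3C4}\x{3B5};");

# The regex sees characters, so \w and \U apply to the Greek letters.
edit_record(\@rec, 0, sub { my $s = shift; $s =~ s/^(\w+)/\U$1/; $s });

my $first = decode('UTF-8', $rec[0]);
untie @rec;

print encode('UTF-8', "$first\n");
```

The cost Helen points out is visible here: every access pays for a decode and every update for an encode, so the helper mainly keeps that bookkeeping in one place.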
Re^4: Tie::File failing with Unicode/UTF-8 encoding? by remiah (Hermit) on Nov 05, 2012 at 22:10 UTC
Re^5: Tie::File failing with Unicode/UTF-8 encoding? by HelenCr (Monk) on Nov 07, 2012 at 10:29 UTC
Some notes below your chosen depth have not been shown here
Re^3: Tie::File failing with Unicode/UTF-8 encoding? by Anonymous Monk on Nov 04, 2012 at 12:55 UTC
"And I saw your post. So, overriding STORE,FETCH seems more elegant. I tried your untested code but it seems producing same warnings message for utf8 characters. Why???"

I don't know, I didn't test it :)
Re^4: Tie::File failing with Unicode/UTF-8 encoding? by HelenCr (Monk) on Nov 04, 2012 at 15:45 UTC
Re^5: Tie::File failing with Unicode/UTF-8 encoding? by Anonymous Monk on Nov 04, 2012 at 16:11 UTC
Re: Tie::File failing with Unicode/UTF-8 encoding? by Anonymous Monk on Nov 03, 2012 at 16:55 UTC
"Update: how do I make the PerlMonks web site show the foreign fonts, instead of the HEX?"

You can't, sorry. The good news is you don't really need to :) Simply post proper dumps ;)

perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " input.txt > input_txt_as_perl.pl
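The same idea, posting data as an ASCII-only Perl literal, can also be sketched with the core Data::Dumper module instead of Data::Dump and File::Slurp. This is an alternative under that assumption, not the one-liner above, and the helper name ascii_dump is hypothetical.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: render a string as a pure-ASCII Perl literal
# that can be pasted into a post and evaluated back into the string.
sub ascii_dump {
    my ($string) = @_;
    local $Data::Dumper::Useqq = 1;   # double-quoted output with escapes
    local $Data::Dumper::Terse = 1;   # bare literal, without "$VAR1 = "
    return Dumper($string);
}

# The Hebrew word from DATA record 3, followed by " abc", written with escapes.
my $hebrew = "\x{5E9}\x{5DC}\x{5D5}\x{5DD} abc";
my $dumped = ascii_dump($hebrew);
print $dumped;
```

Because the dump contains only ASCII, it survives a site that would otherwise turn non-ASCII characters into numeric entities.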
Re: Tie::File failing with Unicode/UTF-8 encoding? by Kenosis (Priest) on Nov 03, 2012 at 15:34 UTC
~~Crossed posted on StackOverflow.~~ ~~"It is considered polite to inform about crossposting so people not attending both sites do not waste their time solving a problem already closed at the other end of the internet."~~ My apologies. I failed to see your notice at the bottom of your posting...
Re^2: Tie::File failing with Unicode/UTF-8 encoding? by Anonymous Monk on Nov 03, 2012 at 15:38 UTC
The OP did note that she cross-posted.
Re^3: Tie::File failing with Unicode/UTF-8 encoding? by Kenosis (Priest) on Nov 03, 2012 at 15:50 UTC
Indeed. I failed to note her note. Corrected. Thanks...
