PerlMonks  

Tie::File failing with Unicode/UTF-8 encoding?

by HelenCr (Monk)
on Nov 03, 2012 at 13:32 UTC ( #1002104=perlquestion )
HelenCr has asked for the wisdom of the Perl Monks concerning the following question:

Dear highly esteemed PerlMonks

Update: how do I make the PerlMonks web site show the foreign fonts, instead of the HEX?

I am working on a project which deals with data in foreign languages. My Perl scripts were running fine.

I then wanted to use Tie::File, since this is a neat concept (and saves time and coding).

It seems that Tie::File is failing under Unicode/UTF-8 (unless I am missing something).

Here is a program that demonstrates the problem (the data is a mix of English, Greek and Hebrew):

use strict;
use warnings;
use 5.014;
use Win32::Console;
use autodie;
use warnings qw< FATAL utf8 >;
use Carp;
use Carp::Always;
use utf8;
use feature qw< unicode_strings>;
use charnames qw< :full>;
use Tie::File;

my ($i);
my ( $FileName);
my (@Tied);

binmode STDOUT, ':unix:utf8';
binmode STDERR, ':unix:utf8';
binmode $DB::OUT, ':unix:utf8' if $DB::OUT; # for the debugger
Win32::Console::OutputCP(65001); # Set the console code page to UTF8

$FileName = 'E:\\My Documents\\Technical\\Perl\\Eclipse workspace\\FIBI OCR\\Work\\'.
    'Tie File test res.txt';

tie @Tied, 'Tie::File', $FileName, recsep => "\x0D\x0A", discipline => ':encoding(utf8)'
    or confess 'tie @Tied failed';

$i =0;
while (<DATA>) {
    chomp;
    $Tied[$i] = $_;
    ++$i;
} # end while (<DATA>)

$i =0;
foreach (@Tied) {
    say "$i $Tied[$i]";
    ++$i;
} # end foreach (@Tied)

untie $FileName;
__DATA__
&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
&#960;&#940;&#961;&#964;&#949; &#964;&#959; &#942; &#945;&#966;&#942;&#963;&#964;&#949; &#964;&#959;
&#1513;&#1500;&#1493;&#1501; &#1495;&#1489;&#1512;&#1497;&#1501;
abc &#1500;&#1488; &#1499;&#1503;&#1499;&#1503; efg
&#1502;&#1514;&#1497; &#1493;&#1500;&#1488;&#1503; This is it
&#1502;&#1506;&#1499;&#1513;&#1497;&#1493; &#1500;&#1506;&#1499;&#1513;&#1497;&#1493;
&#931;&#942;&#956;&#949;&#961;&#945; &#949;&#943;&#957;&#945;&#953; &#932;&#961;&#943;&#964;&#951;
&#920;&#941;&#955;&#969; &#957;&#945; &#966;&#940;&#969;
&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
&#1513;&#1493;&#1512;&#1492; &#1502;&#1505;' 5

This produces a huge cascade of warnings; here are some of them:

utf8 "\xCE" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
    Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
    Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
    Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;') called at tie file test.pl line 31
utf8 "\xCF" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
    Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
    Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
    Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;') called at tie file test.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
    Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
    Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
    Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;') called at tie file test.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
    Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
    Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
    Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;') called at tie file test.pl line 31

Then it prints this on STDOUT:

0 &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
1 &#960;&#940;&#961;&#964;&#949; &#964;&#959; &#942; &#945;&#966;&#942;&#963;&#964;&#949; &#964;&#959;
2 &#1513;&#1500;&#1493;&#1501; &#1495;&#1489;&#1512;&#1497;&#1501;
3 abc &#1500;&#1488; &#1499;&#1503;&#1499;&#1503; efg
4 &#1502;&#1514;&#1497; &#1493;&#1500;&#1488;&#1503; This is it
5 &#1502;&#1506;&#1499;&#1513;&#1497;&#1493; &#1500;&#1506;&#1499;&#1513;&#1497;&#1493;
6 &#931;&#942;&#956;&#949;&#961;&#945; &#949;&#943;&#957;&#945;&#953; &#932;&#961;&#943;&#964;&#951;
7 &#920;&#941;&#955;&#969; &#957;&#945; &#966;&#940;&#969;
8 &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
9 &#1513;&#1493;&#1512;&#1492; &#1502;&#1505;' 5
10
11
12
13
14 \xA4\x&#920;&#941;&#955;&#969;\xA8\x
15
16
17
18
19

Note that the first 10 lines (0 through 9) are OK, but lines 10 through 19 came from nowhere!?

In addition, the output file contains corrupted data:

&#964;&#953; &#954;&#940;&#957;&#975;N&#847;&#334;&#1376;&#964;&#942;&#963;&#964;&#949; &#1513;&#1500; &#1495;&#1489;&#1512;&#1569;bc &#1500;&#1559;&#1815;&#2071;&#1815;&#2016;e&#1502;&#1514;&#1493;&#1500;&#1488;&#1503; This is &#1502;&#1506;&#1497;&#1493; &#1500;&#1506;&#1499;&#1550;&#270;&#974;&#1870;&#1423;&#957;&#945;&#953; &#932;&#961;&#920;&#941;&#974;&#1934;&#1120;&#966;&#975;&#334;&#1632;&#954;&#964;&#949;;&#1513;&#1512;&#1492; &#1502;&#1505;' \xA4\x&#920;&#941;&#955;&#969;\xA8\x

Something is very wrong here. Either I am missing something, or Tie::File can't cope with Unicode/UTF-8?

I am running Strawberry Perl 5.14 on a Windows 7 system.

Many TIA - Helen

Note: cross-posted on http://stackoverflow.com/questions/13209474/
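
The bytes named in the warnings above (\xCE, \xCF, \xD7) are the UTF-8 lead bytes of the Greek and Hebrew letters in the test data. The sketch below only illustrates that fact and shows that a lead byte separated from its continuation byte is rejected by a strict decode; the thread itself does not confirm where in Tie::File the split happens. The helper name hex_bytes is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Format a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# First Greek word and one Hebrew letter from the test data,
# written as code points so this script stays plain ASCII.
my $greek  = "\x{3C4}\x{3B9}";   # tau, iota
my $hebrew = "\x{5E9}";          # shin

# Each of these characters occupies two bytes once encoded as UTF-8.
my $greek_bytes  = encode('UTF-8', $greek);
my $hebrew_bytes = encode('UTF-8', $hebrew);
printf "greek:  %d characters, bytes %s\n", length($greek),  hex_bytes($greek_bytes);
printf "hebrew: %d characters, bytes %s\n", length($hebrew), hex_bytes($hebrew_bytes);

# A lead byte without its continuation byte is not valid UTF-8,
# so a strict decode refuses it.
my $truncated  = substr($greek_bytes, 0, 1);
my $decoded_ok = eval { decode('UTF-8', $truncated, Encode::FB_CROAK); 1 } ? 1 : 0;
print $decoded_ok ? "strict decode succeeded\n" : "strict decode failed: $@";
```

Any code that measures or seeks in characters while the file is stored in bytes can end up reading such a partial sequence, which is one plausible way to arrive at these warnings.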

Re: Tie::File failing with Unicode/UTF-8 encoding?
by Kenosis (Priest) on Nov 03, 2012 at 15:34 UTC

    Cross-posted on StackOverflow.

    "It is considered polite to inform about crossposting so people not attending both sites do not waste their time solving a problem already closed at the other end of the internet."

    My apologies. I failed to see your notice at the bottom of your posting...

      The OP did note that she cross-posted.

        Indeed. I failed to note her note. Corrected. Thanks...

Re: Tie::File failing with Unicode/UTF-8 encoding?
by Kenosis (Priest) on Nov 03, 2012 at 16:38 UTC

    Running your code produces the following console output on my machine (Windows 7, ActivePerl):

    0 &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
    1 &#960;&#940;&#961;&#964;&#949; &#964;&#959; &#942; &#945;&#966;&#942;&#963;&#964;&#949; &#964;&#959;
    2 &#1513;&#1500;&#1493;&#1501; &#1495;&#1489;&#1512;&#1497;&#1501;
    3 abc &#1500;&#1488; &#1499;&#1503;&#1499;&#1503; efg
    4 &#1502;&#1514;&#1497; &#1493;&#1500;&#1488;&#1503; This is it
    5 &#1502;&#1506;&#1499;&#1513;&#1497;&#1493; &#1500;&#1506;&#1499;&#1513;&#1497;&#1493;
    6 &#931;&#942;&#956;&#949;&#961;&#945; &#949;&#943;&#957;&#945;&#953; &#932;&#961;&#943;&#964;&#951;
    7 &#920;&#941;&#955;&#969; &#957;&#945; &#966;&#940;&#969;
    8 &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
    9 &#1513;&#1493;&#1512;&#1492; &#1502;&#1505;' 5

    The tied-file's contents:

    &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
    &#960;&#940;&#961;&#964;&#949; &#964;&#959; &#942; &#945;&#966;&#942;&#963;&#964;&#949; &#964;&#959;
    &#1513;&#1500;&#1493;&#1501; &#1495;&#1489;&#1512;&#1497;&#1501;
    abc &#1500;&#1488; &#1499;&#1503;&#1499;&#1503; efg
    &#1502;&#1514;&#1497; &#1493;&#1500;&#1488;&#1503; This is it
    &#1502;&#1506;&#1499;&#1513;&#1497;&#1493; &#1500;&#1506;&#1499;&#1513;&#1497;&#1493;
    &#931;&#942;&#956;&#949;&#961;&#945; &#949;&#943;&#957;&#945;&#953; &#932;&#961;&#943;&#964;&#951;
    &#920;&#941;&#955;&#969; &#957;&#945; &#966;&#940;&#969;
    &#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;
    &#1513;&#1493;&#1512;&#1492; &#1502;&#1505;' 5

    I didn't get the errors you're reporting.

    On a separate topic, Tie::File works well with smaller files, but performance significantly slows with files > 10M, in case you may be working with such.

      Kenosis: this is very interesting.

      1. Is there a quick way to convert back the PerlMonks output, so that Unicode/UTF8 editors can show it in the original encoding?

      2. It seems that in your setup (apparently, the difference is ActivePerl vs. StrawberryPerl/DWIMPerl that I am using), the problem does not occur. Could the cause be the Tie::File code? StrawberryPerl uses $VERSION = "0.97_02"; Which version is ActivePerl using?

        1. I don't know about these things... Yet...
        2. ActivePerl uses $VERSION = "0.98" for Tie::File. Perhaps upgrading the Module is worth a try (or tie).
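
On question 1 (turning the site's numeric output back into readable text), one possible approach is to replace each decimal character reference with the character it names. This is a minimal sketch and not something proposed in the thread; it handles only the decimal `&#NNN;` form seen in this node, and the variable names are illustrative.

```perl
use strict;
use warnings;

# First line of the test data, as the site displays it.
my $escaped = '&#964;&#953; &#954;&#940;&#957;&#949;&#964;&#949;;';

# Replace every decimal character reference with the character it names.
# The final ";" of the sample is ordinary punctuation and is left alone.
(my $text = $escaped) =~ s/&#(\d+);/chr($1)/ge;

# Encode on output so a UTF-8 terminal or editor shows the Greek text.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Redirecting the output to a file gives a UTF-8 file that a Unicode-aware editor can open directly.
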
Re: Tie::File failing with Unicode/UTF-8 encoding?
by Anonymous Monk on Nov 03, 2012 at 16:55 UTC

    Update: how do I make the PerlMonks web site show the foreign fonts, instead of the HEX?

    You can't, sorry

    The good news is you don't really need to :) simply post proper Dumps ;)

    perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " input.txt > input_txt_as_perl.pl
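
    If Data::Dump and File::Slurp are not installed, a similar ASCII-only dump can be sketched with the core Data::Dumper module. This is an alternative to the one-liner above, not what the poster used; the sample bytes are the UTF-8 encoding of the first two Greek letters of the test data.

```perl
use strict;
use warnings;
use Data::Dumper;

# Raw UTF-8 bytes for the first Greek word of the test data (tau, iota).
my $bytes = "\xCF\x84\xCE\xB9";

local $Data::Dumper::Useqq = 1;   # escape non-printable and high bytes
local $Data::Dumper::Terse = 1;   # emit just the value, no "$VAR1 = "
my $dump = Dumper($bytes);
print $dump;

# The dump is plain ASCII and evaluates back to the same bytes.
my $round_trip = eval $dump;
print "round trip ok\n" if defined $round_trip && $round_trip eq $bytes;
```

    Because the dump contains only ASCII, it survives being pasted into a post unchanged.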

Re: Tie::File failing with Unicode/UTF-8 encoding?
by Anonymous Monk on Nov 04, 2012 at 00:51 UTC

    I don't think it is supported. The option you used is called discipline, which is Perl 5.6 terminology; that feature is now called PerlIO layers.

    Also, that option is not documented in Tie::File. And in 5.6 the feature only supported :raw and :crlf, as :encoding did not exist. You should file a bug report (did I say that?)
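
    For readers unfamiliar with the term, this is what a PerlIO layer does on an ordinary filehandle, outside Tie::File. It is a minimal sketch using a temporary file; nothing here comes from the Tie::File source.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file that is removed when the script exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $line = "\x{3C4}\x{3B9} abc";   # two Greek letters, a space, then ASCII

# Write through an encoding layer: characters in, UTF-8 bytes on disk.
open my $out, '>:encoding(UTF-8)', $path or die "open for write: $!";
print {$out} "$line\n";
close $out or die "close: $!";

# Read back through the same layer: bytes on disk, characters out.
open my $in, '<:encoding(UTF-8)', $path or die "open for read: $!";
my $got = <$in>;
close $in;
chomp $got;

my $size = -s $path;
printf "characters: %d, bytes on disk: %d\n", length($got), $size;
print "round trip ok\n" if $got eq $line;
```

    The character count and the byte count differ, which is the distinction any code that seeks by byte offset has to respect.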

    But you can work around this limitation by doing your own encoding/decoding. This seems to work

    #!/usr/bin/perl --
    use Path::Class;
    use constant THISFILE => file( __FILE__ )->absolute->stringify;
    use constant THISDIR => file( THISFILE )->dir->stringify;
    use strict;
    use warnings;
    use Tie::File;
    use Fcntl 'O_RDWR', 'O_CREAT';
    use Encode;

    chdir THISDIR or die Fudge( 'chdir', THISDIR );
    Main( @ARGV );
    exit( 0 );

    sub Main {
        binmode STDOUT;
        my $FileName = "".file( THISFILE . '.tf.utf8.txt' );
        tie my(@Tied), 'Tie::File', $FileName,
            recsep => "\x0D\x0A",
            mode => O_RDWR | O_CREAT,
          or die Fudge( tie => $FileName );
        if( @Tied > 2 ){
            print 'size ', int @Tied, "\n";
            print map { decode('UTF-8', $_ ); "($_)" } @Tied;
        } else {
            push @Tied, encode('UTF-8', chr $_) for 272 .. 30000;
        }
        untie @Tied;
    }

    sub Fudge {
        use Errno();
        join qq/\n/,
          "Error @_",
          map { " $_" } int( $! ) . q/ / . $!,
          int( $^E ) . q/ / . $^E,
          grep( { $!{$_} } keys %! ),
          q/ /;
    }
    __END__

    So you might subclass it like this (untested)

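    package Tie::File::UTF8;
    use strict;
    use warnings;
    use parent qw[ Tie::File ];
    use Encode qw[ encode decode ];

    sub TIEARRAY {
        my $class = shift;
        # Tie::File's constructor is TIEARRAY; it has no new()
        $class->SUPER::TIEARRAY(
            @_,
            recsep     => "\x0D\x0A",
            discipline => ':raw',
        );
    }

    sub FETCH {
        my( $self, $n ) = @_;
        my $rec = $self->SUPER::FETCH( $n );
        # records past the end of the file come back undefined
        $rec = decode( 'UTF-8', $rec ) if defined $rec;
        $rec;
    }

    sub STORE {
        my( $self, $n, $rec ) = @_;
        $rec = encode( 'UTF-8', $rec );
        # hand the encoded bytes to the parent's STORE
        $self->SUPER::STORE( $n, $rec );
    }

    1;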

    I don't know about more exotic encodings, but it seems UTF-8 could be supported easily.
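
    The encode-on-store, decode-on-fetch workaround can also be tried in isolation. The following is a minimal sketch using a temporary file and the default record separator, with no discipline option at all; the sample records and variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Tie::File;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Scratch file that is removed when the script exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Character strings: two Greek letters, and a Hebrew letter between ASCII words.
my @records = ("\x{3C4}\x{3B9}", "abc \x{5E9} efg");

# No discipline option: Tie::File only ever sees byte strings.
tie my @tied, 'Tie::File', $path or die "tie failed: $!";
push @tied, map { encode('UTF-8', $_) } @records;
untie @tied;

# Re-tie the same file and decode each record on the way out.
tie @tied, 'Tie::File', $path or die "tie failed: $!";
my @back = map { my $bytes = $_; decode('UTF-8', $bytes) } @tied;
untie @tied;

printf "%d records read back, %s\n", scalar @back,
    ("@back" eq "@records" ? 'identical to the originals' : 'different from the originals');
```

    The cost of this approach is that every element of the tied array is a byte string, so any text processing has to decode first.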

      Hello.

      I thought getter and setter methods would avoid this trouble. Below is my attempt, MyTieFile.pm.

      And a test script.
      Then I saw your post. Overriding STORE and FETCH seems more elegant. I tried your untested code, but it seems to produce the same warning messages for UTF-8 characters.

      Why???

      This seems to be a known problem, because there is a StackOverflow thread about it and I see the name "ikegami" there.

      I hope "ikegami" or someone else can explain this problem a little more...

        Then I saw your post. Overriding STORE and FETCH seems more elegant. I tried your untested code, but it seems to produce the same warning messages for UTF-8 characters.

        Why???

        I don't know, I didn't test it :)

        remiah: I studied your module and test script: you've done a very good job - it's working. Thank you for that.

        But, what this effectively does (as UNK noted in his answer here: http://stackoverflow.com/questions/13209474/ ), is re-encoding the data before inserting it into the tied array and the tied file; so the array does not contain Unicode data in internal Perl representation, but instead simply contains the imported UTF-8 strings.

        Now in my project, I am doing regex comparisons and substitutions against the tied array; so if I go this route, I'll have to re-decode the array element before any processing, and re-encode it again.

        What do you think?

        Many thanks for your well-thought-out answer.

        Helen
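
The decode, process, re-encode cycle described in the last post can be kept in one place with a small wrapper. This is a minimal sketch, assuming the tied file holds UTF-8 bytes as in the workaround above; update_record is a hypothetical helper name and is not part of Tie::File or of any module mentioned in the thread.

```perl
use strict;
use warnings;
use Tie::File;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Hypothetical helper: decode one record, let $code transform the text
# as characters, then store the re-encoded result in the same slot.
sub update_record {
    my ($tied, $index, $code) = @_;
    my $bytes = $tied->[$index];
    my $text  = decode('UTF-8', $bytes);
    $tied->[$index] = encode('UTF-8', $code->($text));
    return;
}

# Scratch file that is removed when the script exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

tie my @tied, 'Tie::File', $path or die "tie failed: $!";
push @tied, encode('UTF-8', "abc \x{5DC}\x{5D0} efg");   # ASCII, two Hebrew letters, ASCII

# The callback sees characters, so \p{Hebrew} matches the Hebrew word as a unit.
update_record(\@tied, 0, sub {
    my ($text) = @_;
    $text =~ s/\p{Hebrew}+/[he]/;
    return $text;
});

my $stored = $tied[0];
my $result = decode('UTF-8', $stored);
untie @tied;
print "$result\n";
```

The regex work still happens on decoded text, but the encoding steps are written once instead of around every comparison or substitution.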
