
Re: Tie::File failing with Unicode/UTF-8 encoding?

by Anonymous Monk
on Nov 04, 2012 at 00:51 UTC (#1002151)

in reply to Tie::File failing with Unicode/UTF-8 encoding?

I don't think it is supported. The option you used is called discipline, which is Perl 5.6 terminology; that feature is now called PerlIO layers.

Also, that option is not documented in Tie::File. And in 5.6 the feature only supported :raw and :crlf, as :encoding did not exist. You should file a bug report (did I say that?)
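For reference, here is a minimal sketch of what a PerlIO layer does on an ordinary filehandle, outside Tie::File. The file name layers-demo.txt is made up for the illustration; nothing here touches Tie::File itself.

```perl
use strict;
use warnings;

my $file = 'layers-demo.txt';    # hypothetical scratch file
my $text = "\x{3C4}\x{3B9}";     # two Greek letters: tau, iota

# Write through an :encoding(UTF-8) layer: characters in, UTF-8 bytes on disk.
open my $out, '>:encoding(UTF-8)', $file or die "open $file: $!";
print {$out} $text;
close $out or die "close $file: $!";

# Read back with :raw to see the bytes that were actually stored.
open my $raw, '<:raw', $file or die "open $file: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read back through the same layer to get characters again.
open my $in, '<:encoding(UTF-8)', $file or die "open $file: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "bytes: %d, characters: %d\n", length $bytes, length $chars;
unlink $file;
```

The same two letters are four bytes on disk and two characters once decoded, which is the distinction the layer hides from the rest of the program.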

But you can work around this limitation by doing your own encoding/decoding. This seems to work:

#!/usr/bin/perl --
use Path::Class;
use constant THISFILE => file( __FILE__ )->absolute->stringify;
use constant THISDIR  => file( THISFILE )->dir->stringify;
use strict;
use warnings;
use Tie::File;
use Fcntl 'O_RDWR', 'O_CREAT';
use Encode;

chdir THISDIR or die Fudge( 'chdir', THISDIR );
Main( @ARGV );
exit( 0 );

sub Main {
    binmode STDOUT;
    my $FileName = "".file( THISFILE . '.tf.utf8.txt' );
    tie my(@Tied), 'Tie::File', $FileName,
      recsep => "\x0D\x0A",
      mode   => O_RDWR | O_CREAT,
      or die Fudge( tie => $FileName );

    if( @Tied > 2 ){
        print 'size ', int @Tied, "\n";
        print map { decode('UTF-8', $_ ); "($_)" } @Tied;
    } else {
        push @Tied, encode('UTF-8', chr $_) for 272 .. 30000;
    }
    untie @Tied;
}

sub Fudge {
    use Errno();
    join qq/\n/, "Error @_",
      map { " $_" }
        int( $! ) . q/ / . $!,
        int( $^E ) . q/ / . $^E,
        grep( { $!{$_} } keys %! ),
        q/ /;
}
__END__

So you might subclass it like this (untested):

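package Tie::File::UTF8;
use strict;
use warnings;
use parent qw[ Tie::File ];
use Encode qw( decode encode );

sub TIEARRAY {
    my $class = shift;
    # tie() calls TIEARRAY (Tie::File has no new), so delegate to the parent's TIEARRAY.
    $class->SUPER::TIEARRAY(
        @_,
        recsep     => "\x0D\x0A",
        discipline => ':raw',
    );
}

# Decode the stored UTF-8 bytes into Perl characters on the way out.
sub FETCH {
    my( $self, $n ) = @_;
    my $rec = $self->SUPER::FETCH( $n );
    $rec = decode( 'UTF-8', $rec );
    $rec;
}

# Encode Perl characters into UTF-8 bytes on the way in.
sub STORE {
    my( $self, $n, $rec ) = @_;
    $rec = encode( 'UTF-8', $rec );
    $self->SUPER::STORE( $n, $rec );
}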

I don't know about more exotic encodings, but it seems UTF-8 could be supported easily.

Re^2: Tie::File failing with Unicode/UTF-8 encoding?
by remiah (Hermit) on Nov 04, 2012 at 02:24 UTC


    I thought getter and setter methods would avoid this trouble. Below was my attempt.

    And the test script.
    Then I saw your post. Overriding STORE and FETCH seems more elegant. I tried your untested code, but it seems to produce the same warning message for UTF-8 characters.


    This seems to be a known problem, because there is a Stack Overflow thread about it and I see the name "ikegami" there.

    I hope "ikegami" or someone else can explain this problem a little more...

      remiah: I studied your module and test script: you've done a very good job - it's working. Thank you for that.

      But what this effectively does (as Unk noted in his Stack Overflow answer) is encode the data before inserting it into the tied array and the tied file; so the array does not contain Unicode data in Perl's internal representation, but simply contains the UTF-8 encoded strings.

      Now in my project, I am doing regex comparisons and substitutions against the tied array; so if I go this route, I'll have to decode each array element before any processing, and encode it again afterwards.

      What do you think?

      Many thanks for your well-thought-out answer.
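To make the concern above concrete, here is a minimal sketch of how the same record behaves as UTF-8 bytes versus decoded characters, and what the decode / process / encode round trip looks like. The sample string and the substitution are made up for the illustration; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars = "\x{3C4}\x{3B9}";           # tau, iota as Perl characters
my $bytes = encode('UTF-8', $chars);    # the form stored in the tied array

# On the encoded form, length() and regexes count bytes, not letters.
my $byte_len   = length $bytes;
my $byte_match = $bytes =~ /\A.{2}\z/ ? 1 : 0;

# After decoding, the same checks see two characters.
my $rec        = decode('UTF-8', $bytes);
my $char_len   = length $rec;
my $char_match = $rec =~ /\A.{2}\z/ ? 1 : 0;

# Round trip: decode, substitute on characters, encode again for storage.
$rec =~ s/\x{3C4}/\x{3A4}/;             # small tau -> capital Tau
my $stored = encode('UTF-8', $rec);

printf "bytes: len=%d match=%d\n", $byte_len, $byte_match;
printf "chars: len=%d match=%d\n", $char_len, $char_match;
printf "stored: %s\n", join ' ', map { sprintf '%02X', ord } split //, $stored;
```

The pattern /\A.{2}\z/ fails on the four encoded bytes and succeeds on the two decoded characters, which is why regex work against the raw array elements needs the decode step first.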


        Hello, HellenCr.

        I have read Unk's post on Stack Overflow. This is not a trivial problem for me, having struggled with encoding/decoding issues for a long time. It seems the guru monks have not noticed this thread...

        Unk says:

        1. Encode manually before handing off to the tied array
        2. Figure out what the issue is with Tie::File
        No. 1 must be like mine: wrapping Tie::File with accessor methods. For No. 2, I wish some superior monks would pursue whether it is really a seek problem, as Unk says. And there could be a No. 3: using the DB_File module.
        #!/usr/bin/perl
        # DB_File example from +530.html
        use 5.010;
        use strict;
        use warnings;
        use utf8;
        use Fcntl;
        use DB_File;
        use DBM_Filter;

        my @Tied;
        my $Filename = '087.txt';
        my $db = tie @Tied, 'DB_File', $Filename, O_CREAT | O_RDWR, 0644, $DB_RECNO;
        $db->Filter_Push('encode' => 'UTF-8');
        binmode STDOUT, ':encoding(UTF-8)';

        my $i = 0;
        while (<DATA>) {
            chomp;
            $Tied[$i] = $_;
            ++$i;
        } # end while (<DATA>)

        $i = 0;
        foreach (@Tied) {
            say "$i $Tied[$i]" if /τ/; #greek letter...
            ++$i;
        } # end foreach (@Tied)

        $db->Filter_Pop();
        undef $db;     # drop the object reference before untying
        untie @Tied;   # untie the tied array, not the file name
        __DATA__
        your greek input
        The referenced page is written in Japanese. Filter_Push seems to work fine for me... but it is really Greek to me.


      Then I saw your post. Overriding STORE and FETCH seems more elegant. I tried your untested code, but it seems to produce the same warning message for UTF-8 characters.


      I don't know, I didn't test it :)

        Most probably, this is the warning (or similar):

        utf8 "\xCE" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/ line 917
            Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/ line 175
            Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/ line 210
            Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31

        Can you look into it and try to suggest the cause?
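As background only, here is a minimal sketch of how a warning with this wording can arise in general: bytes that are not valid UTF-8 are read through an :encoding(UTF-8) layer. The input string is made up, and this does not establish that the same thing happens inside Tie::File; that question is still open in this thread.

```perl
use strict;
use warnings;

# \xCE opens a two-byte UTF-8 sequence, but "a" is not a continuation byte,
# so this input is not valid UTF-8.
my $bytes = "\xCEabc\n";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Read the bytes through a decoding layer, using an in-memory filehandle.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $line = <$fh>;
close $fh;

print "warning: $_" for @warnings;
```

A reader that starts decoding in the middle of a multi-byte character would see a similar kind of invalid input, which is one way to read Unk's seek hypothesis, but that remains an assumption here.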
