IO::Compress::Gzip and unicode

by Anonymous Monk
on Mar 02, 2018 at 07:40 UTC (#1210218=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

    use strict;
    use FileHandle;
    use IO::Compress::Gzip;

    my $unicode_string = "Smiley Face: \x{263A}\n";

    writefile({
        'filename' => "/tmp/out",
        'gzip'     => 0,
        'data'     => $unicode_string,
    });

    writefile({
        'filename' => "/tmp/out.gz",
        'gzip'     => 1,
        'data'     => $unicode_string,
    });

    sub writefile {
        my($opts) = @_;

        my $fh = ($opts->{'gzip'})
            ? IO::Compress::Gzip->new(
                  FileHandle->new("> $opts->{'filename'}"),
              )
            : FileHandle->new("> $opts->{'filename'}");

        binmode($fh, ':utf8');
        print $fh $opts->{'data'};
        $fh->close;
    }
    __DATA__

The first subroutine call succeeds and produces a /tmp/out file with the expected content.

The second subroutine call fails with the message:
Wide character in IO::Compress::Gzip::write: at <program_name> line 29.

Line 29 is the 'print' statement.

Documentation suggests IO::Compress::Gzip::binmode is a no-op.

Using "Encode::decode_utf8($opts->{'data'});" doesn't work either.

The second subroutine call produces a valid /tmp/out.gz file, compressed as expected, as long as $unicode_string doesn't actually contain any Unicode characters.

How do I make this work? I'd much prefer to compress in perl rather than gzip files after writing them, as the real-world code with the issue demonstrated by this minimum-reproducible test case deals with large data volumes and performance is a concern.

2018-03-03 Athanasius removed the question text from the main code block and added paragraph tags

Replies are listed 'Best First'.
Re: IO::Compress::Gzip and unicode
by salva (Abbot) on Mar 02, 2018 at 08:49 UTC
    You can use PerlIO layers to do that.

    PerlIO::via::gzip provides on the fly data compression (and it uses IO::Compress::Gzip under the hood).

        # untested!
        sub writefile {
            my($opts) = @_;
            open my $fh, '>', $opts->{filename} or die $!;
            binmode($fh, ':via(gzip)') if $opts->{'gzip'};
            binmode($fh, ':utf8');
            print $fh $opts->{'data'};
            $fh->close;
        }
Re: IO::Compress::Gzip and unicode
by Corion (Pope) on Mar 02, 2018 at 08:35 UTC

    The easy approach would be to use Encode::encode to convert your string to octets before writing it to the file:

        use Encode qw(encode);

        my $unicode_string = "Smiley Face: \x{263A}\n";
        my $bytes = encode('UTF-8', $unicode_string);   # characters -> UTF-8 octets
        binmode $fh, ':raw';
        print {$fh} $bytes;

    But I think that the binmode ':utf8' already should do that. Maybe there is a difference between :utf8 and :encoding(UTF-8), so maybe try:

    binmode $fh, ':encoding(UTF-8)';

    in your code instead.

    But as you already looked at the documentation of IO::Compress::Gzip and it doesn't have a proper binmode implementation, you will need to do that yourself I fear.
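
To make the "convert your string to octets" step concrete, here is a minimal sketch that compares the character string with its encoded form. The variable names are only illustrative; the point is that the smiley is a single character above 0xFF (which is what triggers the "Wide character" message), while the encoded result contains only byte values.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $unicode_string = "Smiley Face: \x{263A}\n";

# Encode the character string into UTF-8 octets.
my $bytes = encode('UTF-8', $unicode_string);

# length() counts characters for the original and octets for the encoded copy.
my $char_count = length $unicode_string;
my $byte_count = length $bytes;

# A string holding any code point above 0xFF cannot be written as raw bytes.
my $string_is_wide = $unicode_string =~ /[^\x00-\xFF]/ ? 1 : 0;
my $bytes_are_wide = $bytes          =~ /[^\x00-\xFF]/ ? 1 : 0;

printf "characters: %d, octets: %d\n", $char_count, $byte_count;
printf "wide before encoding: %d, after: %d\n", $string_is_wide, $bytes_are_wide;
```

U+263A takes three octets in UTF-8, so the encoded copy is two longer than the character count.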

Re: IO::Compress::Gzip and unicode
by pmqs (Pilgrim) on Mar 02, 2018 at 23:35 UTC

    As things stand you need to explicitly encode the data to utf8. To do that you need to use encode_utf8 rather than decode_utf8.

    Change this line

    print $fh $opts->{'data'};

    to this

    print $fh Encode::encode_utf8($opts->{'data'}) ;
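
Putting that change into a complete round trip, a sketch might look like the following. The helper names write_gzip_utf8 and read_gzip_utf8 and the temporary file are assumptions made for illustration, not part of the original code; the read side uses IO::Uncompress::Gunzip and decode_utf8 only to check that the data survives.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);
use File::Temp qw(tempfile);
use IO::Compress::Gzip qw($GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: write a character string to $filename as gzipped UTF-8.
sub write_gzip_utf8 {
    my ($filename, $text) = @_;
    my $gz = IO::Compress::Gzip->new($filename)
        or die "Cannot open $filename: $GzipError";
    # Encode explicitly, since binmode on the gzip object will not do it.
    $gz->print(encode_utf8($text)) or die "Write failed: $GzipError";
    $gz->close or die "Close failed: $GzipError";
    return;
}

# Hypothetical helper: read the file back and decode it to characters.
sub read_gzip_utf8 {
    my ($filename) = @_;
    my $octets;
    gunzip($filename => \$octets) or die "gunzip failed: $GunzipError";
    return decode_utf8($octets);
}

# A temporary file stands in for /tmp/out.gz here.
my ($tmp_fh, $path) = tempfile(SUFFIX => '.gz', UNLINK => 1);
close $tmp_fh;

my $unicode_string = "Smiley Face: \x{263A}\n";
write_gzip_utf8($path, $unicode_string);
my $round_trip = read_gzip_utf8($path);

print $round_trip eq $unicode_string ? "round trip ok\n" : "round trip FAILED\n";
```

The data is compressed as it is written, so no separate gzip pass over the finished file is needed.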

Node Type: perlquestion [id://1210218]
Approved by haukex
Front-paged by haukex