
When is the utf8-flag turned on?

by uwevoelker (Pilgrim)
on Aug 20, 2004 at 19:13 UTC ( #384672=perlquestion )

uwevoelker has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I have a string (actually a complex data structure with strings in it) and encode it with Encode::encode('iso-8859-15', $string). The Encode pod says:
When you encode, the resulting utf8 flag is always off.

That's fine, because I want the utf8 flag off. But later, after some processing, the utf8 flag is on! Arrgh!

And so I ask: when does perl (v5.8.0) turn this utf8 flag on?
Other strings and constants (here: constant strings in the perl code) are stored with utf8 off. I used Devel::Peek to check this.

Thanks, Uwe

EDIT: I'm using mod_perl (v1.27).
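For readers who want to check the flag without reading Devel::Peek output, here is a minimal sketch using Encode::is_utf8. The variable names are illustrative and not taken from the original post, and the behaviour shown is what the Encode documentation describes for current releases; 5.8.0 may differ in places.

```perl
use strict;
use warnings;
use Encode ();

# A literal containing only 8-bit characters is stored as bytes (flag off).
# "\xDC" is U-umlaut in ISO-8859-15; an escape avoids source-encoding issues.
my $literal = "\xDCbel";

# decode() turns octets into a character string; for non-ASCII input
# the result carries the utf8 flag.
my $chars = Encode::decode('iso-8859-15', $literal);

# encode() turns characters back into octets; the result has the flag off.
my $octets = Encode::encode('iso-8859-15', $chars);

for my $pair ([literal => $literal], [chars => $chars], [octets => $octets]) {
    my ($name, $value) = @$pair;
    printf "%-8s utf8 flag %s\n", $name, Encode::is_utf8($value) ? 'on' : 'off';
}
```

Devel::Peek's Dump() shows the same information as the UTF8 entry in the FLAGS line.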

Replies are listed 'Best First'.
Re: When is the utf8-flag turned on?
by ysth (Canon) on Aug 20, 2004 at 20:05 UTC
    Usually only when a string is operated on with data that itself has the utf8 flag on. In perl5.8.0 (but not any later version) this can include input from a file if you are using a utf8 locale.

    Once the flag is on, does

    use Data::Dumper; $Data::Dumper::Useqq = 1; print Dumper $string;
    show any > 8bit characters? (e.g. "foo\x{1ff}bar"). If so, where does it come from? If not, upgrading perl, using a non-utf8 locale, or binmode'ing your input filehandles may help.
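As a rough illustration of "operated on with data that itself has the utf8 flag on", the sketch below concatenates a byte string with a flagged string. The variable names are made up for this example, and the exact Data::Dumper output is not shown because it depends on the Data::Dumper version.

```perl
use strict;
use warnings;
use Encode ();
use Data::Dumper;

$Data::Dumper::Useqq = 1;

# Byte string with 8-bit characters only: utf8 flag off.
my $bytes = "Gr\xFC\xDFe";

# Character string produced by decode(): utf8 flag on.
my $flagged = Encode::decode('iso-8859-1', "\xE4");

# Concatenation with a flagged operand yields a flagged result,
# even though no character in it is wider than 8 bits.
my $joined = $bytes . $flagged;

printf "joined: utf8 flag %s, %d characters\n",
    Encode::is_utf8($joined) ? 'on' : 'off', length $joined;
print Dumper($joined);
```

A flagged string with no characters above 0xFF, as here, usually points to this kind of silent upgrade rather than to genuinely wide data.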
      I'm not using a utf8 locale, nor do I use 'use utf8'.
      My strings do not contain > 8 bit characters, but they contain 8 bit characters (German "Umlaute" - ä, ö, ü, ß).
        I'm not using a utf8 locale, nor do I use 'use utf8'.
        The comment about 'use utf8' here makes me think it's just possible you are mistaken. "Using a utf8 locale" isn't a perl thing, it's an operating system thing, and several OSes have taken to setting utf8 locales as a default. Check your environment variables LANG, LC_ALL, LC_CTYPE, and LANGUAGE.

        If that's not it, you'll have to dig a little bit into what happens to this variable; at what point does it get the utf8 flag turned on?
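One way to run the environment check from inside the process itself (useful under mod_perl, where the server's environment may differ from your login shell) is sketched below. The helper name utf8_locale_vars is hypothetical, and matching on "utf8" or "utf-8" in the value is only a heuristic.

```perl
use strict;
use warnings;

# Hypothetical helper: return the locale-related variables whose value
# mentions UTF-8. Takes a hash ref so it can be fed %ENV or test data.
sub utf8_locale_vars {
    my ($env) = @_;
    my %suspect;
    for my $name (qw(LANG LC_ALL LC_CTYPE LANGUAGE)) {
        my $value = $env->{$name};
        next unless defined $value;
        $suspect{$name} = $value if $value =~ /utf-?8/i;
    }
    return \%suspect;
}

my $found = utf8_locale_vars(\%ENV);
if (%$found) {
    print "$_=$found->{$_}\n" for sort keys %$found;
}
else {
    print "no utf8 locale variables set\n";
}
```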

Re: When is the utf8-flag turned on?
by graff (Chancellor) on Aug 20, 2004 at 21:00 UTC
    I don't think you're showing us enough code to make the problem clear. I hope you understand that the proper use of the "encode()" function is as follows:
    my $octets = encode( 'iso-8859-15', $utf8string );
    In this case, it's $octets that has the utf8 flag turned off (and contains the iso-8859-15 character data); the flag on $utf8string, and the string it contains, are unaffected by this operation.
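A small sketch of that point, assuming the default behaviour of encode() with no CHECK argument. The sample data (a euro sign, which is 0xA4 in ISO-8859-15) is chosen for illustration only.

```perl
use strict;
use warnings;
use Encode ();

# Character string: U+20AC EURO SIGN followed by " 10"; utf8 flag on.
my $utf8string = Encode::decode('iso-8859-15', "\xA4 10");

# encode() returns a new octet string and leaves its argument alone.
my $octets = Encode::encode('iso-8859-15', $utf8string);

printf "source: flag %s, first char U+%04X\n",
    Encode::is_utf8($utf8string) ? 'on' : 'off', ord $utf8string;
printf "octets: flag %s, first byte 0x%02X\n",
    Encode::is_utf8($octets) ? 'on' : 'off', ord $octets;
```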
      I was using encode() correctly. I found my mistake - and it was a strange one!

      I was using encode() in a recursive function (to convert my datastructure entirely to iso-8859-15).
      sub recode {
          my ($enc, $data) = @_;
          my $ref = ref($data);
          if ($ref) {
              if ($ref eq 'ARRAY') {
                  # array ref
                  my @array = map { recode($enc, $_) } @$data;
                  return \@array;
              }
              elsif ($ref eq 'SCALAR') {
                  # scalar ref: recode the referenced value, not the
                  # reference itself (which would recurse forever)
                  my $scalar = recode($enc, $$data);
                  return \$scalar;
              }
              elsif ($ref eq 'HASH') {
                  # hash ref
                  my %hash = ();
                  while (my ($key, $value) = each %$data) {
                      $hash{recode($enc, $key)} = recode($enc, $value);
                  }
                  return \%hash;
              }
              else {
                  # object - XYZ::, ZYX::
                  if ($ref =~ /^(XYZ|ZYX)::/) {
                      my $object = bless({}, $ref);
                      while (my ($key, $value) = each %$data) {
                          $object->{recode($enc, $key)} = recode($enc, $value);
                      }
                      return $object;
                  }
                  else {
                      # message reads "recode(): $ref not supported"
                      warn "recode(): $ref nicht unterstützt";
                      return $data;
                  }
              }
          }
          else {
              # be sure to assign to a variable, otherwise
              # the UTF8 flag is not cleared
              $data = Encode::encode($enc, $data);
              return $data;
          }
      }

      The problem was the last else-block. I was using "return Encode::encode($enc, $data);" without a variable assignment, and that code did not clear the utf8 flag! The flag was cleared only when I assigned the result to a variable first.

      I don't know why this happens. I tried to reproduce this behaviour to report a bug, but I failed:
      #!/usr/bin/perl -w
      use strict;
      use Devel::Peek;
      use Encode ();

      my $str = 'Übel';
      Dump($str);

      $str = Encode::decode('iso-8859-15', $str);
      Dump($str);

      my $str2 = do_encode($str);
      Dump($str2);

      my $str3 = do_encode_with_tmp($str);
      Dump($str3);

      sub do_encode {
          my $text = shift;
          return Encode::encode('iso-8859-15', $text);
      }

      sub do_encode_with_tmp {
          my $text = shift;
          my $tmp = Encode::encode('iso-8859-15', $text);
          return $tmp;
      }
      Here both subs clear the utf8 flag correctly....
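When a flag reappears somewhere inside a nested structure, it can help to list exactly which strings carry it after recoding. The walker below is a sketch for that purpose. The name flagged_paths and the path notation are made up for this example, and it uses Scalar::Util::reftype so that blessed hashes are walked like plain ones.

```perl
use strict;
use warnings;
use Encode ();
use Scalar::Util qw(reftype);

# Hypothetical helper: return a path for every string (or hash key)
# in $data that has the utf8 flag on.
sub flagged_paths {
    my ($data, $path) = @_;
    $path = '$data' unless defined $path;

    my $type = reftype($data);    # undef for plain scalars
    if (!defined $type) {
        return (defined $data && Encode::is_utf8($data)) ? ($path) : ();
    }

    my @found;
    if ($type eq 'ARRAY') {
        for my $i (0 .. $#$data) {
            push @found, flagged_paths($data->[$i], $path . '->[' . $i . ']');
        }
    }
    elsif ($type eq 'HASH') {
        for my $key (sort keys %$data) {
            push @found, $path . ' (key ' . $key . ')' if Encode::is_utf8($key);
            push @found, flagged_paths($data->{$key}, $path . '->{' . $key . '}');
        }
    }
    elsif ($type eq 'SCALAR') {
        push @found, flagged_paths($$data, '${' . $path . '}');
    }
    return @found;    # other reference types are skipped
}

my $sample = {
    plain => 'abc',
    list  => [ 'x', Encode::decode('iso-8859-15', "\xDC") ],
};
print "$_\n" for flagged_paths($sample);
```

Calling it on the output of recode() at several points in the request should narrow down where the flag comes back.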

Re: When is the utf8-flag turned on?
by benizi (Hermit) on Aug 21, 2004 at 20:23 UTC

    v5.8.0 had some issues with its handling of UTF-8 that were fixed in 5.8.1. I would highly recommend upgrading, if possible. If not, here are some pointers to changes:

    Also in the delta is this information about updates to the Encode module:

    Significant updates on the encoding pragma functionality (tr/// and the DATA filehandle, formats).
    If a filehandle has been marked as to have an encoding, unmappable characters are detected already during input, not later (when the corrupted data is being used).

    Hope this helps

Node Type: perlquestion [id://384672]
Approved by been42
Front-paged by Courage