
Why is the "ÿ" not printing correctly here in my terminal?

Your terminal expects UTF-8. You printed chr(0xFF), which is not the UTF-8 encoding of "ÿ".

You can encode it yourself, or you can ask Perl to do it for you using the following:

use open ':std', ':encoding(UTF-8)';
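If you would rather encode it yourself, a minimal sketch using the core Encode module might look like this. The variable names are mine, and it assumes STDOUT has no encoding layer of its own (otherwise the text would get encoded twice).

```perl
use strict;
use warnings;
use Encode qw( encode );

# chr(0xFF) is one character, U+00FF, not yet encoded in any way.
my $char = chr(0xFF);

# Turn the character into the two UTF-8 bytes C3 BF.
my $bytes = encode('UTF-8', $char);

# Make sure no layer re-encodes what we already encoded.
binmode(STDOUT, ':raw');
print($bytes, "\n");
```

The pragma is usually less work, since it covers every print to STDOUT and STDERR instead of requiring an encode call at each one.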

ord() returns 255 for ÿ, a single byte. Encode thinks this is utf-8, but isn't this actually utf-16?

It's not UTF-8 (which would be C3 BF). is_utf8($string) does not indicate whether $string contains UTF-8.

It's not UTF-16 (which would be 00 FF or FF 00 depending on endianness).

Decoding a string (as use utf8; does for literals) results in a string of Unicode Code Points ("ÿ" is U+00FF).
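To see the difference between the code point and its encoded forms, here is a small sketch (my own illustration, using only the core Encode module) that dumps the bytes each encoding produces for U+00FF and then decodes the UTF-8 form back:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $char = "\x{FF}";   # One character: U+00FF.

my $utf8    = encode('UTF-8',    $char);
my $utf16be = encode('UTF-16BE', $char);
my $utf16le = encode('UTF-16LE', $char);

# %v02X prints each element of the string in hex, joined by ".".
printf("UTF-8:    %v02X\n", $utf8);      # C3.BF
printf("UTF-16BE: %v02X\n", $utf16be);   # 00.FF
printf("UTF-16LE: %v02X\n", $utf16le);   # FF.00

# Decoding the UTF-8 bytes gives back the single code point.
my $decoded = decode('UTF-8', "\xC3\xBF");
printf("Decoded:  U+%v04X (length %d)\n", $decoded, length($decoded));
# Decoded:  U+00FF (length 1)
```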

This actually looks like valid UTF-8 to me and Encode agrees. Is that correct?

That is the UTF-8 encoding of "Queensrÿche", though it is incorrect to say that is_utf8 signifies that Encode agrees.
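If you actually want to ask whether some bytes are valid UTF-8, one way is to try a strict decode. This is only a sketch: is_valid_utf8 is a hypothetical helper name of my own, not something Encode provides. It also shows that utf8::is_utf8 reports internal storage rather than content.

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK LEAVE_SRC );

# Hypothetical helper: true if $bytes decodes cleanly as UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

my $encoded = "Queensr\xC3\xBFche";   # UTF-8 encoding of the name
my $latin1  = "Queensr\xFFche";       # A lone FF byte is not valid UTF-8

printf("valid:   %d %d\n", is_valid_utf8($encoded), is_valid_utf8($latin1));
# valid:   1 0

# The internal flag is off for both, so it tells you nothing here.
printf("is_utf8: %d %d\n",
    utf8::is_utf8($encoded) ? 1 : 0,
    utf8::is_utf8($latin1)  ? 1 : 0);
# is_utf8: 0 0

# Upgrading changes the internal storage format, not the string itself.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
printf("upgraded: flag=%d same=%d\n",
    utf8::is_utf8($upgraded) ? 1 : 0,
    $upgraded eq $latin1     ? 1 : 0);
# upgraded: flag=1 same=1
```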

Text::Unaccent::PurePerl does not "unaccent" it properly. Why not?

Tools that work with text (such as regular expressions and Text::Unaccent::PurePerl) usually expect the text to be provided as strings of Unicode Code Points, not encoded using UTF-8.
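As a rough illustration of why decoding first matters, here is a sketch that uses the core Unicode::Normalize module as a stand-in for an unaccenting tool. strip_marks is a hypothetical helper of mine and is an assumption for demonstration only; it is not how Text::Unaccent::PurePerl works internally.

```perl
use strict;
use warnings;
use Encode qw( decode );
use Unicode::Normalize qw( NFD );

# Hypothetical stand-in for an unaccenting tool:
# decompose, then drop the combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

my $bytes = "Queensr\xC3\xBFche";       # UTF-8 encoded, 12 bytes
my $text  = decode('UTF-8', $bytes);    # Decoded, 11 code points

printf("%d %d\n", length($bytes), length($text));   # 12 11

# On decoded text, U+00FF decomposes to "y" plus a combining diaeresis.
print(strip_marks($text), "\n");                    # Queensryche

# On the undecoded bytes, C3 and BF are treated as two unrelated
# characters, so the result is not what you want.
print("bytes give a different result\n")
    if strip_marks($bytes) ne 'Queensryche';
```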

Is there a way to safely convert them to the same encoding?

The aforementioned

use open ':std', ':encoding(UTF-8)';

will also tell Perl to decode bytes read from file handles (and to encode text written to them).

use utf8;
use open ':std', ':encoding(UTF-8)';

use JSON::XS qw( decode_json encode_json );

my $s = "Queensrÿche";
printf("U+%v04X %s\n", $s, $s);

{
   # Uses encoding specified by "use open".
   open(my $fh, '>', 'foo.txt') or die $!;
   print($fh "$s\n");
}

{
   # Uses encoding specified by "use open".
   open(my $fh, '<', 'foo.txt') or die $!;
   chomp( my $got = <$fh> );
   printf("U+%v04X %s\n", $got, $got);
}

{
   # :raw overrides default encoding specified above
   # since encode_json already encodes using UTF-8
   open(my $fh, '>:raw', 'foo.json') or die $!;
   print($fh encode_json( { text => $s } ));
}

{
   my $json = do {
      # Similarly, decode_json expects UTF-8.
      open(my $fh, '<:raw', 'foo.json') or die $!;
      local $/;
      <$fh>
   };
   my $got = decode_json($json)->{text};
   printf("U+%v04X %s\n", $got, $got);
}

In reply to Re: The Queensrÿche Situation by ikegami
in thread The Queensrÿche Situation by Rodster001
