 
PerlMonks  

Re^2: Best Way to Get Length of UTF-8 String in Bytes?

by Jim (Curate)
on Apr 24, 2011 at 01:41 UTC (#901005)


in reply to Re: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?

Thank you, ikegami.

Here's what I had tried before posting my inquiry:

#!perl

use strict;
use warnings;
use open qw( :utf8 :std );
use utf8;

# 'China' in Simplified Chinese
#          中        国
# Unicode  U+4E2D    U+56FD
# UTF-8    E4 B8 AD  E5 9B BD

my $text = '中国';
my $length_in_characters = length $text;
print "Length of text '$text' in characters is $length_in_characters\n";

{
    use bytes;
    my $length_in_bytes = length $text;
    print "Length of text '$text' in bytes is $length_in_bytes\n";
}

{
    require Encode;
    my $bytes = Encode::encode_utf8($text);
    my $length_in_bytes = length $bytes;
    print "Length of text '$bytes' in bytes is $length_in_bytes\n";
}

And here's its output:

Length of text '中国' in characters is 2
Length of text 'ä¸­å›½' in bytes is 6
Length of text 'ä¸­å›½' in bytes is 6

(I couldn't use <code> tags here due to the Chinese characters in both the script and its output.)

Jim


Re^3: Best Way to Get Length of UTF-8 String in Bytes?
by ikegami (Pope) on Apr 24, 2011 at 03:19 UTC

    Are you trying to suggest you could use bytes? That would be incorrect. bytes does not give you UTF-8; it gives you the internal storage format of the string. That may be utf8 (similar to UTF-8) or just bytes. Here's an example of it giving the incorrect answer:

    #!perl

    use strict;
    use warnings;
    use open qw( :encoding(cp437) :std );
    use utf8;

    my $text = chr(0xC9);
    my $length_in_characters = length $text;
    print "Length of text '$text' in characters is $length_in_characters\n";

    {
        use bytes;
        my $length_in_bytes = length $text;
        print "Length of text '$text' in bytes is $length_in_bytes\n";
    }

    {
        require Encode;
        my $bytes = Encode::encode_utf8($text);
        my $length_in_bytes = length $bytes;
        print "Length of text '$bytes' in bytes is $length_in_bytes\n";
    }

    Length of text '' in characters is 1
    Length of text '' in bytes is 1
    "\x{00c3}" does not map to cp437 at a.pl line 22.
    "\x{0089}" does not map to cp437 at a.pl line 22.
    Length of text '\x{00c3}\x{0089}' in bytes is 2
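The same point can be sketched without any terminal encoding in the way. This is only a minimal illustration, assuming nothing beyond the core Encode module and the bytes pragma; the helper name utf8_byte_length is hypothetical and not part of any library. It encodes to UTF-8 explicitly and takes the length of the result, then compares that with what bytes::length reports for the internal storage.

```perl
use strict;
use warnings;
use Encode ();
require bytes;    # makes bytes::length available without enabling byte semantics

# Hypothetical helper: number of bytes $text occupies once encoded as UTF-8.
sub utf8_byte_length {
    my ($text) = @_;
    return length Encode::encode_utf8($text);
}

my $e_acute = chr(0xC9);             # one character, stored internally as one byte
my $china   = "\x{4E2D}\x{56FD}";    # two characters, both above U+00FF

# Only numbers are printed, so no output encoding layer is needed.
printf "U+00C9: utf8=%d internal=%d\n",
    utf8_byte_length($e_acute), bytes::length($e_acute);
printf "China : utf8=%d internal=%d\n",
    utf8_byte_length($china), bytes::length($china);
```

For U+00C9 the two numbers differ (2 bytes as UTF-8, 1 byte internally), while for the Chinese text they happen to agree at 6. That agreement is what makes the bytes approach look correct on the OP's sample.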
      I don't know what all that Microsoft noise was for, nor the use utf8 either for that matter, but we're all perfectly familiar with the Unicode bug, thank you very much.

      And we are also aware of how unlikely it is to be a problem for Jim given the data samples he displayed.

      % perl -CS -E 'say chr(0xe9)' | perl -CS -nE 'require bytes; say bytes::length($_); chomp; say bytes::length($_)'
      3
      2
      % perl -E '$x = "\x{e9}\x{3b1}"; require bytes; say bytes::length($x); chop $x; say bytes::length($x)'
      4
      2
      % perl -E '$x = "\N{U+E9}"; require bytes; say bytes::length($x)'
      2

      As you can plainly see, it's only your own isolated little byte constants that can switch internal representation. All you have to do is have a code point greater than 255 anywhere in the string, even once, and it stops being a byte string. You also won't have a problem if you've read in the utf8 from something whose encoding layer is set to utf8. So if he has either of those in his program, which it looks like he does, he can ignore Chicken Little.

      It won't bother him. I'll bet.

        I don't know what all that Microsoft noise was for

        My terminal uses cp437, and the garbage of encoding UTF-8 was there in the OP's output too. It just looks a bit different on my terminal ('ä¸­å›½' vs '\x{00c3}\x{0089}').

        nor the use utf8 either for that matter

        Are you suggesting I should have made irrelevant changes to the OP's code?

        And we are also aware of how unlikely it is to be a problem for Jim given the data samples he displayed.

        What do you mean unlikely? I'd say it's impossible since those characters are above U+00FF.

        But so what. He's not going to deal with only those two characters.

        I don't get it. In one breath, you say he should handle NFD. In the next, you say I should only concern myself with the characters he posted.

        I would agree, the perl implementation is documented to use UTF-8 encoding for one of the two options, and 8-bit chars for the other. It is also explained when each occurs and how they are handled during concatenation, with various options.

        Certainly it is less problematic and more maintainable to not count on any subtle details that might shift the meaning.

        Hmm, just what is the 8-bit form? If it's "whatever was read in", it might include characters encoded in multiple bytes, using some other code page. So, I would be inclined to feel safe treating the internal length in bytes as the UTF-8 length if I read in the string from a file using UTF-8 encoding, or it was a string literal in a program whose source file used utf8. I think there is also a utility function somewhere to tell you which mode a string is in.

        In fact, wouldn't the UTF-8 encoder just check that flag first and realize it's a no-op? So using it would be efficient, if you don't mind copying the string.
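The utility function alluded to above is, as far as I know, utf8::is_utf8, which is built into the core and reports whether a scalar is currently held in the upgraded (utf8) form. Here is a small sketch, assuming only core Perl plus Encode, with a %report hash that exists purely for illustration. It puts the same one-character string into both storage forms and shows that bytes::length follows the storage form while the length of the encode_utf8 result does not.

```perl
use strict;
use warnings;
use Encode ();
require bytes;    # makes bytes::length available without enabling byte semantics

my $native   = chr(0xE9);     # stored internally as a single byte
my $upgraded = chr(0xE9);
utf8::upgrade($upgraded);     # same character, now stored internally as utf8

my %report;
for my $pair ( [ native => $native ], [ upgraded => $upgraded ] ) {
    my ( $name, $str ) = @$pair;
    $report{$name} = {
        flagged  => utf8::is_utf8($str) ? 1 : 0,          # which storage form is in use
        chars    => length $str,                          # length in characters
        internal => bytes::length($str),                  # bytes of internal storage
        utf8_len => length Encode::encode_utf8($str),     # bytes when encoded as UTF-8
    };
    printf "%-8s flagged=%d chars=%d internal=%d utf8=%d\n",
        $name, @{ $report{$name} }{qw(flagged chars internal utf8_len)};
}
```

The two strings compare equal with eq and both encode to 2 bytes, yet their internal sizes are 1 and 2. Whether the encoder short-circuits on the flag is an implementation detail; the sketch only shows that the result does not depend on it.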
