Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re: length() miscounting UTF8 characters?

by Jim (Curate)
on Apr 28, 2014 at 02:50 UTC ( #1084067=note: print w/ replies, xml ) Need Help??


in reply to length() miscounting UTF8 characters?

Here's a Perl script that counts the number of bytes, code points, and graphemes in each UTF-8 encoded word. It also tallies the code points by Unicode blocks.

#!perl

use v5.14;
use strict;
use warnings;
use utf8;

use Encode qw( encode_utf8 );
use Unicode::UCD qw( charblock );

binmode STDOUT, ':encoding(UTF-8)';

while (my $word = <DATA>) {
    chomp $word;

    my $length_in_bytes       = length_in_bytes($word);
    my $length_in_code_points = length_in_code_points($word);
    my $length_in_graphemes   = length_in_graphemes($word);
    my $code_points_in_blocks = code_points_in_blocks($word);

    printf "%-12s | Bytes: %2d | Code Points: %2d | Graphemes: %2d | Blocks: %s\n",
        $word,
        $length_in_bytes,
        $length_in_code_points,
        $length_in_graphemes,
        $code_points_in_blocks;
}

exit 0;

sub length_in_bytes {
    my $word = shift;

    my $length = length encode_utf8($word);

    return $length;
}

sub length_in_code_points {
    my $word = shift;

    my $length = length $word;

    return $length;
}

sub length_in_graphemes {
    my $word = shift;

    my $length = () = $word =~ m/\X/g;

    return $length;
}

sub code_points_in_blocks {
    my $word = shift;

    my %total_code_points_by;
    my $blocks = '';

    for my $character (split m//, $word) {
        my $block = charblock(ord $character);

        $total_code_points_by{$block}++;
    }

    for my $block (sort keys %total_code_points_by) {
        my $total = $total_code_points_by{$block};

        $blocks .= sprintf "%s%s (%d)",
                   (length $blocks ? ', ' : ''), $block, $total;
    }

    return $blocks;
}

__DATA__
Š
Š­
Š­a
Š­aber
Š­ahn˙tur
Š­ak÷lkun
Š­ard˙nn
Š­arfugl
Š­arkolla
Š­arkˇngur
Š­arvarp
Š­i
Š­imargur
Š­isgenginn
Š­iskast
Š­islegur
Š­rast
Š­ri
Š­rulaus
Š­ruleysi
Š­ruor­
Š­rutˇnn
Š­stur
Š­ur
Šfa

Here's the output of the script.

Š            | Bytes:  2 | Code Points:  1 | Graphemes:  1 | Blocks: Latin-1 Supplement (1)
Š­           | Bytes:  4 | Code Points:  2 | Graphemes:  2 | Blocks: Latin-1 Supplement (2)
Š­a          | Bytes:  5 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
Š­aber       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
Š­ahn˙tur    | Bytes: 12 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
Š­ak÷lkun    | Bytes: 12 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
Š­ard˙nn     | Bytes: 11 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
Š­arfugl     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
Š­arkolla    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
Š­arkˇngur   | Bytes: 13 | Code Points: 10 | Graphemes: 10 | Blocks: Basic Latin (7), Latin-1 Supplement (3)
Š­arvarp     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
Š­i          | Bytes:  5 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
Š­imargur    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
Š­isgenginn  | Bytes: 13 | Code Points: 11 | Graphemes: 11 | Blocks: Basic Latin (9), Latin-1 Supplement (2)
Š­iskast     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
Š­islegur    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
Š­rast       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
Š­ri         | Bytes:  6 | Code Points:  4 | Graphemes:  4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
Š­rulaus     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
Š­ruleysi    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
Š­ruor­      | Bytes: 10 | Code Points:  7 | Graphemes:  7 | Blocks: Basic Latin (4), Latin-1 Supplement (3)
Š­rutˇnn     | Bytes: 11 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
Š­stur       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
Š­ur         | Bytes:  6 | Code Points:  4 | Graphemes:  4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
Šfa          | Bytes:  4 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (2), Latin-1 Supplement (1)

UPDATE:  If you add these three words to the end of the list in the __DATA__ block of the the UTF-8 encoded Perl script…

한국말
pi˝ˇn
piñón

…then the report will include these three lines…

한국말          | Bytes:  9 | Code Points:  3 | Graphemes:  3 | Blocks: Hangul Syllables (3)
pi˝ˇn        | Bytes:  7 | Code Points:  5 | Graphemes:  5 | Blocks: Basic Latin (3), Latin-1 Supplement (2)
piñón      | Bytes:  9 | Code Points:  7 | Graphemes:  5 | Blocks: Basic Latin (5), Combining Diacritical Marks (2)


Comment on Re: length() miscounting UTF8 characters?
Download Code
Re^2: length() miscounting UTF8 characters?
by AppleFritter (Chaplain) on Apr 28, 2014 at 09:37 UTC
    Wow, I don't know what to say, that script is extremely helpful and should come in very handy! Thanks a bunch, I really appreciate the effort you went to there. I never expected this much useful feedback when I turned to PM at a friend's suggestion. So again, thanks to you and everyone else, I'm really impressed.

      Bear in mind that the script is written using very didactic code. It's longer and more verbose than the same script would be if its main purpose wasn't to teach a lesson.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1084067]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (17)
As of 2014-07-29 14:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (217 votes), past polls