How to print the actual bytes of UTF-8 characters ?

by RCH (Acolyte)
on Feb 06, 2014 at 14:41 UTC
RCH has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks

I am trying to make myself a table of Unicode (UTF-8) characters with their decimal, hex, binary and byte equivalents.

Here is what a bit of it should look like:

A.  Ð        Ñ        Ò        Ó        ...
B.  208      209      210      211      ...
C.  d0       d1       d2       d3       ...
D.  11010000 11010001 11010010 11010011 ...
E.  c3 90    c3 91    c3 92    c3 93    ...
F.  11000011 11000011 11000011 11000011 ...
G.  10010000 10010001 10010010 10010011 ...

I know how to make rows A, B, C and D (see footnote 1 below).
How do I generate rows E, F and G in Perl?

RichardH

Footnote 1: using sprintf in a loop:

    A: "%s", chr($n);
    B: "%d", $n;
    C: "%x", $n;
    D: "%b", $n;
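
For reference, the footnote's approach for rows B, C and D could be sketched as follows. This is only an illustration, assuming the code points 208 to 211 from the sample table; `table_rows` is a hypothetical helper name that does not appear in the thread.

```perl
use strict;
use warnings;

# Build rows B, C and D for a list of code points.
# (table_rows is a hypothetical helper name, not from the thread.)
sub table_rows {
    my @codes = @_;
    return (
        join(' ', map { sprintf '%d',   $_ } @codes),    # row B: decimal
        join(' ', map { sprintf '%x',   $_ } @codes),    # row C: hex
        join(' ', map { sprintf '%08b', $_ } @codes),    # row D: binary
    );
}

binmode STDOUT, ':encoding(UTF-8)';                   # print row A as UTF-8
print join(' ', map { chr($_) } 208 .. 211), "\n";    # row A: the characters
print "$_\n" for table_rows(208 .. 211);
```

Rows E, F and G need the UTF-8 bytes rather than the code point, which is what the replies below address.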

Re: How to print the actual bytes of UTF-8 characters ?
by choroba (Abbot) on Feb 06, 2014 at 15:01 UTC
    Using a variable as a file to handle the encodings:
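        #!/usr/bin/perl
        use warnings;
        use strict;
        use utf8;

        # Sample characters U+00D0 .. U+00D3, as in the question's table.
        for my $char (qw( Ð Ñ Ò Ó )) {
            my $n = ord $char;
            # Open the scalar $bytes as an in-memory file with a UTF-8 output layer.
            open my $BYTE, '>:utf8', \ my $bytes;
            print {$BYTE} $char;
            printf "%s\t%s\t%x\t%b\t%x %x\t %b %b\n",
                $char, $n, $n, $n, (unpack('CC', $bytes)) x 2;
        }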

    The pivoting of the table left as an exercise to the reader.

    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Magic!
      Could you explain how it works?
      I had tried a simple-minded unpack('C', $char), but it gave me the wrong answer.
      There are two things that I don't understand in your unpack solution:
      (1) what are the contents of $bytes, and
      (2) what is the function of the backslash "\" in

          open my $BYTE, '>:utf8', \ my $bytes;

      ?

        \ is the reference operator. Instead of using a file, I open the variable for output (see FILEHANDLE, MODE, REFERENCE in open). I set its encoding to UTF-8 and print the character to it. $bytes now contains the two bytes of the character as encoded in UTF-8.
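
The in-memory file trick described here can be tried in isolation. The sketch below is only an illustration of that idea, not choroba's exact code: `utf8_bytes_via_handle` is a hypothetical helper name, and the explicit `close` is an added precaution so that everything is written before the scalar is read.

```perl
use strict;
use warnings;

# utf8_bytes_via_handle is a hypothetical helper name.
sub utf8_bytes_via_handle {
    my ($char) = @_;
    my $bytes = '';
    # Open the scalar as an output "file" with a UTF-8 encoding layer.
    open my $fh, '>:utf8', \$bytes or die "Cannot open in-memory file: $!";
    print {$fh} $char;
    close $fh;                     # make sure everything reached $bytes
    return unpack 'C*', $bytes;    # one number per byte
}

my @handle_bytes = utf8_bytes_via_handle(chr 0xD0);
print join(' ', map { sprintf '%02x', $_ } @handle_bytes), "\n";
```

By contrast, `unpack('C', $char)` applied directly to the character for code point 208 yields 208 again, which is probably the "wrong answer" mentioned above: it looks at the character value, not at its UTF-8 encoding.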
        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Why

          open my $BYTE, '>:utf8', \my $bytes; print {$BYTE} $char;

      when you can simply write

          utf8::encode(my $bytes = $char);

      instead?
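
A minimal sketch of the `utf8::encode` alternative, wrapped in a hypothetical helper named `utf8_bytes`. `utf8::encode` works in place, so the character is copied first and the copy is encoded; the function is available without loading any module.

```perl
use strict;
use warnings;

# utf8_bytes is a hypothetical helper name.
sub utf8_bytes {
    my ($char) = @_;
    utf8::encode(my $bytes = $char);    # copy, then encode the copy in place
    return unpack 'C*', $bytes;         # one number per byte
}

# U+20AC (euro sign) needs three bytes in UTF-8.
print join(' ', map { sprintf '%02x', $_ } utf8_bytes(chr 0x20AC)), "\n";
```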
Re: How to print the actual bytes of UTF-8 characters ?
by atcroft (Monsignor) on Feb 06, 2014 at 16:19 UTC

    For simplicity, I would deal with the integer values. (The following was adapted from a one-liner, and made heavy use of Tom Christiansen's Unicode articles on perl.com.)

    (I don't work with Unicode often, but there was an error thrown when I tried using 0xD800 as a character. I seem to remember there are some ranges that may not be defined, so adding anything above 0xD799 and formatting changes are left as exercises for the reader.)

    Hope that helps.

    Update: 2014-02-06

    This article shed light on why 0xD800-0xDFFF are considered invalid. The code above was updated to skip said range.

    Update: 2014-02-06

    Remove left-over debug print(); add eval() around print to catch invalid Unicode points.
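
Since the code for this reply is not reproduced here, the following is only a sketch of the idea it describes (working from integer code points and skipping the surrogate range 0xD800-0xDFFF), not atcroft's actual script. `is_surrogate` and `utf8_hex` are hypothetical helper names.

```perl
use strict;
use warnings;
use Encode qw( encode );

# is_surrogate and utf8_hex are hypothetical helper names.
sub is_surrogate {
    my ($cp) = @_;
    return $cp >= 0xD800 && $cp <= 0xDFFF;
}

sub utf8_hex {
    my ($cp) = @_;
    return if is_surrogate($cp);    # surrogates have no valid UTF-8 form
    my @octets = unpack 'C*', encode('UTF-8', chr($cp));
    return join ' ', map { sprintf '%02x', $_ } @octets;
}

# Walk across the lower edge of the surrogate range, plus the first code point after it.
for my $cp (0xD7FE .. 0xD801, 0xE000) {
    my $hex = utf8_hex($cp);
    printf "U+%04X  %s\n", $cp, defined $hex ? $hex : '(surrogate, skipped)';
}
```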

Re: How to print the actual bytes of UTF-8 characters ?
by andal (Friar) on Feb 07, 2014 at 08:37 UTC

    Your question is somewhat confusing. What is the decimal value of a "Unicode (UTF-8) character"? The Unicode standard assigns a number (a code point, in the range 0 to 0x10FFFF) to each character. The UTF-8 encoding represents that number as a sequence of one to four bytes, in a way that is backward compatible with ASCII-only text. So, do you want the decimal value from the Unicode standard, or the decimal value of the sequence of bytes from the UTF-8 encoding? The latter does not really make sense, since the sequence can be several bytes long and you would then have to decide how to build a "decimal" from it.
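
To make the distinction between code point and bytes concrete, the UTF-8 bytes can be derived from the code point by hand with shifts and masks. This is a sketch for illustration only: `manual_utf8` is a hypothetical helper, and it does not reject surrogates or values above 0x10FFFF.

```perl
use strict;
use warnings;

# manual_utf8 is a hypothetical helper: encode one code point by hand.
sub manual_utf8 {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                        # 0xxxxxxx
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp < 0x800;       # 110xxxxx 10xxxxxx
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;     # 1110xxxx 10xxxxxx 10xxxxxx
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));                      # 11110xxx and three 10xxxxxx
}

# 208 is 11010000: the top two bits go into the first byte, the low six into the second.
printf "%d -> %s\n", 208, join ' ', map { sprintf '%08b', $_ } manual_utf8(208);
```

For 208 this gives 11000011 10010000, that is c3 90, matching rows E, F and G of the table in the question.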

    Assuming that you want the codes assigned by the Unicode standard and the bytes that represent those codes as a UTF-8 sequence, and assuming that you start from the codes, the following can be used:

        use strict;
        use warnings;
        use utf8;
        use Encode;

        my $code = 208;                              # the Unicode code point, in decimal
        my $char = chr($code);                       # convert to an internal Perl character
        my $utf8_octets = encode("UTF-8", $char);    # get the sequence of bytes in UTF-8

        printf("Decimal: %d, Hex: %x, Bits: %b\n", $code, $code, $code);
        print "UTF-8 hex: ", unpack("H*", $utf8_octets), "\n";
        print "UTF-8 bits: ", unpack("B*", $utf8_octets), "\n";

    This does the same as what choroba offered, just in a different way, without file handles and redirection.

    If you want to go from characters to codes, then you get $code via ord($char).
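
The opposite direction, from UTF-8 bytes back to code points, might look like the sketch below. It assumes the input is a string of valid UTF-8 octets; `code_points_from_utf8` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw( decode );

# code_points_from_utf8 is a hypothetical helper name.
sub code_points_from_utf8 {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);       # bytes -> Perl characters
    return map { ord($_) } split //, $chars;    # characters -> code points
}

# The bytes c3 90 c3 91 are the first two columns of row E in the question.
print join(' ', code_points_from_utf8("\xC3\x90\xC3\x91")), "\n";
```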

Re: How to print the actual bytes of UTF-8 characters ?
by pajout (Curate) on Feb 07, 2014 at 13:09 UTC
    I think you need something like this:
        #!/usr/bin/perl
        use utf8;

        my $str = ' ';
        print $str."\n";
        foreach my $ch (split('', $str)) {
            print ord($ch)."\n";
        }

        # From here on, string operations work on bytes instead of characters.
        use bytes;
        print "bytes\n";
        foreach my $ch (split('', $str)) {
            printf("%x %b\n", ord($ch), ord($ch));
        }
Re: How to print the actual bytes of UTF-8 characters ?
by Jim (Curate) on Feb 07, 2014 at 19:06 UTC

    I've always found unpack() and bit manipulation confusing. Here's my variation on the theme that uses ord() and sprintf() instead of unpack(). This script takes advantage of the fact that Unicode::UCD::charinfo() returns undef for unassigned code points and non-characters.

    Jim

    Update:  Here's a revised version of the script that handles surrogate code points more appropriately. And for comparison, I've used unpack('C*', ...). ☺

        #!perl

        use strict;
        use warnings;
        use v5.12;

        use Encode qw( encode_utf8 );
        use English qw( -no_match_vars );
        use Unicode::UCD qw( charinfo );

        binmode STDOUT, ':encoding(UTF-8)';

        # Include a Unicode byte order mark in the output...
        print "\x{FEFF}";

        local $OUTPUT_AUTOFLUSH = 1;
        local $OUTPUT_RECORD_SEPARATOR = "\n";
        local $OUTPUT_FIELD_SEPARATOR = "\t";

        CODE:
        for my $code (0x000000 .. 0x10FFFF) {

            # Look up the code point in the Unicode Character Database...
            my $charinfo = charinfo($code);

            # Skip unassigned code points and non-characters...
            next CODE unless defined $charinfo;

            my $codepoint = sprintf 'U+%06X', $code;
            my $character = chr $code;
            my $name      = $charinfo->{'name'};
            my $category  = $charinfo->{'category'};
            my $block     = $charinfo->{'block'};
            my $script    = $charinfo->{'script'};

            my @utf8_octets = unpack 'C*', encode_utf8($character);

            my $utf8_hex_string
                = join ' ', map { sprintf '%02X', $ARG } @utf8_octets;
            my $utf8_bin_string
                = join ' ', map { sprintf '%08b', $ARG } @utf8_octets;

            # Don't try to print unprintable or private use characters...
            if ($category =~ m/^C[cfos]$/) {
                $character = '';

                # Don't falsely represent surrogates as valid UTF-8...
                if ($category eq 'Cs') {
                    $utf8_hex_string = $utf8_bin_string = '';
                }
            }

            print $character,
                  $code,
                  $codepoint,
                  $utf8_hex_string,
                  $utf8_bin_string,
                  $name,
                  $category,
                  $block,
                  $script;
        }

        exit 0;

    Another update:  I removed this…

        # Don't complain about surrogates...
        no warnings qw( surrogate );

    …from the script because I realized it's not doing anything. I'm already skipping trying to print surrogates later in the script, so suppressing warnings about them isn't necessary.

Re: How to print the actual bytes of UTF-8 characters ?
by ikegami (Pope) on Feb 07, 2014 at 21:06 UTC

    Use builtin utf8::encode or core Encode::encode_utf8 to get the UTF-8 encoding.

        use utf8;                       # Source code is encoded using UTF-8.
        use open ':std', ':locale';     # Decode inputs and encode outputs.

        use strict;
        use warnings;
        use feature qw( say );

        my @chars;
        # Sample characters U+00D0 .. U+00D3, as in the question's table.
        for my $char (qw( Ð Ñ Ò Ó )) {
            my $cp = ord($char);        # Or unpack 'C'

            my $utf8 = $char;
            utf8::encode($utf8);

            my @utf8 = unpack('C*', $utf8);

            push @chars, [ $char, $cp, $utf8, @utf8 ];
        }

    Then it's just a question of displaying correctly.

        my $last = 0;
        for (@chars) {
            $last = $#$_ if $last < $#$_;
        }

        say join ' ', map { sprintf '%-8s', $_->[0] } @chars;
        say join ' ', map { sprintf '%-8d', $_->[1] } @chars;
        say join ' ', map { sprintf '%-8s', sprintf '%02x', $_->[1] } @chars;
        say join ' ', map { sprintf '%08b', $_->[1] } @chars;
        say join ' ', map { sprintf '%-8s', sprintf '%*v02x', ' ', $_->[2] } @chars;
        for my $i (3..$last) {
            say join ' ', map { defined($_->[$i]) ? sprintf '%08b', $_->[$i] : (' 'x8) } @chars;
        }

    Notes:

    • The binary of the code point could take up to 21 characters, but only 8 are available.
    • The hex of the UTF-8 bytes could take up to 11 chars, but only 8 are available.
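
One possible way around the fixed width of 8 noted above is to size the columns from the widest cell instead. This is a sketch of that idea, not part of ikegami's code; `format_table` is a hypothetical helper that takes rows as array references of already formatted cells.

```perl
use strict;
use warnings;

# format_table is a hypothetical helper: pad every cell to the widest cell.
sub format_table {
    my @rows = @_;

    # Find the widest cell in the whole table.
    my $width = 0;
    for my $row (@rows) {
        for my $cell (@$row) {
            $width = length($cell) if length($cell) > $width;
        }
    }

    # Left-justify each cell to that width and join the cells with one space.
    return map { join ' ', map { sprintf '%-*s', $width, $_ } @$_ } @rows;
}

print "$_\n" for format_table(
    [ 'd0',          'd1'          ],
    [ '11010000',    '11010001'    ],
    [ 'c3 90',       'c3 91'       ],
    [ 'f0 9f 98 80', 'f0 9f 98 81' ],    # a four-byte sequence needs 11 characters
);
```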
