Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re^3: Unicode: Perl5 equivalent to Perl6's @string.graphemes?

by Jim (Curate)
on Nov 13, 2010 at 05:48 UTC ( [id://871208]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
in thread Unicode: Perl5 equivalent to Perl6's @string.graphemes?

Perl script to display the contents of the UTF-8 text file named DriedMangos.txt as a list of Unicode code points and character names:

#!perl use strict; use warnings; use autodie; use Unicode::UCD qw( charinfo ); open my $input_fh, '<:encoding(UTF-8)', 'DriedMangos.txt'; while (my $line = <$input_fh>) { chomp $line; while ($line =~ m/(.)/g) { my $character = $1; my $codepoint = ord $character; my $charinfo = charinfo($codepoint); my $code = "U+$charinfo->{'code'}"; my $name = $charinfo->{'name'}; print "$code $name\n"; } print "\n"; } close $input_fh;

The output of the script:

U+0064 LATIN SMALL LETTER D U+0072 LATIN SMALL LETTER R U+0069 LATIN SMALL LETTER I U+0065 LATIN SMALL LETTER E U+0064 LATIN SMALL LETTER D U+0020 SPACE U+006D LATIN SMALL LETTER M U+0061 LATIN SMALL LETTER A U+006E LATIN SMALL LETTER N U+0067 LATIN SMALL LETTER G U+006F LATIN SMALL LETTER O U+0073 LATIN SMALL LETTER S U+006D LATIN SMALL LETTER M U+0061 LATIN SMALL LETTER A U+006E LATIN SMALL LETTER N U+0067 LATIN SMALL LETTER G U+0075 LATIN SMALL LETTER U U+0065 LATIN SMALL LETTER E U+0073 LATIN SMALL LETTER S U+0020 SPACE U+0073 LATIN SMALL LETTER S U+0065 LATIN SMALL LETTER E U+0301 COMBINING ACUTE ACCENT U+0063 LATIN SMALL LETTER C U+0068 LATIN SMALL LETTER H U+0065 LATIN SMALL LETTER E U+0301 COMBINING ACUTE ACCENT U+0065 LATIN SMALL LETTER E U+0073 LATIN SMALL LETTER S U+8292 CJK UNIFIED IDEOGRAPH-8292 U+679C CJK UNIFIED IDEOGRAPH-679C U+5E79 CJK UNIFIED IDEOGRAPH-5E79 U+0064 LATIN SMALL LETTER D U+006F LATIN SMALL LETTER O U+0072 LATIN SMALL LETTER R U+0061 LATIN SMALL LETTER A U+0069 LATIN SMALL LETTER I U+0064 LATIN SMALL LETTER D U+006F LATIN SMALL LETTER O U+0020 SPACE U+006D LATIN SMALL LETTER M U+0061 LATIN SMALL LETTER A U+006E LATIN SMALL LETTER N U+0067 LATIN SMALL LETTER G U+006F LATIN SMALL LETTER O U+0304 COMBINING MACRON U+0073 LATIN SMALL LETTER S U+0075 LATIN SMALL LETTER U U+30C9 KATAKANA LETTER DO U+30E9 KATAKANA LETTER RA U+30A4 KATAKANA LETTER I U+30C9 KATAKANA LETTER DO U+30DE KATAKANA LETTER MA U+30F3 KATAKANA LETTER N U+30B4 KATAKANA LETTER GO U+30B9 KATAKANA LETTER SU U+30C8 KATAKANA LETTER TO U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK U+30E9 KATAKANA LETTER RA U+30A4 KATAKANA LETTER I U+30C8 KATAKANA LETTER TO U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK U+30DE KATAKANA LETTER MA U+30F3 KATAKANA LETTER N U+30B3 KATAKANA LETTER KO U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK U+30B9 KATAKANA LETTER SU U+30C8 KATAKANA LETTER TO U+0022 QUOTATION MARK U+30E9 KATAKANA LETTER RA U+30A4 KATAKANA LETTER I U+30C8 KATAKANA LETTER TO U+0022 QUOTATION MARK U+30DE KATAKANA LETTER MA U+30F3 KATAKANA LETTER N U+30B3 KATAKANA LETTER KO U+0022 QUOTATION MARK U+30B9 KATAKANA LETTER SU

The Latin characters with diacritics are in Unicode Normalization Form D (NFD). The katakana characters on the fifth line are in Unicode Normalization Form C (NFC). The same katakana characters on the sixth line are in NFD.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://871208]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (4)
As of 2024-04-24 06:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found