#!perl
use strict;
use warnings;
use autodie;
use Unicode::UCD qw( charinfo );

# Decode the input as UTF-8 so each match below is one code point, not one byte.
open my $input_fh, '<:encoding(UTF-8)', 'DriedMangos.txt';

while (my $line = <$input_fh>) {
    chomp $line;

    # Visit every code point in the line, including combining marks.
    while ($line =~ m/(.)/g) {
        my $character = $1;
        my $codepoint = ord $character;
        my $charinfo  = charinfo($codepoint);
        my $code      = "U+$charinfo->{'code'}";
        my $name      = $charinfo->{'name'};
        print "$code $name\n";
    }

    # Separate the output for each input line with a blank line.
    print "\n";
}

close $input_fh;
The output of the script:
U+0064 LATIN SMALL LETTER D
U+0072 LATIN SMALL LETTER R
U+0069 LATIN SMALL LETTER I
U+0065 LATIN SMALL LETTER E
U+0064 LATIN SMALL LETTER D
U+0020 SPACE
U+006D LATIN SMALL LETTER M
U+0061 LATIN SMALL LETTER A
U+006E LATIN SMALL LETTER N
U+0067 LATIN SMALL LETTER G
U+006F LATIN SMALL LETTER O
U+0073 LATIN SMALL LETTER S

U+006D LATIN SMALL LETTER M
U+0061 LATIN SMALL LETTER A
U+006E LATIN SMALL LETTER N
U+0067 LATIN SMALL LETTER G
U+0075 LATIN SMALL LETTER U
U+0065 LATIN SMALL LETTER E
U+0073 LATIN SMALL LETTER S
U+0020 SPACE
U+0073 LATIN SMALL LETTER S
U+0065 LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT
U+0063 LATIN SMALL LETTER C
U+0068 LATIN SMALL LETTER H
U+0065 LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT
U+0065 LATIN SMALL LETTER E
U+0073 LATIN SMALL LETTER S

U+8292 CJK UNIFIED IDEOGRAPH-8292
U+679C CJK UNIFIED IDEOGRAPH-679C
U+5E79 CJK UNIFIED IDEOGRAPH-5E79

U+0064 LATIN SMALL LETTER D
U+006F LATIN SMALL LETTER O
U+0072 LATIN SMALL LETTER R
U+0061 LATIN SMALL LETTER A
U+0069 LATIN SMALL LETTER I
U+0064 LATIN SMALL LETTER D
U+006F LATIN SMALL LETTER O
U+0020 SPACE
U+006D LATIN SMALL LETTER M
U+0061 LATIN SMALL LETTER A
U+006E LATIN SMALL LETTER N
U+0067 LATIN SMALL LETTER G
U+006F LATIN SMALL LETTER O
U+0304 COMBINING MACRON
U+0073 LATIN SMALL LETTER S
U+0075 LATIN SMALL LETTER U

U+30C9 KATAKANA LETTER DO
U+30E9 KATAKANA LETTER RA
U+30A4 KATAKANA LETTER I
U+30C9 KATAKANA LETTER DO
U+30DE KATAKANA LETTER MA
U+30F3 KATAKANA LETTER N
U+30B4 KATAKANA LETTER GO
U+30B9 KATAKANA LETTER SU

U+30C8 KATAKANA LETTER TO
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+30E9 KATAKANA LETTER RA
U+30A4 KATAKANA LETTER I
U+30C8 KATAKANA LETTER TO
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+30DE KATAKANA LETTER MA
U+30F3 KATAKANA LETTER N
U+30B3 KATAKANA LETTER KO
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+30B9 KATAKANA LETTER SU

U+30C8 KATAKANA LETTER TO
U+0022 QUOTATION MARK
U+30E9 KATAKANA LETTER RA
U+30A4 KATAKANA LETTER I
U+30C8 KATAKANA LETTER TO
U+0022 QUOTATION MARK
U+30DE KATAKANA LETTER MA
U+30F3 KATAKANA LETTER N
U+30B3 KATAKANA LETTER KO
U+0022 QUOTATION MARK
U+30B9 KATAKANA LETTER SU
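The Latin characters with diacritics are in Unicode Normalization Form D (NFD): each accented letter appears as a base letter followed by a combining mark, for example U+0065 followed by U+0301 COMBINING ACUTE ACCENT. The katakana characters on the fifth line of the input are in Unicode Normalization Form C (NFC), so the voiced characters are single precomposed code points such as U+30C9 KATAKANA LETTER DO. The same katakana characters on the sixth line are in NFD, so each voiced character is a base letter followed by U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK. The seventh line appears to use U+0022 QUOTATION MARK in place of the combining mark, so it is not canonically equivalent to either form.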
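The relationship between the two normalization forms can be checked directly with the core Unicode::Normalize module. The following is a minimal sketch, not part of the original script; the helper name codepoints and the variable names are assumptions made for illustration, and the strings are written with escapes so that the example does not depend on the input file.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

# Hypothetical helper: list a string's code points as "U+XXXX" labels.
sub codepoints {
    my ($string) = @_;
    return map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# "e" followed by COMBINING ACUTE ACCENT, as in the NFD output above.
my $e_acute_nfd = "e\x{0301}";

# The katakana word in NFC (fifth line) and in NFD (sixth line).
my $katakana_nfc = "\x{30C9}\x{30E9}\x{30A4}\x{30C9}\x{30DE}\x{30F3}\x{30B4}\x{30B9}";
my $katakana_nfd = "\x{30C8}\x{3099}\x{30E9}\x{30A4}\x{30C8}\x{3099}"
                 . "\x{30DE}\x{30F3}\x{30B3}\x{3099}\x{30B9}";

# Composing the decomposed "e" yields the single precomposed code point.
print join(' ', codepoints(NFC($e_acute_nfd))), "\n";   # U+00E9

# Decomposing KATAKANA LETTER DO yields TO plus the voiced sound mark.
print join(' ', codepoints(NFD("\x{30C9}"))), "\n";     # U+30C8 U+3099

# The raw strings differ, but they are equal once both are normalized.
print "raw strings equal: ",
    ($katakana_nfc eq $katakana_nfd ? "yes" : "no"), "\n";            # no
print "equal after NFC:   ",
    (NFC($katakana_nfc) eq NFC($katakana_nfd) ? "yes" : "no"), "\n";  # yes
```

Normalizing both strings to the same form before comparing them is the usual way to treat canonically equivalent text as equal; whether NFC or NFD is the better target depends on the application.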