Transform ASCII into UniCode

Perlian has asked for the wisdom of the Perl Monks concerning the following question:

Hi friends, as you may know, Unicode has the block »Mathematical Alphanumeric Symbols« (U+1D400..U+1D7FF), which contains styled letters and digits that look like ordinary characters from the Latin alphabet, only in bold, italic and similar styles. I tried to use a simple transliteration to turn some normal text into "bold" Unicode text and, naive as I am, did this:

my $CharSet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; # ASCII
my $BoldSet = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗'; # UniCode bold

my $Source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
my $Target = $Source;
$Target =~ tr/$CharSet/$BoldSet/;

print "$Source\n$Target\n";

To my surprise, the output was this:

The quick brown fox jumps over the lazy dog 1234567890 times.
Toe quick bdown fox jumps oved toe llzy dog 1234567890 times.

There is no trace of bold Unicode characters, and some characters are garbled instead. Does "tr" not work correctly with Unicode? I have »use utf8::all;« in my program and I am using this Perl version:

This is perl 5, version 26, subversion 3 (v5.26.3) built for x86_64-linux-thread-multi
(with 51 registered patches, see perl -V for more detail)

Thank you very much in advance for your help. Best regards from Charleston (WV), Perlian


Replies are listed 'Best First'.
Re: Transform ASCII into UniCode by choroba (Cardinal) on Mar 23, 2021 at 08:02 UTC
If you want to use `tr` with dynamic strings (which is NOT the case here), you need to use string eval. Be sure to only use it for validated strings, never a random user input!

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;
use open OUT => ':encoding(UTF-8)', ':std';

my $charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
my $boldset = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗';

my $source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
my $target = $source;
eval "\$target =~ tr/$charset/$boldset/";

say for $source, $target;

`map{substr$_->[0],$_->[1]||0,1}[\\||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
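For sets that really are only known at run time, one way to avoid string eval altogether is a lookup hash combined with s///, which does interpolate variables. The following is only a minimal sketch under the assumption that both sets have the same length; `make_translit` is a hypothetical helper name, not something used elsewhere in this thread.

```perl
use strict;
use warnings;
use utf8;                                     # this source contains literal non-ASCII characters
use open OUT => ':encoding(UTF-8)', ':std';   # encode output as UTF-8

# Hypothetical helper: returns a closure that maps each character of
# $from to the character at the same position in $to.
sub make_translit {
    my ($from, $to) = @_;
    die "sets differ in length\n" unless length($from) == length($to);

    my %map;
    @map{ split //, $from } = split //, $to;

    return sub {
        my ($text) = @_;
        # /s lets . match newlines too; characters without a mapping pass through
        return $text =~ s/(.)/exists $map{$1} ? $map{$1} : $1/gser;
    };
}

my $bold = make_translit(
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
    '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗',
);

print $bold->('The quick brown fox jumps over the lazy dog 1234567890 times.'), "\n";
```

Because nothing is compiled from the input, delimiters and backslashes in either set need no escaping. The price is that tr-style ranges such as a-z are not expanded; every character has to be listed.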
Re^2: Transform ASCII into UniCode (escape_metas) by LanX (Saint) on Mar 23, 2021 at 16:43 UTC
> Be sure to only use it for validated strings, never a random user input!

Here is a generic routine to escape only selected meta-characters. Escaping any / (or other delimiter) in the input should make it safe to apply `eval "\$target =~ tr/$charset/$boldset/";`

use v5.12;
use warnings;
use Data::Dump qw/pp dd/;
use Test::More;

sub escape_metas {
    my ( $meta,$e ) = @_ ;
    $e //= '\\';        # default backslash
    my $ee ="\Q$e";     # don't mess my regex
    s[ (?| $ee($ee)     # ignore double escapes
         | $ee($meta)   # keep single escapes
         | ($meta)      # escape meta
       )
     ]
     [$e$1]xgr;
}

my $e = '\\';   # escape code
my $m = '/';    # to be escaped

for ("$m", "$e$e$m", "$e$e$e$e$m" ) {
    my $got = escape_metas($m,$e);
    is( $got, "$e$_" , "escaping $_ -> $got");
}

for ("$e$m", "$e$e$e$m" ) {
    my $got = escape_metas($m,$e);
    is( $got, $_ , "ignoring $_ eq $got");
}

done_testing;

C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/escapism.pl
ok 1 - escaping / -> \/
ok 2 - escaping \\/ -> \\\/
ok 3 - escaping \\\\/ -> \\\\\/
ok 4 - ignoring \/ eq \/
ok 5 - ignoring \\\/ eq \\\/
1..5

Please tell me if I missed a case; I tried to write it as generically as possible.

EDIT: More or better tests are welcome too. =)

Cheers Rolf
(addicted to the Perl Programming Language :)
Re^3: Transform ASCII into UniCode (escape_metas) by choroba (Cardinal) on Mar 23, 2021 at 17:24 UTC
I'm probably too busy today to understand. We wanted to escape the strings so they can be used in a transliteration, right? Why not test it directly, then?

sub use_it {
    my ($string, $search, $replace) = @_;
    my ($s, $r);
    $s = escape_metas('/', '\\') for $search;
    $r = escape_metas('/', '\\') for $replace;
    return eval "\$string =~ tr/$s/$r/r"
}

sub cheat {
    my ($string, $search, $replace) = @_;
    return eval "\$string =~ tr|\Q$search\E|\Q$replace\E|r"
}

sub simulate {
    my ($string, $search, $replace) = @_;
    my $result = $string;
    for my $i (0 .. length($search) - 1) {
        my $from = substr $search, $i, 1;
        my $to = substr $replace, $i, 1;
        $result =~ s/\Q$from/$to/g;
    }
    return $result
}

for my $case (
    # String      search  replace expect
    ['a/b'     => 'a/b',  'xyz',  'xyz'],
    ['a\\b'    => 'a\\b', 'xyz',  'xyz'],
    ['a/b'     => '\\/',  'xy',   'ayb'],
    ['a\\/b'   => '\\/',  'xy',   'axyb'],
    ['a/\\b'   => '\\/',  'xy',   'ayxb'],
    ['a\\\\b'  => '\\/',  'xy',   'axxb'],
    ['a\\\\/b' => '\\/',  'xy',   'axxyb'],
) {
    is simulate(@$case), $case->[-1], 'simulate';
    is cheat(@$case), simulate(@$case), 'cheat';
    is use_it(@$case), simulate(@$case), 'use';
}

I'm not sure I got the "expect" right, but both "simulate" and "cheat" give the same results. "use", on the other hand, doesn't. I based it on your escape_metas - what did I do wrong?
Re^4: Transform ASCII into UniCode (escape_metas) by LanX (Saint) on Mar 23, 2021 at 18:18 UTC
Re^5: Transform ASCII into UniCode (escape_metas) by LanX (Saint) on Mar 23, 2021 at 19:08 UTC
Some notes below your chosen depth have not been shown here
Re: Transform ASCII into UniCode by BillKSmith (Monsignor) on Mar 23, 2021 at 03:32 UTC
From the documentation of tr:

> Characters may be literals, or (if the delimiters aren't single quotes) any of the escape sequences accepted in double-quoted strings. But there is never any variable interpolation, so "$" and "@" are always treated as literals.

Bill
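The garbled output in the question follows directly from this rule: in tr/$CharSet/$BoldSet/ the search list is the literal characters $, C, h, a, r, S, e, t and the replacement list is $, B, o, l, d, S, e, t, paired up by position. A small sketch to confirm it (the sample string and the contents of the two variables are made up for illustration):

```perl
use strict;
use warnings;

# These two variables are never consulted by tr; their names are just text to it.
my $CharSet = 'abc';
my $BoldSet = 'xyz';

my $text = 'The lazy brown fox';
(my $result = $text) =~ tr/$CharSet/$BoldSet/;

# h->o, a->l, r->d are the only pairs that change anything;
# $, S, e and t map to themselves, and C does not occur in the text.
print "$result\n";   # prints: Toe llzy bdown fox
```

This reproduces the "Toe", "bdown" and "llzy" from the original output, regardless of what the variables contain.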
Re: Transform ASCII into UniCode by GrandFather (Saint) on Mar 23, 2021 at 04:05 UTC
A comment rather than an answer. Consider:

use strict;
use warnings;
use Encode;

binmode STDOUT, 'utf8'; # Suppress "wide character" warnings

my $CharSet = 'a'; # ASCII
my $BoldSet = pack('U', 119834); # Unicode bold 'a'
my $Source = 'a';
my $trTarget = $Source;
my $reTarget = $Source;

$trTarget =~ tr/$CharSet/$BoldSet/;
$reTarget =~ s/$CharSet/$BoldSet/;
print "$Source\n$trTarget\n$reTarget\n";
print $BoldSet;

Prints:

a
l
𝐚
𝐚

It seems tr/// isn't the right tool for the job. :-(

Update: PerlMonks is screwing up the unicode characters. They render correctly when I paste them into the edit window, but are shown as code points when I submit the edit. Bugger.

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re^2: Transform ASCII into UniCode by Anonymous Monk on Mar 23, 2021 at 09:00 UTC
> Update: PerlMonks is screwing up the unicode characters. They render correctly when I paste them into the edit window, but are shown as code points when I submit the edit. Bugger.

Perlmonks doesn't unicode, perlmonks does windows-1252, your browser does conversion to windows-1252 ... and at some point html entities are used ...
Re: Transform ASCII into UniCode by kcott (Archbishop) on Mar 24, 2021 at 19:48 UTC
G'day Perlian,

Here's a generic technique for dealing with this type of problem which doesn't require listing every character.

$ perl -Mutf8 -C -E '
    my ($offset_0, $offset_A, $offset_a)
        = (ord("𝟎")-ord("0"), ord("𝐀")-ord("A"), ord("𝐚")-ord("a"));
    say "The quick brown fox jumps over the lazy dog 1234567890 times."
        =~ s/([0-9])/chr(ord($1)+$offset_0)/egr
        =~ s/([A-Z])/chr(ord($1)+$offset_A)/egr
        =~ s/([a-z])/chr(ord($1)+$offset_a)/egr;
'
𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠 𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟎 𝐭𝐢𝐦𝐞𝐬.

This should work fine with your `5.26.3` (I'm using `5.32.0`). As general information: `say` requires `5.10` and `/r` requires `5.14`.

Two caveats:

- Different Perl versions support different Unicode® versions: check you have a sufficiently high version of Perl to handle the Unicode characters you want to output (if in doubt, check the deltas).
- Some alphabetical sequences in "Mathematical Alphanumeric Symbols" have missing characters because they were defined in earlier versions. The first example in that block is `U+1D44E` (`𝑎`) to `U+1D467` (`𝑧`) which has `U+1D455` (`<reserved>`) because `U+210E` (`ℎ`) was already defined in "Letterlike Symbols" as `PLANCK CONSTANT`.

Here's another example to show the generality of the solution. Only three characters were changed in the code to produce completely different output.

$ perl -Mutf8 -C -E '
    my ($offset_0, $offset_A, $offset_a)
        = (ord("𝟘")-ord("0"), ord("𝕬")-ord("A"), ord("𝖆")-ord("a"));
    say "The quick brown fox jumps over the lazy dog 1234567890 times."
        =~ s/([0-9])/chr(ord($1)+$offset_0)/egr
        =~ s/([A-Z])/chr(ord($1)+$offset_A)/egr
        =~ s/([a-z])/chr(ord($1)+$offset_a)/egr;
'
𝕿𝖍𝖊 𝖖𝖚𝖎𝖈𝖐 𝖇𝖗𝖔𝖜𝖓 𝖋𝖔𝖝 𝖏𝖚𝖒𝖕𝖘 𝖔𝖛𝖊𝖗 𝖙𝖍𝖊 𝖑𝖆𝖟𝖞 𝖉𝖔𝖌 𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡𝟘 𝖙𝖎𝖒𝖊𝖘.

-- Ken
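The second caveat can be handled by checking a small exception table before applying the offset. A hedged sketch for the italic letters mentioned above: `to_math_italic` and `%exception` are hypothetical names, the capital range is assumed to start at U+1D434, and digits are left alone because the block has no italic digits.

```perl
use strict;
use warnings;

# The slot U+1D455 is reserved; italic small h lives at U+210E instead.
my %exception = ( h => "\x{210E}" );

# Hypothetical helper: offset-based mapping with a lookup for the hole.
sub to_math_italic {
    my ($text) = @_;
    my $offset_A = 0x1D434 - ord 'A';   # MATHEMATICAL ITALIC CAPITAL A
    my $offset_a = 0x1D44E - ord 'a';   # MATHEMATICAL ITALIC SMALL A
    return $text =~ s{([A-Za-z])}{
        $exception{$1} // chr( ord($1) + ( $1 le 'Z' ? $offset_A : $offset_a ) )
    }ger;
}

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character" warnings
print to_math_italic('The quick brown fox jumps over the lazy dog.'), "\n";
```

Other styles in the block have their own holes (script and fraktur capitals, for example), so the exception table would have to be filled per style from the code chart.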
Re: Transform ASCII into UniCode by Perlian (Initiate) on Mar 23, 2021 at 21:18 UTC
Thank you very much for all your answers. @choroba had the correct point: tr takes only literals for both character sets. Yes, there are ways around that by using the 'evil' eval, but that is just not necessary in my case: I just want to write a little function that accepts an ASCII string and returns a "bold" version of it. And yes, my terminal (MobaXterm) can display a pretty good chunk of the Unicode charset, including the pseudo-bold and pseudo-italic block.

Again, thank you all for guiding me back to the path of truth! 😋

Best regards from Charleston (WV), Perlian
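A function of that kind could be sketched with tr and literal ranges written as \x{...} escapes, which keeps the source file pure ASCII. This is only one possible shape, not the code Perlian ended up with: `to_bold` is a hypothetical name, the code points are those of the bold letters and digits listed in the question, and /r needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of $text in which ASCII letters and
# digits are replaced by their Mathematical Bold counterparts.
sub to_bold {
    my ($text) = @_;
    # a-z -> U+1D41A..U+1D433, A-Z -> U+1D400..U+1D419, 0-9 -> U+1D7CE..U+1D7D7
    return $text =~ tr/a-zA-Z0-9/\x{1D41A}-\x{1D433}\x{1D400}-\x{1D419}\x{1D7CE}-\x{1D7D7}/r;
}

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character" warnings
print to_bold('The quick brown fox jumps over the lazy dog 1234567890 times.'), "\n";
```

Everything outside the three ranges, such as spaces and punctuation, passes through unchanged.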
Re: Transform ASCII into UniCode by Polyglot (Chaplain) on Mar 23, 2021 at 03:52 UTC
use utf8;
use Encode qw(encode decode);

my $CharSet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; # ASCII
my $BoldSet = encode('utf8','𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗');

my $Source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
my $Target = $Source;
$Target =~ tr/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗/;

print "$Source\n$Target\n";

#The quick brown fox jumps over the lazy dog 1234567890 times.
#𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠 𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟎 𝐭𝐢𝐦𝐞𝐬.

Blessings,
~Polyglot~
