http://www.perlmonks.org?node_id=11130172

Perlian has asked for the wisdom of the Perl Monks concerning the following question:

Hi Friends, as you may know, Unicode has a block called »Mathematical Alphanumeric Symbols« (U+1D400..U+1D7FF) containing letters and digits that look like normal characters from the Latin alphabet, just styled in bold or italic. I tried a simple transliteration to turn some normal text into "bold" Unicode text and, naive as I am, did this:

    my $CharSet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; # ASCII
    my $BoldSet = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗'; # UniCode bold
    my $Source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
    my $Target = $Source;
    $Target =~ tr/$CharSet/$BoldSet/;
    print "$Source\n$Target\n";

To my surprise, the output was this:

    The quick brown fox jumps over the lazy dog 1234567890 times.
    Toe quick bdown fox jumps oved toe llzy dog 1234567890 times.

No trace of bold Unicode characters, and some characters are garbled. Does "tr" not work correctly with Unicode? I have »use utf8::all;« in my program and I am using this Perl version:

    This is perl 5, version 26, subversion 3 (v5.26.3) built for x86_64-linux-thread-multi
    (with 51 registered patches, see perl -V for more detail)

Thank you very much in advance for your help. Best regards from Charleston (WV), Perlian

Replies are listed 'Best First'.
Re: Transform ASCII into UniCode
by choroba (Cardinal) on Mar 23, 2021 at 08:02 UTC
    If you want to use tr with dynamic strings (which is NOT the case here), you need to use string eval. Be sure to only use it for validated strings, never a random user input!
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use utf8;
    
    use open OUT => ':encoding(UTF-8)', ':std';
    
    my $charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    my $boldset = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗';
    
    my $source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
    my $target = $source;
    eval "\$target =~ tr/$charset/$boldset/";
    
    say for $source, $target;
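
For character sets that really are dynamic, a lookup table avoids string eval altogether. The following is only a sketch of that alternative, not something proposed in the thread: `translit` is a hypothetical helper name, and unlike tr it does not understand ranges such as a-z, so both sets must be spelled out character by character.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open OUT => ':encoding(UTF-8)', ':std';

# translit() is a hypothetical helper (an assumption of this sketch).
# It pairs the characters of $from and $to by position and replaces
# every character that has a mapping, leaving all others untouched.
sub translit {
    my ($string, $from, $to) = @_;
    my @from = split //, $from;
    my @to   = split //, $to;
    die "character sets differ in length\n" unless @from == @to;

    my %map;
    @map{@from} = @to;    # hash slice: one entry per character pair

    # /s lets . match newlines too; /r returns the modified copy (5.14+).
    return $string =~ s/(.)/exists $map{$1} ? $map{$1} : $1/gesr;
}

my $charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
my $boldset = '𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗';

print translit('The quick brown fox', $charset, $boldset), "\n";
```

Because nothing is compiled at run time, delimiters and backslashes in the input need no escaping. It is likely slower than tr on long strings, which should rarely matter for a job like this one.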
    

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      > Be sure to only use it for validated strings, never a random user input!

      Here is a generic routine to escape only selected meta-characters.

      Escaping any / (or other delimiter) in the input should make it safe to apply

      eval "\$target =~ tr/$charset/$boldset/";

      use v5.12;
      use warnings;
      use Data::Dump qw/pp dd/;
      use Test::More;

      sub escape_metas {
          my ( $meta,$e ) = @_ ;
          $e //= '\\';          # default backslash
          my $ee ="\Q$e";       # don't mess my regex

          s[
               (?|
                   $ee($ee)     # ignore double escapes
               |   $ee($meta)   # keep single escapes
               |   ($meta)      # escape meta
               )
           ]
           [$e$1]xgr;
      }

      my $e = '\\';   # escape code
      my $m = '/';    # to be escaped

      for ("$m", "$e$e$m", "$e$e$e$e$m" ) {
          my $got = escape_metas($m,$e);
          is( $got, "$e$_" , "escaping $_ -> $got");
      }

      for ("$e$m", "$e$e$e$m" ) {
          my $got = escape_metas($m,$e);
          is( $got, $_ , "ignoring $_ eq $got");
      }

      done_testing;

      C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/escapism.pl
      ok 1 - escaping / -> \/
      ok 2 - escaping \\/ -> \\\/
      ok 3 - escaping \\\\/ -> \\\\\/
      ok 4 - ignoring \/ eq \/
      ok 5 - ignoring \\\/ eq \\\/
      1..5

      Please tell me if I missed a case; I tried to write it as generically as possible.

      EDIT

      More or better tests are welcome too. =)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        I'm probably too busy today to understand. We wanted to escape the strings so they can be used in a transliteration, right? Why not test it directly, then?
        sub use_it {
            my ($string, $search, $replace) = @_;
            my ($s, $r);
            $s = escape_metas('/', '\\') for $search;
            $r = escape_metas('/', '\\') for $replace;
            return eval "\$string =~ tr/$s/$r/r"
        }

        sub cheat {
            my ($string, $search, $replace) = @_;
            return eval "\$string =~ tr|\Q$search\E|\Q$replace\E|r"
        }

        sub simulate {
            my ($string, $search, $replace) = @_;
            my $result = $string;
            for my $i (0 .. length($search) - 1) {
                my $from = substr $search, $i, 1;
                my $to = substr $replace, $i, 1;
                $result =~ s/\Q$from/$to/g;
            }
            return $result
        }

        for my $case (
            # String      search   replace expect
            ['a/b'     => 'a/b',   'xyz',  'xyz'],
            ['a\\b'    => 'a\\b',  'xyz',  'xyz'],
            ['a/b'     => '\\/',   'xy',   'ayb'],
            ['a\\/b'   => '\\/',   'xy',   'axyb'],
            ['a/\\b'   => '\\/',   'xy',   'ayxb'],
            ['a\\\\b'  => '\\/',   'xy',   'axxb'],
            ['a\\\\/b' => '\\/',   'xy',   'axxyb'],
        ) {
            is simulate(@$case), $case->[-1], 'simulate';
            is cheat(@$case), simulate(@$case), 'cheat';
            is use_it(@$case), simulate(@$case), 'use';
        }
        I'm not sure I got the "expect" right, but both "simulate" and "cheat" give the same results. "use", on the other hand, doesn't. I based it on your escape_metas - what did I do wrong?

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Transform ASCII into UniCode
by BillKSmith (Monsignor) on Mar 23, 2021 at 03:32 UTC
    From the documentation of tr:
    Characters may be literals, or (if the delimiters aren't single quotes) any of the escape sequences accepted in double-quoted strings. But there is never any variable interpolation, so "$" and "@" are always treated as literals.
    Bill
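
The quoted rule can be checked with a few lines. This is a minimal sketch using only ASCII so that it runs on any terminal; the variable names `$literal` and `$mapped` are made up for the illustration. With no interpolation, `tr/$CharSet/$BoldSet/` pairs the literal characters `$ C h a r S e t` with `$ B o l d S e t`, which is where the h→o, a→l and r→d substitutions in the original output come from.

```perl
use strict;
use warnings;

my $CharSet = 'abc';    # never consulted by the tr below
my $BoldSet = 'xyz';    # likewise

# No interpolation: the search list is the eight characters "$CharSet",
# the replacement list the eight characters "$BoldSet".
my $word = 'The';
(my $literal = $word) =~ tr/$CharSet/$BoldSet/;
print "$literal\n";     # h is replaced by o, giving "Toe"

# Spelling the lists out (ranges are allowed) gives the intended mapping.
(my $mapped = 'cab') =~ tr/a-c/x-z/;
print "$mapped\n";      # prints "zxy"
```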
Re: Transform ASCII into UniCode
by GrandFather (Saint) on Mar 23, 2021 at 04:05 UTC

    A comment rather than an answer. Consider:

    use strict;
    use warnings;
    use Encode;

    binmode *STDOUT, 'utf8';    # Suppress "wide character" warnings

    my $CharSet = 'a';                  # ASCII
    my $BoldSet = pack('U', 119834);    # Unicode bold 'a'
    my $Source = 'a';
    my $trTarget = $Source;
    my $reTarget = $Source;

    $trTarget =~ tr/$CharSet/$BoldSet/;
    $reTarget =~ s/$CharSet/$BoldSet/;

    print "$Source\n$trTarget\n$reTarget\n";
    print $BoldSet;

    Prints:

    a
    l
    𝐚
    𝐚

    It seems tr/// isn't the right tool for the job. :-(

    Update: PerlMonks is screwing up the unicode characters. They render correctly when I paste them into the edit window, but are shown as code points when I submit the edit. Bugger.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Transform ASCII into UniCode
by kcott (Archbishop) on Mar 24, 2021 at 19:48 UTC

    G'day Perlian,

    Here's a generic technique for dealing with this type of problem which doesn't require listing every character.

    $ perl -Mutf8 -C -E '
        my ($offset_0, $offset_A, $offset_a)
            = (ord("𝟎")-ord("0"), ord("𝐀")-ord("A"), ord("𝐚")-ord("a"));
        say "The quick brown fox jumps over the lazy dog 1234567890 times."
            =~ s/([0-9])/chr(ord($1)+$offset_0)/egr
            =~ s/([A-Z])/chr(ord($1)+$offset_A)/egr
            =~ s/([a-z])/chr(ord($1)+$offset_a)/egr;
    '
    𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠 𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟎 𝐭𝐢𝐦𝐞𝐬.
    

    This should work fine with your 5.26.3 (I'm using 5.32.0). As general information: say requires 5.10 and /r requires 5.14.

    Two caveats:

    • Different Perl versions support different Unicode® versions: check you have a sufficiently high version of Perl to handle the Unicode characters you want to output (if in doubt, check the deltas).
    • Some alphabetical sequences in [PDF] "Mathematical Alphanumeric Symbols" have gaps because the missing characters had already been defined in earlier Unicode versions. The first example in that block is U+1D44E (𝑎) to U+1D467 (𝑧), where U+1D455 is <reserved> because U+210E (ℎ) was already defined in [PDF] "Letterlike Symbols" as PLANCK CONSTANT.
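
One way to deal with such a gap is to handle the exceptional letter before applying the offsets. The sketch below does that for the italic alphabet mentioned in the second caveat; `to_math_italic` is a hypothetical name, and the code assumes plain ASCII input. Digits are left alone because the block has no italic digits.

```perl
use strict;
use warnings;
use utf8;
use open OUT => ':encoding(UTF-8)', ':std';

# to_math_italic() is a hypothetical helper for this sketch.
sub to_math_italic {
    my ($text) = @_;
    my $offset_A = 0x1D434 - ord 'A';    # MATHEMATICAL ITALIC CAPITAL A
    my $offset_a = 0x1D44E - ord 'a';    # MATHEMATICAL ITALIC SMALL A

    return $text
        =~ s/h/\x{210E}/gr                           # U+1D455 is reserved: use PLANCK CONSTANT
        =~ s/([A-Z])/chr(ord($1) + $offset_A)/egr
        =~ s/([a-z])/chr(ord($1) + $offset_a)/egr;   # no "h" is left at this point
}

print to_math_italic('The quick brown fox'), "\n";
```

The order matters: the "h" substitution runs first, so the later [a-z] pass never produces the reserved code point.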

    Here's another example to show the generality of the solution. Only three characters were changed in the code to produce completely different output.

    $ perl -Mutf8 -C -E '
        my ($offset_0, $offset_A, $offset_a)
            = (ord("𝟘")-ord("0"), ord("𝕬")-ord("A"), ord("𝖆")-ord("a"));
        say "The quick brown fox jumps over the lazy dog 1234567890 times."
            =~ s/([0-9])/chr(ord($1)+$offset_0)/egr
            =~ s/([A-Z])/chr(ord($1)+$offset_A)/egr
            =~ s/([a-z])/chr(ord($1)+$offset_a)/egr;
    '
    𝕿𝖍𝖊 𝖖𝖚𝖎𝖈𝖐 𝖇𝖗𝖔𝖜𝖓 𝖋𝖔𝖝 𝖏𝖚𝖒𝖕𝖘 𝖔𝖛𝖊𝖗 𝖙𝖍𝖊 𝖑𝖆𝖟𝖞 𝖉𝖔𝖌 𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡𝟘 𝖙𝖎𝖒𝖊𝖘.
    

    — Ken

Re: Transform ASCII into UniCode
by Perlian (Initiate) on Mar 23, 2021 at 21:18 UTC
    Thank you very much for all your answers. @choroba had the correct point: tr takes only literals for both character sets. Yes, there are ways around that using the `evil' eval, but that is just not necessary in my case: I just want to write a little function that accepts an ASCII string and returns a "bold" version of it. And yes, my terminal (MobaXterm) is capable of displaying a pretty good chunk of the Unicode charset, including the pseudo-bold and -italic block. Again, thank you all for guiding me back to the path of truth! 😋 Best regards from Charleston (WV), Perlian
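
A function like the one described could look roughly as follows. This is a sketch, not code from the thread: `to_bold` is a hypothetical name, and the character lists are written as literal ranges with `\x{...}` escapes so that tr needs neither interpolation nor eval. The /r flag on tr requires Perl 5.14 or later.

```perl
use strict;
use warnings;
use open OUT => ':encoding(UTF-8)', ':std';

# to_bold() is a hypothetical helper for this sketch.
# a-z -> U+1D41A..U+1D433, A-Z -> U+1D400..U+1D419, 0-9 -> U+1D7CE..U+1D7D7
sub to_bold {
    my ($text) = @_;
    return $text
        =~ tr/a-zA-Z0-9/\x{1D41A}-\x{1D433}\x{1D400}-\x{1D419}\x{1D7CE}-\x{1D7D7}/r;
}

print to_bold('The quick brown fox jumps over the lazy dog 1234567890 times.'), "\n";
```

Characters outside the three ASCII ranges, such as spaces and punctuation, pass through unchanged.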
Re: Transform ASCII into UniCode
by Polyglot (Chaplain) on Mar 23, 2021 at 03:52 UTC
    use utf8;
    use Encode qw(encode decode);

    my $CharSet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; # ASCII
    my $BoldSet = encode('utf8','𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗');
    my $Source = 'The quick brown fox jumps over the lazy dog 1234567890 times.';
    my $Target = $Source;
    $Target =~ tr/abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗/;
    print "$Source\n$Target\n";

    #The quick brown fox jumps over the lazy dog 1234567890 times.
    #𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠 𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟎 𝐭𝐢𝐦𝐞𝐬.

    Blessings,

    ~Polyglot~