uparse - Parse Unicode strings

by kcott (Archbishop)
on Nov 18, 2023 at 08:53 UTC ( [id://11155675]=CUFP )

Improvement: See "Re: Decoding @ARGV [Was: uparse - Parse Unicode strings]" for an improved version of the code; mostly thanks to ++jo37 and the subthread starting with "Re: uparse - Parse Unicode strings" and continued in "Decoding @ARGV [Was: uparse - Parse Unicode strings]".

In the last month or so, we've had a number of threads where emoji were discussed. Some notable examples: "Larger profile pic than 80KB?"; "Perl Secret Operator Emojis"; and "Emojis for Perl Monk names".

Many emoji have embedded characters which are difficult, or impossible, to see; for example, zero-width joiners, variation selectors, skin tone modifiers. In some cases, glyphs are so similar that it's difficult to tell them apart; e.g. 🧑 & 👨.

I wrote uparse to split emoji, strings containing emoji, and in fact any strings with Unicode characters, into their component characters.

#!/usr/bin/env perl

BEGIN {
    if ($] < 5.007003) {
        warn "$0 requires Perl v5.7.3 or later.\n";
        exit;
    }

    unless (@ARGV) {
        warn "Usage: $0 string [string ...]\n";
        exit;
    }
}

use 5.007003;
use strict;
use warnings;
use open IO => qw{:encoding(UTF-8) :std};

use constant {
    SEP1 => '=' x 60 . "\n",
    SEP2 => '-' x 60 . "\n",
    FMT => "%s\tU+%-6X %s\n",
    NO_PRINT => "\N{REPLACEMENT CHARACTER}",
};

use Encode 'decode';
use Unicode::UCD 'charinfo';

for my $raw_str (@ARGV) {
    my $str = decode('UTF-8', $raw_str);
    print "\n", SEP1;
    print "String: '$str'\n";
    print SEP1;
    for my $char (split //, $str) {
        my $code_point = ord $char;
        my $char_info = charinfo($code_point);
        if (! defined $char_info) {
            $char_info->{name} = "<unknown> Perl $^V supports Unicode "
                . Unicode::UCD::UnicodeVersion();
        }
        printf FMT, ($char =~ /^\p{Print}$/ ? $char : NO_PRINT),
            $code_point, $char_info->{name};
    }
    print SEP2;
}

Here's a number of example runs. All use <pre> blocks; a very few didn't need this but I chose to go with consistency.

Works with ASCII (aka Unicode: C0 Controls and Basic Latin)

$ uparse X XY "X        Z"

============================================================
String: 'X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------

============================================================
String: 'XY'
============================================================
X       U+58     LATIN CAPITAL LETTER X
Y       U+59     LATIN CAPITAL LETTER Y
------------------------------------------------------------

============================================================
String: 'X      Z'
============================================================
X       U+58     LATIN CAPITAL LETTER X
�       U+9      <control>
Z       U+5A     LATIN CAPITAL LETTER Z
------------------------------------------------------------

The two similar emoji heads (mentioned above)

$ uparse 🧑 👨

============================================================
String: '🧑'
============================================================
🧑      U+1F9D1  ADULT
------------------------------------------------------------

============================================================
String: '👨'
============================================================
👨      U+1F468  MAN
------------------------------------------------------------

A complex ZWJ sequence

$ uparse 👨🏽‍✈️

============================================================
String: '👨🏽‍✈️'
============================================================
👨      U+1F468  MAN
🏽      U+1F3FD  EMOJI MODIFIER FITZPATRICK TYPE-4
        U+200D   ZERO WIDTH JOINER
✈       U+2708   AIRPLANE
        U+FE0F   VARIATION SELECTOR-16
------------------------------------------------------------
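The invisible components in a sequence like this can also be labelled by role. The following is only a minimal sketch, not part of uparse: the sub name emoji_role and its labels are made up for illustration, and it assumes that the Join_Control and Variation_Selector properties plus the U+1F3FB..U+1F3FF modifier range are enough for the cases shown in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: label one character by the role it plays
# inside an emoji sequence.
sub emoji_role {
    my ($char) = @_;
    my $cp = ord $char;
    return 'joiner'             if $char =~ /\A\p{Join_Control}\z/;
    return 'variation selector' if $char =~ /\A\p{Variation_Selector}\z/;
    return 'skin tone modifier' if $cp >= 0x1F3FB && $cp <= 0x1F3FF;
    return 'visible';
}

# The pilot sequence from the example run, written as escapes.
my $pilot = "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}";
printf "U+%-6X %s\n", ord($_), emoji_role($_) for split //, $pilot;
```

Anything that is not a joiner, selector or modifier is simply reported as visible, so control characters such as a tab would need separate handling.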

Maps

$ uparse 🇨🇭

============================================================
String: '🇨🇭'
============================================================
🇨       U+1F1E8  REGIONAL INDICATOR SYMBOL LETTER C
🇭       U+1F1ED  REGIONAL INDICATOR SYMBOL LETTER H
------------------------------------------------------------
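The two code points above follow from simple arithmetic: the regional indicator symbols run from U+1F1E6 (letter A) to U+1F1FF (letter Z), so each letter of a two-letter country code maps to one of them. A small sketch of that mapping, with flag_code_points as a made-up name and no validation of the input:

```perl
use strict;
use warnings;

# Hypothetical helper: map a two-letter country code to the code points
# of its regional indicator symbols (A => U+1F1E6 ... Z => U+1F1FF).
sub flag_code_points {
    my ($country) = @_;
    return map { 0x1F1E6 + ord($_) - ord('A') } split //, uc $country;
}

printf "U+%X\n", $_ for flag_code_points('CH');
```

For 'CH' this gives U+1F1E8 and U+1F1ED, matching the uparse output.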

Handles codepoints not yet assigned; or not supported with certain Perl versions

$ uparse `perl -C -e 'print "X\x{1fa7c}X"'`

============================================================
String: 'X🩼X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
🩼      U+1FA7C  CRUTCH
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------

$ uparse `perl -C -e 'print "X\x{1fa7c}X"'`

============================================================
String: 'X🩼X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
�       U+1FA7C  <unknown> Perl v5.30.0 supports Unicode 12.1.0
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------

$ uparse `perl -C -e 'print "X\x{1fa7d}X"'`

============================================================
String: 'X🩽X'
============================================================
X       U+58     LATIN CAPITAL LETTER X
�       U+1FA7D  <unknown> Perl v5.39.3 supports Unicode 15.0.0
X       U+58     LATIN CAPITAL LETTER X
------------------------------------------------------------
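The fallback seen in the last two runs relies on charinfo returning undef for code points that the Unicode version bundled with the running Perl does not know. A minimal sketch of just that lookup follows; name_or_fallback is a made-up name, and the noncharacter U+FFFF is used only because it should take the fallback branch on any Perl version.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Hypothetical helper: return the character name, or a fallback string
# when charinfo() has no entry for the code point.
sub name_or_fallback {
    my ($code_point) = @_;
    my $info = charinfo($code_point);
    return $info->{name} if defined $info;
    return '<unknown> Unicode ' . Unicode::UCD::UnicodeVersion();
}

print name_or_fallback(0x58), "\n";     # an assigned character
print name_or_fallback(0xFFFF), "\n";   # a noncharacter: fallback branch
```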

Enjoy!

— Ken

Replies are listed 'Best First'.
Re: uparse - Parse Unicode strings
by Tux (Canon) on Nov 18, 2023 at 10:05 UTC

    The penguin is part of my prompt

    Download uchar

    tux 🐧 uchar --help
    usage: uchar -v [-m base:count [ -m base:count ] ...
           uchar -v -f char ...
      perl 5.38.0 with Unicode 15.0.0
    
            -m      show maps
            -v      verbosity
            -l      list GBA characters
            -f      find
            -F      find (only chars supported in current font)
             -s     splash all characters found into a single string
            -k      show matching key combo(s)
            -d      apply random diacricals
            -e      show character encodings (uchar -e -f u_BREVE)
             -o     also show octal version of encoding
            -E      show character decodings (uchar -E fc)
            -b      strip to base
            -D      show codepoints in decimal
            -c      copy found string(s) to clipboard
            -h      also show html entity if available
    
    tux 🐧 uchar -v X🩼X
    X U00058 \N{LATIN CAPITAL LETTER X}
    🩼 U1fa7c \N{CRUTCH}
    X U00058 \N{LATIN CAPITAL LETTER X}
    tux 🐧 uchar -v U+1f427
    🐧 U1f427 \N{PENGUIN}
    tux 🐧 uchar -e U+1f427
    🐧 U1f427 \N{PENGUIN}
    
      cp1026                         6f
      cp1047                         6f
      cp37                           6f
      cp424                          6f
      cp500                          6f
      cp875                          6f
      gb12345-raw                    22
      gb2312-raw                     22
      hz                             22
      iso-2022-kr                    1b2429435c787b31663432377d
      iso-ir-165                     22
      jis0208-raw                    20
      jis0212-raw                    22
      ksc5601-raw                    22
      posix-bc                       6f
      UCS-2BE                        fffd
      UCS-2LE                        fdff
      UTF-16                         feffd83ddc27
      UTF-16BE                       d83ddc27
      UTF-16LE                       3dd827dc
      UTF-32                         0000feff0001f427
      UTF-32BE                       0001f427
      UTF-32LE                       27f40100
      UTF-7                          2b324433634a772d
      utf-8-strict                   f09f90a7
      utf8                           f09f90a7
    tux 🐧 uchar -E f09f90a7 | grep utf
      utf-8-strict                   🐧
      utf8                           🐧     (U+1F427)
    tux 🐧 uchar -Fk "L WITH STROKE"
    Searching for (?^u:\bL WITH STROKE\b)
    000141 Ł LSTROKE_IDX     LATIN CAPITAL LETTER L WITH STROKE
             #<Multi_key> <L> <minus>
             #<Multi_key> <minus> <L>
             <Multi_key> <L> <slash>
             <Multi_key> <L> <underscore>
             <Multi_key> <slash> <L>
             <Multi_key> <underscore> <L>
    000142 ł lSTROKE_IDX     LATIN SMALL LETTER L WITH STROKE
             #<Multi_key> <l> <minus>
             #<Multi_key> <minus> <l>
             <Multi_key> <l> <slash>
             <Multi_key> <l> <underscore>
             <Multi_key> <slash> <l>
             <Multi_key> <underscore> <l>
    
    tux $ perl -CEO -wE'say "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}"'
    👨🏽✈️
    
    tux $ raku -e'"\x[1F468]\x[1F3FD]\x[200D]\x[2708]\x[FE0F]".say'
    👨🏽✈️
    
    tux $ raku -e'"\x[1F468]\x[1F3FD]\x[200D]\x[2708]\x[FE0F]".say' | xargs uchar -v
    👨 U1f468 \N{MAN}
    🏽 U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4}
     U0200d \N{ZERO WIDTH JOINER}
    ✈ U02708 \N{AIRPLANE}
    ️ U0fe0f \N{VARIATION SELECTOR-16}
    

    Enjoy, Have FUN! H.Merijn
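A few of the rows in the encoding table above can be reproduced with the core Encode module. This is only a sketch of the idea and says nothing about how uchar does it internally; hex_octets is a made-up name.

```perl
use strict;
use warnings;
use Encode 'encode';

# Hypothetical helper: encode one character and return its octets as hex.
sub hex_octets {
    my ($encoding, $char) = @_;
    return unpack 'H*', encode($encoding, $char);
}

# U+1F427 PENGUIN in three of the encodings listed above.
printf "%-10s %s\n", $_, hex_octets($_, "\x{1F427}")
    for qw(UTF-8 UTF-16BE UTF-32LE);
```

The three values printed should match the utf-8-strict, UTF-16BE and UTF-32LE rows of the table.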

      That's a very comprehensive solution with substantially more functionality than I needed. It probably deserves its own CUFP page.

      — Ken

      Wow, very impressive! ... agree with kcott that it deserves its own CUFP page.

      I played briefly with your command on Ubuntu using perl v5.38:

      ~/pm/Tux$ perl -CEO -wE'say "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}"'
      👨🏽‍✈️
      
      ~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F'
      👨🏽‍✈️
      

      AFAICT, the output from the perl -CEO and the bash echo -e commands above is identical, namely:

      &#128104;&#127997;&#8205;&#9992;&#65039;

      Running this command produced useful output (that seems to match yours), despite the error messages:

      ~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F' | xargs uchar -v
      Can't exec "locate": No such file or directory at ~/pm/Tux/uchar line 103.
      👨 U1f468 \N{MAN}
      🏽 U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4}
      ‍ U0200d \N{ZERO WIDTH JOINER}
      ✈ U02708 \N{AIRPLANE}
      ️ U0fe0f \N{VARIATION SELECTOR-16}
      

      Using CODE blocks instead of pre:

      ~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F' | xargs uchar -v
      Can't exec "locate": No such file or directory at ~/pm/Tux/uchar line 103.
      &#128104; U1f468 \N{MAN}
      &#127997; U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4}
      &#8205; U0200d \N{ZERO WIDTH JOINER}
      &#9992; U02708 \N{AIRPLANE}
      &#65039; U0fe0f \N{VARIATION SELECTOR-16}

      👁️🍾👍🦟

        Fetch again. Now guarded. /me wonders how people work on a devel machine without mlocate :)


        Enjoy, Have FUN! H.Merijn
Re: uparse - Parse Unicode strings
by ikegami (Patriarch) on Nov 19, 2023 at 04:43 UTC

    See also: unichars and uniprops from Unicode::Tussle.

    $ unichars '\p{Emoji}' | wc -l
    178
    
    $ unichars '\p{Emoji}' | head -n 30 | tail -n 20
     8  U+0038 DIGIT EIGHT
    ‭ 9  U+0039 DIGIT NINE
    ‭ ©  U+00A9 COPYRIGHT SIGN
    ‭ ®  U+00AE REGISTERED SIGN
    ‭ ‼  U+203C DOUBLE EXCLAMATION MARK
    ‭ ⁉  U+2049 EXCLAMATION QUESTION MARK
    ‭ ™  U+2122 TRADE MARK SIGN
    ‭ ℹ  U+2139 INFORMATION SOURCE
    ‭ ↔  U+2194 LEFT RIGHT ARROW
    ‭ ↕  U+2195 UP DOWN ARROW
    ‭ ↖  U+2196 NORTH WEST ARROW
    ‭ ↗  U+2197 NORTH EAST ARROW
    ‭ ↘  U+2198 SOUTH EAST ARROW
    ‭ ↙  U+2199 SOUTH WEST ARROW
    ‭ ↩  U+21A9 LEFTWARDS ARROW WITH HOOK
    ‭ ↪  U+21AA RIGHTWARDS ARROW WITH HOOK
    ‭ ⌚ U+231A WATCH
    ‭ ⌛ U+231B HOURGLASS
    ‭ ⌨  U+2328 KEYBOARD
    ‭ ⏏  U+23CF EJECT SYMBOL
    
    $ uniprops 🧑
    U+1F9D1 ‹🧑› \N{ADULT}
        \pS \p{So}
        All Any Assigned Common Zyyy EBase Emoji_Modifier_Base Emoji Emoji_Presentation EPres Extended_Pictographic ExtPict
           So S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Symbol Print X_POSIX_Print Symbol
           Sup_Symbols_And_Pictographs Supplemental_Symbols_And_Pictographs InSupSymbolsAndPictographs Unicode
    
    $ uniprops U+1F9D1
    U+1F9D1 ‹🧑› \N{ADULT}
        \pS \p{So}
        All Any Assigned Common Zyyy EBase Emoji_Modifier_Base Emoji Emoji_Presentation EPres Extended_Pictographic ExtPict
           So S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Symbol Print X_POSIX_Print Symbol
           Sup_Symbols_And_Pictographs Supplemental_Symbols_And_Pictographs InSupSymbolsAndPictographs Unicode
    

      Thanks for that. There's a huge amount of documentation to go through. I've had a brief look and it seems like there are a number of very useful tools.

      — Ken

Re: uparse - Parse Unicode strings
by hippo (Bishop) on Nov 18, 2023 at 10:19 UTC
    BEGIN {
        if ($] < 5.007003) {
            warn "$0 requires Perl v5.7.3 or later.\n";
            exit;
        }

        unless (@ARGV) {
            warn "Usage: $0 string [string ...]\n";
            exit;
        }
    }

    I'm intrigued as to why you would warn and then immediately exit instead of just die. eg.

    BEGIN {
        die "$0 requires Perl v5.7.3 or later.\n" if $] < 5.007003;
        die "Usage: $0 string [string ...]\n" unless @ARGV;
    }

    Please enlighten me?


    🦛

      With my original code, the messages look like this:

      $ uparse
      /home/ken/local/bin/uparse requires Perl v5.7.3 or later.
      $ uparse
      Usage: /home/ken/local/bin/uparse string [string ...]

      With your suggestion, the messages look like this:

      $ uparse
      /home/ken/local/bin/uparse requires Perl v5.7.3 or later.
      BEGIN failed--compilation aborted at /home/ken/local/bin/uparse line 15.
      $ uparse
      Usage: /home/ken/local/bin/uparse string [string ...]
      BEGIN failed--compilation aborted at /home/ken/local/bin/uparse line 15.

      I didn't want the "BEGIN failed--compilation aborted at ..." lines.

      — Ken

        Thanks - I understand now. It's for neatness of output (well, stderr really) and is only an issue because of the BEGIN block which itself is necessary for the version check to fire before we hit newer syntax/features.


        🦛

Re: uparse - Parse Unicode strings
by eyepopslikeamosquito (Archbishop) on Nov 18, 2023 at 10:15 UTC

    Brilliant work kcott!

    Everything I've tested so far works like a charm on my Ubuntu Linux VM (running perl v5.38.0 built from source as described here).

    A lot more convenient than the crude hack I was using, namely to click on the little xml link on a post to see the decimal values of the Unicode emojis. For example, clicking on the xml link on your post now allows me to see:

    ... difficult to tell them apart; e.g. <tt>&#129489;</tt> & <tt>&#128104;</tt>.

    which I can then crudely translate back and forth between hex and decimal via one liners such as:

    C:\> perl -e "printf q{%X}, 129489"
    1F9D1
    C:\> perl -e "printf q{%d}, 0x1F9D1"
    129489
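The same conversion can be applied to every decimal entity in a line of the xml view at once. A rough sketch, with entities_to_code_points as a made-up name and only decimal entities of the form &#NNN; handled:

```perl
use strict;
use warnings;

# Hypothetical helper: pull decimal HTML entities out of a string and
# return them in U+XXXX notation.
sub entities_to_code_points {
    my ($text) = @_;
    return map { sprintf('U+%X', $_) } $text =~ /&#(\d+);/g;
}

print join(' ', entities_to_code_points('&#129489; & &#128104;')), "\n";
```

For the two entities quoted above this prints U+1F9D1 U+1F468.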

    That was working fine until Discipulus posted an emoji to me in the Chatterbox the other day ... and, oops, there was no xml link to click on! :)

    👁️🍾👍🦟

      I'm glad you liked it.

      It was actually prompted when looking at "Emojis for Perl Monk names" and being unable to determine what the emoji for tye was. Now that I know, it seems obvious:

      $ uparse 👔
      
      ============================================================
      String: '👔'
      ============================================================
      👔      U+1F454  NECKTIE
      ------------------------------------------------------------
      

      The emoji for gellyfish didn't even render for me; but I was still able to get information about it.

      $ uparse 🪼
      
      ============================================================
      String: '🪼'
      ============================================================
      🪼      U+1FABC  JELLYFISH
      ------------------------------------------------------------
      

      There's also things like the emoji for GrandFather, which I can only select as a single entity, but would benefit from some analysis.

      $ uparse 👨‍🦳‍👧‍👦
      
      ============================================================
      String: '👨‍🦳‍👧‍👦'
      ============================================================
      👨      U+1F468  MAN
              U+200D   ZERO WIDTH JOINER
      🦳      U+1F9B3  EMOJI COMPONENT WHITE HAIR
              U+200D   ZERO WIDTH JOINER
      👧      U+1F467  GIRL
              U+200D   ZERO WIDTH JOINER
      👦      U+1F466  BOY
      ------------------------------------------------------------
      

      Maybe at some future point we can add the white hair to this family setting:

      $ uparse 👨‍👧‍👦
      
      ============================================================
      String: '👨‍👧‍👦'
      ============================================================
      👨      U+1F468  MAN
              U+200D   ZERO WIDTH JOINER
      👧      U+1F467  GIRL
              U+200D   ZERO WIDTH JOINER
      👦      U+1F466  BOY
      ------------------------------------------------------------
      

      Although, maybe you can already do this with your Win11 Segoe UI Emoji font. Can you?

      — Ken

        Maybe at some future point we can add the white hair to this family setting ... maybe you can already do this with your Win11 Segoe UI Emoji font. Can you?

        You read me like a book, that's exactly what I was trying to do! :) ... and was bitterly disappointed when it didn't work.

        For completeness, I ran a simple standalone test using Windows 11 PowerShell.

        PS C:\> $joiner = [char]::ConvertFromUtf32(0x200D)
        PS C:\> $man = [char]::ConvertFromUtf32(0x1F468)
        PS C:\> $girl = [char]::ConvertFromUtf32(0x1F467)
        PS C:\> $boy = [char]::ConvertFromUtf32(0x1F466)
        PS C:\> $whitehair = [char]::ConvertFromUtf32(0x1F9B3)

        PS C:\> "$man$joiner$girl$joiner$boy"
        👨‍👧‍👦
        

        PS C:\> "$man$joiner$whitehair$joiner$girl$joiner$boy"
        👨‍🦳‍👧‍👦
        

        Running an equivalent test on Ubuntu bash with echo -e produced the same depressing result. It seems you can enjoy a family emoji with a default man, but not a man with white hair. Maybe a Unicode emoji expert knows how to do it, but I don't.

        👁️🍾👍🦟
Decoding @ARGV [Was: uparse - Parse Unicode strings]
by jo37 (Deacon) on Nov 22, 2023 at 20:38 UTC

    Hi Ken!

    Tried to find a general solution to the problem reported in Re: uparse - Parse Unicode strings.


    Short explanation of the problem:
    There are two basic ways to get correct UNICODE input from the elements in @ARGV:

    • implicit decoding with a runtime option -CA or an environment setting PERL_UNICODE=A
    • explicit decoding using Encode::decode
    Either may be used, but not both.
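To illustrate why the two must not be combined, here is a small sketch that decodes the same data twice and then once more behind an is_utf8 check. The sub name decode_once is made up, and the exact result of the second decode is an assumption about Encode's default handling of malformed input; the sketch only relies on it no longer being U+00E4.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical helper: decode only if the string is not already
# flagged as character data.
sub decode_once {
    my ($arg) = @_;
    return is_utf8($arg) ? $arg : decode('UTF-8', $arg);
}

my $octets  = "\xC3\xA4";                  # UTF-8 octets for U+00E4
my $decoded = decode('UTF-8', $octets);    # one character, U+00E4
my $twice   = decode('UTF-8', $decoded);   # decoded a second time

printf "once:    U+%04X\n", ord($decoded);
printf "twice:   U+%04X\n", ord($twice);
printf "guarded: U+%04X\n", ord(decode_once($decoded));
```

Characters above U+00FF cannot be turned back into octets at all, so for those a second decode is expected to fail rather than return replacement characters.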


    A script that expects UNICODE data from @ARGV cannot easily detect if the implicit decoding is in effect, especially because -CAL makes the behaviour locale-dependent.

    The best solution I could find is to check if the data in question is already marked to be in UTF-8. Encode::is_utf8 (or the equivalent utf8::is_utf8) may be used to check this flag, which results in a small modification to your script:

    diff --git a/uparse b/uparse
    index f5edb92..b05e12a 100755
    --- a/uparse
    +++ b/uparse
    @@ -23,11 +23,11 @@ use constant {
         NO_PRINT => "\N{REPLACEMENT CHARACTER}",
     };
     
    -use Encode 'decode';
    +use Encode qw(decode is_utf8);
     use Unicode::UCD 'charinfo';
     
     for my $raw_str (@ARGV) {
    -    my $str = decode('UTF-8', $raw_str);
    +    my $str = is_utf8($raw_str) ? $raw_str : decode('UTF-8', $raw_str);
         print "\n", SEP1;
         print "String: '$str'\n";
         print SEP1;

    What do you think about this?

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

      ++ Thanks for your analysis and patch.

      I had planned, assuming I had sufficient time to spare, to have a further look at uparse this weekend and get it to work on different platforms, and in various environments. What you've provided is a good start and helps a lot.

      — Ken

      [A follow-up to "Re: Decoding @ARGV [Was: uparse - Parse Unicode strings]".]

      I'm not going to have sufficient spare time to do all that I wanted this weekend. I have managed to incorporate your changes and do a couple of other minor things.

      When prefixing the uparse command with PERL_UNICODE=A or PERL_UNICODE=SDAL, I just get "Wide character at ..." and no other output. I made these changes:

      • Added changes from your patch.
      • Changed "use open IO ..." to "use open OUT ...".
      • Modified the code layout (mostly to avoid wrapping in PM).

      Here's the new code:

      #!/usr/bin/env perl

      BEGIN {
          if ($] < 5.007003) {
              warn "$0 requires Perl v5.7.3 or later.\n";
              exit;
          }

          unless (@ARGV) {
              warn "Usage: $0 string [string ...]\n";
              exit;
          }
      }

      use 5.007003;
      use strict;
      use warnings;
      use open OUT => qw{:encoding(UTF-8) :std};

      use constant {
          SEP1 => '=' x 60 . "\n",
          SEP2 => '-' x 60 . "\n",
          FMT => "%s\tU+%-6X %s\n",
          NO_PRINT => "\N{REPLACEMENT CHARACTER}",
      };

      use Encode qw{decode is_utf8};
      use Unicode::UCD 'charinfo';

      for my $raw_str (@ARGV) {
          my $str = is_utf8($raw_str)
              ? $raw_str
              : decode('UTF-8', $raw_str);
          print "\n", SEP1;
          print "String: '$str'\n";
          print SEP1;
          for my $char (split //, $str) {
              my $code_point = ord $char;
              my $char_info = charinfo($code_point);
              if (! defined $char_info) {
                  $char_info->{name} = "<unknown> Perl $^V supports Unicode "
                      . Unicode::UCD::UnicodeVersion();
              }
              printf FMT, ($char =~ /^\p{Print}$/ ? $char : NO_PRINT),
                  $code_point, $char_info->{name};
          }
          print SEP2;
      }

      Here's a test run with just uparse:

      $ uparse 👮🏼 👮🏼‍♀️ 👮🏼‍♂️
      
      ============================================================
      String: '👮🏼'
      ============================================================
      👮      U+1F46E  POLICE OFFICER
      🏼      U+1F3FC  EMOJI MODIFIER FITZPATRICK TYPE-3
      ------------------------------------------------------------
      
      ============================================================
      String: '👮🏼‍♀️'
      ============================================================
      👮      U+1F46E  POLICE OFFICER
      🏼      U+1F3FC  EMOJI MODIFIER FITZPATRICK TYPE-3
              U+200D   ZERO WIDTH JOINER
      ♀       U+2640   FEMALE SIGN
              U+FE0F   VARIATION SELECTOR-16
      ------------------------------------------------------------
      
      ============================================================
      String: '👮🏼‍♂️'
      ============================================================
      👮      U+1F46E  POLICE OFFICER
      🏼      U+1F3FC  EMOJI MODIFIER FITZPATRICK TYPE-3
              U+200D   ZERO WIDTH JOINER
      ♂       U+2642   MALE SIGN
              U+FE0F   VARIATION SELECTOR-16
      ------------------------------------------------------------
      

      And again, this time with PERL_UNICODE=A:

      $ PERL_UNICODE=A uparse 👮🏼 👮🏼‍♀️ 👮🏼‍♂️
      
      ============================================================
      String: '👮🏼'
      ============================================================
      👮      U+1F46E  POLICE OFFICER
      🏼      U+1F3FC  EMOJI MODIFIER FITZPATRICK TYPE-3
      ------------------------------------------------------------
      
      ============================================================
      String: '👮🏼‍♀️'
      ============================================================
      👮      U+1F46E  POLICE OFFICER
      🏼      U+1F3FC  EMOJI MODIFIER FITZPATRICK TYPE-3
              U+200D   ZERO WIDTH JOINER
      ♀       U+2640   FEMALE SIGN
              U+FE0F   VARIATION SELECTOR-16
      ------------------------------------------------------------
      
      ============================================================
      String: '👮🏼‍♂️'
      ============================================================
      👮      U+1F46E  POLICE OFFICER
      🏼      U+1F3FC  EMOJI MODIFIER FITZPATRICK TYPE-3
              U+200D   ZERO WIDTH JOINER
      ♂       U+2642   MALE SIGN
              U+FE0F   VARIATION SELECTOR-16
      ------------------------------------------------------------
      

      Using "PERL_UNICODE=SDAL" gives the same output as "PERL_UNICODE=A".

      — Ken

Re: uparse - Parse Unicode strings
by jo37 (Deacon) on Nov 19, 2023 at 21:52 UTC

    I don't know what is wrong with my locale setup. Neither uparse nor uchar work on my old perl 5.032001 on Debian 11.

    $ ./uparse.pl äöü
    
    ============================================================
    String: '���'
    ============================================================
    �	U+FFFD   REPLACEMENT CHARACTER
    �	U+FFFD   REPLACEMENT CHARACTER
    �	U+FFFD   REPLACEMENT CHARACTER
    ------------------------------------------------------------
    
    $ ./uchar.pl -v äöü
    � U0fffd \N{REPLACEMENT CHARACTER}
    � U0fffd \N{REPLACEMENT CHARACTER}
    � U0fffd \N{REPLACEMENT CHARACTER}
    
    

    Removing decode from uparse.pl resolves the problem:

    $ ./uparse.pl äöü

    ============================================================
    String: 'äöü'
    ============================================================
    ä       U+E4     LATIN SMALL LETTER A WITH DIAERESIS
    ö       U+F6     LATIN SMALL LETTER O WITH DIAERESIS
    ü       U+FC     LATIN SMALL LETTER U WITH DIAERESIS
    ------------------------------------------------------------

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

      Thanks for the feedback. I don't have a Debian available; I'm running Cygwin with Perlbrew and was able to wind back to v5.32.0 (the closest I have to your v5.32.1). Under that version I have Unicode::UCD 0.75 and Encode 3.06 — what do you have? Here's a few tests.

      $ perl -v | head -2 | tail -1
      This is perl 5, version 32, subversion 0 (v5.32.0) built for cygwin-thread-multi

      I saw the three vowels (WITH DIAERESIS) on the web page. They didn't change when I pasted them onto my command line; nor in the uparse output. However, when I pasted the results back here:

      $ uparse äöü
      
      ============================================================
      String: 'äöü'
      ============================================================
      ä       U+E4     LATIN SMALL LETTER A WITH DIAERESIS
      ö       U+F6     LATIN SMALL LETTER O WITH DIAERESIS
      ü       U+FC     LATIN SMALL LETTER U WITH DIAERESIS
      ------------------------------------------------------------
      

      And just so that you know what I'm seeing:

      $ uparse äöü
      
      ============================================================
      String: 'äöü'
      ============================================================
      Ã       U+C3     LATIN CAPITAL LETTER A WITH TILDE
      ¤       U+A4     CURRENCY SIGN
      Ã       U+C3     LATIN CAPITAL LETTER A WITH TILDE
      ¶       U+B6     PILCROW SIGN
      Ã       U+C3     LATIN CAPITAL LETTER A WITH TILDE
      ¼       U+BC     VULGAR FRACTION ONE QUARTER
      ------------------------------------------------------------
      

      There were no surprises with my other tests.

      $ uparse ���
      
      ============================================================
      String: '���'
      ============================================================
      �       U+FFFD   REPLACEMENT CHARACTER
      �       U+FFFD   REPLACEMENT CHARACTER
      �       U+FFFD   REPLACEMENT CHARACTER
      ------------------------------------------------------------
      
      $ uparse 👨‍🦳‍👧‍👦
      
      ============================================================
      String: '👨‍🦳‍👧‍👦'
      ============================================================
      👨      U+1F468  MAN
              U+200D   ZERO WIDTH JOINER
      🦳      U+1F9B3  EMOJI COMPONENT WHITE HAIR
              U+200D   ZERO WIDTH JOINER
      👧      U+1F467  GIRL
              U+200D   ZERO WIDTH JOINER
      👦      U+1F466  BOY
      ------------------------------------------------------------
      
      $ uparse 👨🏽‍✈️
      
      ============================================================
      String: '👨🏽‍✈️'
      ============================================================
      👨      U+1F468  MAN
      🏽      U+1F3FD  EMOJI MODIFIER FITZPATRICK TYPE-4
              U+200D   ZERO WIDTH JOINER
      ✈       U+2708   AIRPLANE
              U+FE0F   VARIATION SELECTOR-16
      ------------------------------------------------------------
      
      $ uparse X🩼X
      
      ============================================================
      String: 'X🩼X'
      ============================================================
      X       U+58     LATIN CAPITAL LETTER X
      �       U+1FA7C  <unknown> Perl v5.32.0 supports Unicode 13.0.0
      X       U+58     LATIN CAPITAL LETTER X
      ------------------------------------------------------------
      
      $ uparse `perl -C -e 'print "X\x{1fa7d}X"'`
      
      ============================================================
      String: 'X🩽X'
      ============================================================
      X       U+58     LATIN CAPITAL LETTER X
      �       U+1FA7D  <unknown> Perl v5.32.0 supports Unicode 13.0.0
      X       U+58     LATIN CAPITAL LETTER X
      ------------------------------------------------------------
      

      You mentioned "locale setup" but didn't say what you have. I have:

      LANG=en_AU.UTF-8
      LC_ALL=en_AU.UTF-8
      LC_COLLATE=en_AU.UTF-8
      LC_CTYPE=en_AU.UTF-8
      LC_MESSAGES=en_AU.UTF-8
      LC_MONETARY=en_AU.UTF-8
      LC_NUMERIC=en_AU.UTF-8
      LC_TIME=en_AU.UTF-8

      That's the best I can do. Perhaps someone with the same O/S and Perl version as you can shed more light on your problem.

      — Ken

      Just as something of a sanity test for me, and perhaps a test you could try for yourself, here's uparse with its argument taken from three different sources not associated with PerlMonks.

      $ perlbrew switch perl-5.32.0
      $ perl -v | head -2 | tail -1
      This is perl 5, version 32, subversion 0 (v5.32.0) built for cygwin-thread-multi

      Copy-pasted from the Unicode PDF code chart "C1 Controls and Latin-1 Supplement (Range: 0080-00FF)":

      $ uparse äöü
      
      ============================================================
      String: 'äöü'
      ============================================================
      ä       U+E4     LATIN SMALL LETTER A WITH DIAERESIS
      ö       U+F6     LATIN SMALL LETTER O WITH DIAERESIS
      ü       U+FC     LATIN SMALL LETTER U WITH DIAERESIS
      ------------------------------------------------------------
      

      Generated directly from a perl command:

      $ uparse `perl -C -e 'print "\x{e4}\x{f6}\x{fc}"'`
      
      ============================================================
      String: 'äöü'
      ============================================================
      ä       U+E4     LATIN SMALL LETTER A WITH DIAERESIS
      ö       U+F6     LATIN SMALL LETTER O WITH DIAERESIS
      ü       U+FC     LATIN SMALL LETTER U WITH DIAERESIS
      ------------------------------------------------------------
      

      Generated separately then copy-pasted as an argument to uparse:

      $ perl -C -e 'print "\N{LATIN SMALL LETTER A WITH DIAERESIS}\N{LATIN SMALL LETTER O WITH DIAERESIS}\N{LATIN SMALL LETTER U WITH DIAERESIS}"'
      äöü
      
      $ uparse äöü
      
      ============================================================
      String: 'äöü'
      ============================================================
      ä       U+E4     LATIN SMALL LETTER A WITH DIAERESIS
      ö       U+F6     LATIN SMALL LETTER O WITH DIAERESIS
      ü       U+FC     LATIN SMALL LETTER U WITH DIAERESIS
      ------------------------------------------------------------
      

      — Ken

        Hi Ken!

        I found the reason for the strange behaviour: I didn't even remember, but I have PERL_UNICODE=SDAL set. Without this variable the script works correctly. More specifically, it's the "A" in it. From perlrun:

        A 32 the @ARGV elements are expected to be strings encoded in UTF-8
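A script can at least inspect which of these flags were requested, since perlvar documents ${^UNICODE} as reflecting the -C / PERL_UNICODE setting. The sketch below tests the value 32 from the row quoted above; argv_decoding_requested is a made-up name, and whether the value already accounts for the locale-dependent L flag is not something this sketch establishes.

```perl
use strict;
use warnings;

use constant ARGV_FLAG => 32;   # the "A" row quoted from perlrun

# Hypothetical helper: true if the given flag value has the A bit set.
sub argv_decoding_requested {
    my ($unicode_flags) = @_;
    return ($unicode_flags & ARGV_FLAG) ? 1 : 0;
}

print "A flag set: ", argv_decoding_requested(${^UNICODE}), "\n";
```

Run once plainly and once with PERL_UNICODE=A in the environment to compare the two results.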

        Thank you very much for your investigations!

        Greetings,
        -jo

        $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

      The script assumes your terminal uses UTF-8. However, you are not using a UTF-8 locale. You should look into switching to a UTF-8 locale.

      I didn't notice there were other comments already.
