uparse - Parse Unicode strings

Re: uparse - Parse Unicode strings by Tux (Canon) on Nov 18, 2023 at 10:05 UTC
The penguin is part of my prompt Download `uchar` tux 🐧 uchar --help usage: uchar -v [-m base:count [ -m base:count ] ... uchar -v -f char ... perl 5.38.0 with Unicode 15.0.0 -m show maps -v verbosity -l list GBA characters -f find -F find (only chars supported in current font) -s splash all characters found into a single string -k show matching key combo(s) -d apply random diacricals -e show character encodings (uchar -e -f u_BREVE) -o also show octal version of encoding -E show character decodings (uchar -E fc) -b strip to base -D show codepoints in decimal -c copy found string(s) to clipboard -h also show html entity if available tux 🐧 uchar -v X🩼X X U00058 \N{LATIN CAPITAL LETTER X} 🩼 U1fa7c \N{CRUTCH} X U00058 \N{LATIN CAPITAL LETTER X} tux 🐧 uchar -v U+1f427 🐧 U1f427 \N{PENGUIN} tux 🐧 uchar -e U+1f427 🐧 U1f427 \N{PENGUIN} cp1026 6f cp1047 6f cp37 6f cp424 6f cp500 6f cp875 6f gb12345-raw 22 gb2312-raw 22 hz 22 iso-2022-kr 1b2429435c787b31663432377d iso-ir-165 22 jis0208-raw 20 jis0212-raw 22 ksc5601-raw 22 posix-bc 6f UCS-2BE fffd UCS-2LE fdff UTF-16 feffd83ddc27 UTF-16BE d83ddc27 UTF-16LE 3dd827dc UTF-32 0000feff0001f427 UTF-32BE 0001f427 UTF-32LE 27f40100 UTF-7 2b324433634a772d utf-8-strict f09f90a7 utf8 f09f90a7 tux 🐧 uchar -E f09f90a7 \| grep utf utf-8-strict 🐧 utf8 🐧 (U+1F427) tux 🐧 uchar -Fk "L WITH STROKE" Searching for (?^u:\bL WITH STROKE\b) 000141 Ł LSTROKE_IDX LATIN CAPITAL LETTER L WITH STROKE #<Multi_key> <L> <minus> #<Multi_key> <minus> <L> <Multi_key> <L> <slash> <Multi_key> <L> <underscore> <Multi_key> <slash> <L> <Multi_key> <underscore> <L> 000142 ł lSTROKE_IDX LATIN SMALL LETTER L WITH STROKE #<Multi_key> <l> <minus> #<Multi_key> <minus> <l> <Multi_key> <l> <slash> <Multi_key> <l> <underscore> <Multi_key> <slash> <l> <Multi_key> <underscore> <l> `tux $ perl -CEO -wE'say "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}"'` [download] 👨🏽✈️ `tux $ raku -e'"\x[1F468]\x[1F3FD]\x[200D]\x[2708]\x[FE0F]".say'` [download] 👨🏽✈️ `tux $ raku 
-e'"\x[1F468]\x[1F3FD]\x[200D]\x[2708]\x[FE0F]".say' | xargs uchar -v` 👨 U1f468 \N{MAN} 🏽 U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4} U0200d \N{ZERO WIDTH JOINER} ✈ U02708 \N{AIRPLANE} ️ U0fe0f \N{VARIATION SELECTOR-16} Enjoy, Have FUN! H.Merijn
Re^2: uparse - Parse Unicode strings by kcott (Archbishop) on Nov 18, 2023 at 13:55 UTC
That's a very comprehensive solution with substantially more functionality than I needed. It probably deserves its own CUFP page. — Ken
Re^2: uparse - Parse Unicode strings by eyepopslikeamosquito (Archbishop) on Nov 19, 2023 at 02:50 UTC
Wow, very impressive! ... agree with kcott that it deserves its own CUFP page. I played briefly with your command on Ubuntu using `perl v5.38`: ~/pm/Tux$ perl -CEO -wE'say "\x{1F468}\x{1F3FD}\x{200D}\x{2708}\x{FE0F}"' 👨🏽‍✈️ ~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F' 👨🏽‍✈️ AFAICT, the output from the `perl -CEO` and the bash `echo -e` commands above is identical, namely: `👨🏽‍✈️` Running this command produced useful output (that seems to match yours), despite the error message: ~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F' | xargs uchar -v Can't exec "locate": No such file or directory at ~/pm/Tux/uchar line 103. 👨 U1f468 \N{MAN} 🏽 U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4} ‍ U0200d \N{ZERO WIDTH JOINER} ✈ U02708 \N{AIRPLANE} ️ U0fe0f \N{VARIATION SELECTOR-16} Using `CODE` blocks instead of `pre`: `~/pm/Tux$ echo -e '\U1F468\U1F3FD\U200D\U2708\UFE0F' | xargs uchar -v Can't exec "locate": No such file or directory at ~/pm/Tux/uchar line 103. 👨 U1f468 \N{MAN} 🏽 U1f3fd \N{EMOJI MODIFIER FITZPATRICK TYPE-4} ‍ U0200d \N{ZERO WIDTH JOINER} ✈ U02708 \N{AIRPLANE} ️ U0fe0f \N{VARIATION SELECTOR-16}` 👁️🍾👍🦟
Re^3: uparse - Parse Unicode strings by Tux (Canon) on Nov 20, 2023 at 08:54 UTC
Fetch again. Now guarded. /me wonders how people work on a devel machine without mlocate :) Enjoy, Have FUN! H.Merijn
Re^4: uparse - Parse Unicode strings (locate/find/xargs) by eyepopslikeamosquito (Archbishop) on Dec 02, 2023 at 10:07 UTC
Re^5: uparse - Parse Unicode strings by hippo (Bishop) on Dec 02, 2023 at 10:51 UTC
Re^5: uparse - Parse Unicode strings by kcott (Archbishop) on Dec 02, 2023 at 11:01 UTC
Re^4: uparse - Parse Unicode strings by eyepopslikeamosquito (Archbishop) on Nov 20, 2023 at 11:57 UTC
Re: uparse - Parse Unicode strings by ikegami (Patriarch) on Nov 19, 2023 at 04:43 UTC
See also: `unichars` and `uniprops` from Unicode::Tussle. $ unichars '\p{Emoji}' | wc -l 178 $ unichars '\p{Emoji}' | head -n 30 | tail -n 20 8 U+0038 DIGIT EIGHT ‭ 9 U+0039 DIGIT NINE ‭ © U+00A9 COPYRIGHT SIGN ‭ ® U+00AE REGISTERED SIGN ‭ ‼ U+203C DOUBLE EXCLAMATION MARK ‭ ⁉ U+2049 EXCLAMATION QUESTION MARK ‭ ™ U+2122 TRADE MARK SIGN ‭ ℹ U+2139 INFORMATION SOURCE ‭ ↔ U+2194 LEFT RIGHT ARROW ‭ ↕ U+2195 UP DOWN ARROW ‭ ↖ U+2196 NORTH WEST ARROW ‭ ↗ U+2197 NORTH EAST ARROW ‭ ↘ U+2198 SOUTH EAST ARROW ‭ ↙ U+2199 SOUTH WEST ARROW ‭ ↩ U+21A9 LEFTWARDS ARROW WITH HOOK ‭ ↪ U+21AA RIGHTWARDS ARROW WITH HOOK ‭ ⌚ U+231A WATCH ‭ ⌛ U+231B HOURGLASS ‭ ⌨ U+2328 KEYBOARD ‭ ⏏ U+23CF EJECT SYMBOL $ uniprops 🧑 U+1F9D1 ‹🧑› \N{ADULT} \pS \p{So} All Any Assigned Common Zyyy EBase Emoji_Modifier_Base Emoji Emoji_Presentation EPres Extended_Pictographic ExtPict So S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Symbol Print X_POSIX_Print Symbol Sup_Symbols_And_Pictographs Supplemental_Symbols_And_Pictographs InSupSymbolsAndPictographs Unicode $ uniprops U+1F9D1 U+1F9D1 ‹🧑› \N{ADULT} \pS \p{So} All Any Assigned Common Zyyy EBase Emoji_Modifier_Base Emoji Emoji_Presentation EPres Extended_Pictographic ExtPict So S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Symbol Print X_POSIX_Print Symbol Sup_Symbols_And_Pictographs Supplemental_Symbols_And_Pictographs InSupSymbolsAndPictographs Unicode
Re^2: uparse - Parse Unicode strings by kcott (Archbishop) on Nov 19, 2023 at 11:21 UTC
Thanks for that. There's a huge amount of documentation to go through. I've had a brief look and it seems like there are a number of very useful tools. — Ken
Re^2: uparse - Parse Unicode strings (uparse/uchar/unichars/uniprops) by eyepopslikeamosquito (Archbishop) on Nov 21, 2023 at 00:55 UTC
"See also: `unichars` and `uniprops` from `Unicode::Tussle`" Thanks! I finally got around to installing Unicode::Tussle on Ubuntu `perl v5.38` and am pleased to report that all your examples worked fine for me, albeit with a harmless-looking `"charnames: some short character names may clash in [GREEK, LATIN], for example GAMMA"` warning written to stderr. I'm now spoilt for choice, with three different working Unicode tools to choose from: (1) uparse - Parse Unicode strings (updated here): the `uparse` command by kcott; (2) Re: uparse - Parse Unicode strings (get from `uchar`): the `uchar` command by Tux; (3) Re: uparse - Parse Unicode strings: the `unichars` and `uniprops` commands from Unicode::Tussle, mentioned by ikegami. 👁️🍾👍🦟
Re: uparse - Parse Unicode strings by hippo (Bishop) on Nov 18, 2023 at 10:19 UTC
`BEGIN { if ($] < 5.007003) { warn "$0 requires Perl v5.7.3 or later.\n"; exit; } unless (@ARGV) { warn "Usage: $0 string [string ...]\n"; exit; } }` I'm intrigued as to why you would warn and then immediately exit instead of just die, e.g. `BEGIN { die "$0 requires Perl v5.7.3 or later.\n" if $] < 5.007003; die "Usage: $0 string [string ...]\n" unless @ARGV; }` Please enlighten me? 🦛
Re^2: uparse - Parse Unicode strings by kcott (Archbishop) on Nov 18, 2023 at 13:40 UTC
With my original code, the messages look like this: `$ uparse /home/ken/local/bin/uparse requires Perl v5.7.3 or later. $ uparse Usage: /home/ken/local/bin/uparse string [string ...]` With your suggestion, the messages look like this: `$ uparse /home/ken/local/bin/uparse requires Perl v5.7.3 or later. BEGIN failed--compilation aborted at /home/ken/local/bin/uparse line 15. $ uparse Usage: /home/ken/local/bin/uparse string [string ...] BEGIN failed--compilation aborted at /home/ken/local/bin/uparse line 15.` I didn't want the "`BEGIN failed--compilation aborted at ...`" lines. — Ken
Re^3: uparse - Parse Unicode strings by hippo (Bishop) on Nov 18, 2023 at 14:47 UTC
Thanks - I understand now. It's for neatness of output (well, stderr really), and it is only an issue because of the BEGIN block, which is itself necessary so that the version check fires before we hit newer syntax/features. 🦛
Re: uparse - Parse Unicode strings by eyepopslikeamosquito (Archbishop) on Nov 18, 2023 at 10:15 UTC
Brilliant work kcott! Everything I've tested so far works like a charm on my Ubuntu Linux VM (running `perl v5.38.0` built from source as described here). A lot more convenient than the crude hack I was using, namely clicking on the little `xml` link on a post to see the decimal values of the Unicode emojis. For example, clicking on the `xml` link on your post now allows me to see: `... difficult to tell them apart; e.g. <tt>🧑</tt> & <tt>&#128104;</tt>.` which I can then crudely translate back and forth between hex and decimal via one-liners such as: `C:\> perl -e "printf q{%X}, 129489" 1F9D1 C:\> perl -e "printf q{%d}, 0x1F9D1" 129489` That was working fine until the Discipulus posted an emoji to me in the Chatterbox the other day ... and, oops, there was no `xml` link to click on! :) 👁️🍾👍🦟
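The hex/decimal round trip in those one-liners can also be wrapped in a pair of small helpers. This is only a sketch: it assumes the input is a decimal numeric character reference of the kind the `xml` view shows, and the sub names `entity_to_codepoint` and `codepoint_to_entity` are made up for illustration.

```perl
use strict;
use warnings;

# Convert a decimal HTML entity such as "&#129489;" to "U+1F9D1".
# Returns nothing (undef in scalar context) if the input is not a decimal entity.
sub entity_to_codepoint {
    my ($entity) = @_;
    my ($dec) = $entity =~ /\A&#(\d+);\z/ or return;
    return sprintf 'U+%04X', $dec;
}

# Convert "U+1F9D1" (or a bare "1F9D1") back to "&#129489;".
sub codepoint_to_entity {
    my ($cp) = @_;
    my ($hex) = $cp =~ /\A(?:U\+)?([0-9A-Fa-f]+)\z/ or return;
    return sprintf '&#%d;', hex $hex;
}

print entity_to_codepoint('&#129489;'), "\n";
print codepoint_to_entity('U+1F9D1'),   "\n";
```

Hexadecimal entities (`&#x1F9D1;`) are deliberately not handled here; they would need a second pattern.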
Re^2: uparse - Parse Unicode strings by kcott (Archbishop) on Nov 18, 2023 at 15:15 UTC
I'm glad you liked it. It was actually prompted when looking at "Emojis for Perl Monk names" and being unable to determine what the emoji for `tye` was. Now that I know, it seems obvious: $ uparse 👔 ============================================================ String: '👔' ============================================================ 👔 U+1F454 NECKTIE ------------------------------------------------------------ The emoji for `gellyfish` didn't even render for me; but I was still able to get information about it. $ uparse 🪼 ============================================================ String: '🪼' ============================================================ 🪼 U+1FABC JELLYFISH ------------------------------------------------------------ There's also things like the emoji for `GrandFather`, which I can only select as a single entity, but would benefit from some analysis. $ uparse 👨‍🦳‍👧‍👦 ============================================================ String: '👨‍🦳‍👧‍👦' ============================================================ 👨 U+1F468 MAN U+200D ZERO WIDTH JOINER 🦳 U+1F9B3 EMOJI COMPONENT WHITE HAIR U+200D ZERO WIDTH JOINER 👧 U+1F467 GIRL U+200D ZERO WIDTH JOINER 👦 U+1F466 BOY ------------------------------------------------------------ Maybe at some future point we can add the white hair to this family setting: $ uparse 👨‍👧‍👦 ============================================================ String: '👨‍👧‍👦' ============================================================ 👨 U+1F468 MAN U+200D ZERO WIDTH JOINER 👧 U+1F467 GIRL U+200D ZERO WIDTH JOINER 👦 U+1F466 BOY ------------------------------------------------------------ Although, maybe you can already do this with your Win11 Segoe UI Emoji font. Can you? — Ken	[reply] [d/l] [select]
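For readers who want to see the mechanics behind such a breakdown without the full script, here is a minimal sketch. It assumes the string has already been decoded to characters, uses `charinfo` from Unicode::UCD for the names, and the sub name `describe_codepoints` is hypothetical; the output format is simplified and not identical to `uparse`.

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Return one "U+XXXX NAME" string per code point of an already-decoded string.
sub describe_codepoints {
    my ($str) = @_;
    return map {
        my $cp   = ord $_;
        my $info = charinfo($cp);    # undef if this perl's Unicode lacks the code point
        sprintf 'U+%04X %s', $cp, defined $info ? $info->{name} : '<unknown>';
    } split //, $str;
}

# MAN, ZERO WIDTH JOINER, GIRL, ZERO WIDTH JOINER, BOY
my $family = "\x{1F468}\x{200D}\x{1F467}\x{200D}\x{1F466}";
print "$_\n" for describe_codepoints($family);
```

The output is kept ASCII-only on purpose, so no output layer is needed to run it.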
Re^3: uparse - Parse Unicode strings by eyepopslikeamosquito (Archbishop) on Nov 19, 2023 at 07:06 UTC
Maybe at some future point we can add the white hair to this family setting ... maybe you can already do this with your Win11 `Segoe UI Emoji` font. Can you? You read me like a book, that's exactly what I was trying to do! :) ... and was bitterly disappointed when it didn't work. For completeness, I ran a simple standalone test using Windows 11 PowerShell. `PS C:\> $joiner = [char]::ConvertFromUtf32(0x200D) PS C:\> $man = [char]::ConvertFromUtf32(0x1F468) PS C:\> $girl = [char]::ConvertFromUtf32(0x1F467) PS C:\> $boy = [char]::ConvertFromUtf32(0x1F466) PS C:\> $whitehair = [char]::ConvertFromUtf32(0x1F9B3)` [download] PS C:\> "$man$joiner$girl$joiner$boy" 👨‍👧‍👦 PS C:\> "$man$joiner$whitehair$joiner$girl$joiner$boy" 👨‍🦳‍👧‍👦 Running equivalent test on Ubuntu `bash` with `echo -e` produced the same depressing result. It seems you can enjoy a family emoji with a default man, but not a man with white hair. Maybe a Unicode emoji expert knows how to do it, but I don't. 👁️🍾👍🦟	[reply] [d/l] [select]
Re^4: uparse - Parse Unicode strings by kcott (Archbishop) on Nov 19, 2023 at 11:05 UTC
Re^5: uparse - Parse Unicode strings by eyepopslikeamosquito (Archbishop) on Nov 19, 2023 at 12:11 UTC
Decoding @ARGV [Was: uparse - Parse Unicode strings] by jo37 (Deacon) on Nov 22, 2023 at 20:38 UTC
Hi Ken! I tried to find a general solution to the problem reported in Re: uparse - Parse Unicode strings. Short explanation of the problem: There are two basic ways to get correct Unicode input from the elements in `@ARGV`: (1) implicit decoding with the runtime option `-CA` or the environment setting `PERL_UNICODE=A`; (2) explicit decoding using `Encode::decode`. Either may be used, but not both. A script that expects Unicode data from `@ARGV` cannot easily detect whether the implicit decoding is in effect, especially because `-CAL` makes the behaviour locale-dependent. The best solution I could find is to check whether the data in question is already flagged as UTF-8. `Encode::is_utf8` (or the equivalent `utf8::is_utf8`) may be used to check this flag, which results in a small modification to your script: `diff --git a/uparse b/uparse index f5edb92..b05e12a 100755 --- a/uparse +++ b/uparse @@ -23,11 +23,11 @@ use constant { NO_PRINT => "\N{REPLACEMENT CHARACTER}", }; -use Encode 'decode'; +use Encode qw(decode is_utf8); use Unicode::UCD 'charinfo'; for my $raw_str (@ARGV) { - my $str = decode('UTF-8', $raw_str); + my $str = is_utf8($raw_str) ? $raw_str : decode('UTF-8', $raw_str); print "\n", SEP1; print "String: '$str'\n"; print SEP1;` What do you think about this? Greetings, -jo `$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
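The guard in that patch can be tried in isolation. The following is a sketch rather than part of `uparse`: the sub name `decode_arg` is made up, and the byte string stands in for an undecoded `@ARGV` element.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Decode an argument unless it already carries Perl's internal UTF-8 flag
# (as it may after implicit decoding via -CA or PERL_UNICODE=A).
sub decode_arg {
    my ($raw) = @_;
    return is_utf8($raw) ? $raw : decode('UTF-8', $raw);
}

my $bytes = "\xC3\xA4";            # UTF-8 encoding of U+00E4
my $once  = decode_arg($bytes);    # decoded here
my $twice = decode_arg($once);     # flag already set, so returned unchanged
printf "U+%04X U+%04X\n", ord($once), ord($twice);
```

Note that the flag is an internal detail rather than a statement about the data, so this is a pragmatic check and not a guarantee. Pure-ASCII arguments decode to themselves either way.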
Re: Decoding @ARGV [Was: uparse - Parse Unicode strings] by kcott (Archbishop) on Nov 23, 2023 at 07:14 UTC
++ Thanks for your analysis and patch. I had planned, assuming I had sufficient time to spare, to have a further look at `uparse` this weekend and get it to work on different platforms, and in various environments. What you've provided is a good start and helps a lot. — Ken
Re: Decoding @ARGV [Was: uparse - Parse Unicode strings] by kcott (Archbishop) on Nov 24, 2023 at 23:45 UTC
[A follow-up to "Re: Decoding @ARGV [Was: uparse - Parse Unicode strings]".] I'm not going to have sufficient spare time to do all that I wanted this weekend. I have managed to incorporate your changes and do a couple of other minor things. When prefixing the `uparse` command with `PERL_UNICODE=A` or `PERL_UNICODE=SDAL`, I just get "`Wide character at ...`" and no other output. I made these changes: Added changes from your patch. Changed "`use open IO ...`" to "`use open OUT ...`". Modified the code layout (mostly to avoid wrapping in PM). Here's the new code: #!/usr/bin/env perl BEGIN { if ($] < 5.007003) { warn "$0 requires Perl v5.7.3 or later.\n"; exit; } unless (@ARGV) { warn "Usage: $0 string [string ...]\n"; exit; } } use 5.007003; use strict; use warnings; use open OUT => qw{:encoding(UTF-8) :std}; use constant { SEP1 => '=' x 60 . "\n", SEP2 => '-' x 60 . "\n", FMT => "%s\tU+%-6X %s\n", NO_PRINT => "\N{REPLACEMENT CHARACTER}", }; use Encode qw{decode is_utf8}; use Unicode::UCD 'charinfo'; for my $raw_str (@ARGV) { my $str = is_utf8($raw_str) ? $raw_str : decode('UTF-8', $raw_str); print "\n", SEP1; print "String: '$str'\n"; print SEP1; for my $char (split //, $str) { my $code_point = ord $char; my $char_info = charinfo($code_point); if (! defined $char_info) { $char_info->{name} = "<unknown> Perl $^V supports Unicode " . Unicode::UCD::UnicodeVersion(); } printf FMT, ($char =~ /^\p{Print}$/ ? 
$char : NO_PRINT), $code_point, $char_info->{name}; } print SEP2; } [download] Here's a test run with just `uparse`: $ uparse 👮🏼 👮🏼‍♀️ 👮🏼‍♂️ ============================================================ String: '👮🏼' ============================================================ 👮 U+1F46E POLICE OFFICER 🏼 U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 ------------------------------------------------------------ ============================================================ String: '👮🏼‍♀️' ============================================================ 👮 U+1F46E POLICE OFFICER 🏼 U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 U+200D ZERO WIDTH JOINER ♀ U+2640 FEMALE SIGN U+FE0F VARIATION SELECTOR-16 ------------------------------------------------------------ ============================================================ String: '👮🏼‍♂️' ============================================================ 👮 U+1F46E POLICE OFFICER 🏼 U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 U+200D ZERO WIDTH JOINER ♂ U+2642 MALE SIGN U+FE0F VARIATION SELECTOR-16 ------------------------------------------------------------ And again, this time with `PERL_UNICODE=A`: $ PERL_UNICODE=A uparse 👮🏼 👮🏼‍♀️ 👮🏼‍♂️ ============================================================ String: '👮🏼' ============================================================ 👮 U+1F46E POLICE OFFICER 🏼 U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 ------------------------------------------------------------ ============================================================ String: '👮🏼‍♀️' ============================================================ 👮 U+1F46E POLICE OFFICER 🏼 U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 U+200D ZERO WIDTH JOINER ♀ U+2640 FEMALE SIGN U+FE0F VARIATION SELECTOR-16 ------------------------------------------------------------ ============================================================ String: '👮🏼‍♂️' ============================================================ 👮 U+1F46E POLICE OFFICER 🏼 U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 U+200D ZERO 
WIDTH JOINER ♂ U+2642 MALE SIGN U+FE0F VARIATION SELECTOR-16 ------------------------------------------------------------ Using "`PERL_UNICODE=SDAL`" gives the same output as "`PERL_UNICODE=A`". — Ken	[reply] [d/l] [select]
Re: uparse - Parse Unicode strings by jo37 (Deacon) on Nov 19, 2023 at 21:52 UTC
I don't know what is wrong with my locale setup. Neither `uparse` nor `uchar` works on my old perl 5.032001 on Debian 11. $ ./uparse.pl äöü ============================================================ String: '��' ============================================================ � U+FFFD REPLACEMENT CHARACTER � U+FFFD REPLACEMENT CHARACTER � U+FFFD REPLACEMENT CHARACTER ------------------------------------------------------------ $ ./uchar.pl -v �� U0fffd \N{REPLACEMENT CHARACTER} � U0fffd \N{REPLACEMENT CHARACTER} � U0fffd \N{REPLACEMENT CHARACTER} Removing `decode` from `uparse.pl` resolves the problem: `$ ./uparse.pl äöü ============================================================ String: 'äöü' ============================================================ ä U+E4 LATIN SMALL LETTER A WITH DIAERESIS ö U+F6 LATIN SMALL LETTER O WITH DIAERESIS ü U+FC LATIN SMALL LETTER U WITH DIAERESIS ------------------------------------------------------------` Greetings, -jo `$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
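One way output like this can arise is decoding the same data twice, for example if the arguments were already decoded before the script's own `decode` call ran. Whether that is what happened here is an assumption at this point in the thread. The sketch below reproduces the effect in isolation; the helper name `codepoints` is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Render each character of a string as U+XXXX, keeping the output ASCII-only.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

my $bytes = "\xC3\xA4\xC3\xB6\xC3\xBC";    # UTF-8 bytes for a, o, u with diaeresis
my $once  = decode('UTF-8', $bytes);       # correct: three characters
my $twice = decode('UTF-8', $once);        # E4 F6 FC are not valid UTF-8 bytes

print 'decoded once:  ', codepoints($once),  "\n";
print 'decoded twice: ', codepoints($twice), "\n";
```

With the default error handling, `decode` substitutes U+FFFD for the malformed input on the second pass instead of dying, which matches the REPLACEMENT CHARACTER rows above.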
Re^2: uparse - Parse Unicode strings by kcott (Archbishop) on Nov 20, 2023 at 00:19 UTC
Thanks for the feedback. I don't have a Debian available; I'm running Cygwin with Perlbrew and was able to wind back to `v5.32.0` (the closest I have to your `v5.32.1`). Under that version I have Unicode::UCD 0.75 and Encode 3.06. What do you have? Here are a few tests. `$ perl -v | head -2 | tail -1 This is perl 5, version 32, subversion 0 (v5.32.0) built for cygwin-thread-multi` I saw the three vowels (`WITH DIAERESIS`) on the web page. They didn't change when I pasted them onto my command line; nor in the `uparse` output. However, when I pasted the results back here: $ uparse äöü ============================================================ String: 'äöü' ============================================================ ä U+E4 LATIN SMALL LETTER A WITH DIAERESIS ö U+F6 LATIN SMALL LETTER O WITH DIAERESIS ü U+FC LATIN SMALL LETTER U WITH DIAERESIS ------------------------------------------------------------ And just so that you know what I'm seeing: $ uparse äöü ============================================================ String: 'äöü' ============================================================ Ã U+C3 LATIN CAPITAL LETTER A WITH TILDE ¤ U+A4 CURRENCY SIGN Ã U+C3 LATIN CAPITAL LETTER A WITH TILDE ¶ U+B6 PILCROW SIGN Ã U+C3 LATIN CAPITAL LETTER A WITH TILDE ¼ U+BC VULGAR FRACTION ONE QUARTER ------------------------------------------------------------ There were no surprises with my other tests.
$ uparse �� ============================================================ String: '��' ============================================================ � U+FFFD REPLACEMENT CHARACTER � U+FFFD REPLACEMENT CHARACTER � U+FFFD REPLACEMENT CHARACTER ------------------------------------------------------------ $ uparse 👨‍🦳‍👧‍👦 ============================================================ String: '👨‍🦳‍👧‍👦' ============================================================ 👨 U+1F468 MAN U+200D ZERO WIDTH JOINER 🦳 U+1F9B3 EMOJI COMPONENT WHITE HAIR U+200D ZERO WIDTH JOINER 👧 U+1F467 GIRL U+200D ZERO WIDTH JOINER 👦 U+1F466 BOY ------------------------------------------------------------ $ uparse 👨🏽‍✈️ ============================================================ String: '👨🏽‍✈️' ============================================================ 👨 U+1F468 MAN 🏽 U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4 U+200D ZERO WIDTH JOINER ✈ U+2708 AIRPLANE U+FE0F VARIATION SELECTOR-16 ------------------------------------------------------------ $ uparse X🩼X ============================================================ String: 'X🩼X' ============================================================ X U+58 LATIN CAPITAL LETTER X � U+1FA7C <unknown> Perl v5.32.0 supports Unicode 13.0.0 X U+58 LATIN CAPITAL LETTER X ------------------------------------------------------------ $ uparse `perl -C -e 'print "X\x{1fa7d}X"'` ============================================================ String: 'X🩽X' ============================================================ X U+58 LATIN CAPITAL LETTER X � U+1FA7D <unknown> Perl v5.32.0 supports Unicode 13.0.0 X U+58 LATIN CAPITAL LETTER X ------------------------------------------------------------ You mentioned "locale setup" but didn't say what you have. I have: `LANG=en_AU.UTF-8 LC_ALL=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 LC_CTYPE=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_NUMERIC=en_AU.UTF-8 LC_TIME=en_AU.UTF-8` [download] That's the best I can do. 
Perhaps someone with the same O/S and Perl version as you can shed more light on your problem. — Ken	[reply] [d/l] [select]
Re^2: uparse - Parse Unicode strings [sanity test] by kcott (Archbishop) on Nov 20, 2023 at 09:30 UTC
Just as something of a sanity test for me, and perhaps a test you could try for yourself, here's `uparse` with its argument taken from three different sources not associated with PerlMonks. `$ perlbrew switch perl-5.32.0 $ perl -v | head -2 | tail -1 This is perl 5, version 32, subversion 0 (v5.32.0) built for cygwin-thread-multi` Copy-pasted from the Unicode PDF code chart "C1 Controls and Latin-1 Supplement (Range: 0080-00FF)": $ uparse äöü ============================================================ String: 'äöü' ============================================================ ä U+E4 LATIN SMALL LETTER A WITH DIAERESIS ö U+F6 LATIN SMALL LETTER O WITH DIAERESIS ü U+FC LATIN SMALL LETTER U WITH DIAERESIS ------------------------------------------------------------ Generated directly from a `perl` command: $ uparse `perl -C -e 'print "\x{e4}\x{f6}\x{fc}"'` ============================================================ String: 'äöü' ============================================================ ä U+E4 LATIN SMALL LETTER A WITH DIAERESIS ö U+F6 LATIN SMALL LETTER O WITH DIAERESIS ü U+FC LATIN SMALL LETTER U WITH DIAERESIS ------------------------------------------------------------ Generated separately then copy-pasted as an argument to `uparse`: $ perl -C -e 'print "\N{LATIN SMALL LETTER A WITH DIAERESIS}\N{LATIN SMALL LETTER O WITH DIAERESIS}\N{LATIN SMALL LETTER U WITH DIAERESIS}"' äöü $ uparse äöü ============================================================ String: 'äöü' ============================================================ ä U+E4 LATIN SMALL LETTER A WITH DIAERESIS ö U+F6 LATIN SMALL LETTER O WITH DIAERESIS ü U+FC LATIN SMALL LETTER U WITH DIAERESIS ------------------------------------------------------------ — Ken
Re^3: uparse - Parse Unicode strings [sanity test] by jo37 (Deacon) on Nov 20, 2023 at 11:14 UTC
Hi Ken! I found the reason for the strange behaviour: I had forgotten that I have `PERL_UNICODE=SDAL` set. Without this variable the script works correctly. More specifically, it's the "A" in it. From perlrun: `A 32 the @ARGV elements are expected to be strings encoded in UTF-8` Thank you very much for your investigations! Greetings, -jo `$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
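A script can at least inspect whether the "A" bit was requested by reading the `${^UNICODE}` variable, which reflects the `-C` / `PERL_UNICODE` setting. A minimal sketch, assuming the bit value 32 quoted from perlrun above; the constant `UNICODE_ARGV` and the sub `argv_flag_set` are names invented for this example.

```perl
use strict;
use warnings;

# Bit value of the "A" flag as listed in perlrun's table for -C.
use constant UNICODE_ARGV => 32;

# True (1) if the given -C / PERL_UNICODE bitmask includes the "A" flag.
sub argv_flag_set {
    my ($bits) = @_;
    return ($bits & UNICODE_ARGV) ? 1 : 0;
}

# ${^UNICODE} holds the bitmask in effect for the running perl.
printf "A flag is %s\n", argv_flag_set(${^UNICODE}) ? 'on' : 'off';
```

This only reports what was requested. When "L" is also set, the effect depends on the locale, so a set bit does not prove that `@ARGV` was actually decoded.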
Re^2: uparse - Parse Unicode strings by ikegami (Patriarch) on Nov 20, 2023 at 14:45 UTC
~~The script assumes your terminal uses UTF-8. However, you are not using a UTF-8 locale. You should look into switching to a UTF-8 locale.~~ I didn't notice there were other comments already.

