Re: uparse - Parse Unicode strings

by jo37 (Deacon)
on Nov 19, 2023 at 21:52 UTC


in reply to uparse - Parse Unicode strings

I don't know what is wrong with my locale setup. Neither uparse nor uchar works on my old perl 5.032001 on Debian 11.

$ ./uparse.pl äöü

============================================================
String: '���'
============================================================
�	U+FFFD   REPLACEMENT CHARACTER
�	U+FFFD   REPLACEMENT CHARACTER
�	U+FFFD   REPLACEMENT CHARACTER
------------------------------------------------------------
$ ./uchar.pl -v äöü
� U0fffd \N{REPLACEMENT CHARACTER}
� U0fffd \N{REPLACEMENT CHARACTER}
� U0fffd \N{REPLACEMENT CHARACTER}

Removing decode from uparse.pl resolves the problem:

$ ./uparse.pl äöü

============================================================
String: 'äöü'
============================================================
ä	U+E4     LATIN SMALL LETTER A WITH DIAERESIS
ö	U+F6     LATIN SMALL LETTER O WITH DIAERESIS
ü	U+FC     LATIN SMALL LETTER U WITH DIAERESIS
------------------------------------------------------------

Greetings,
-jo

$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

Re^2: uparse - Parse Unicode strings
by kcott (Archbishop) on Nov 20, 2023 at 00:19 UTC

    Thanks for the feedback. I don't have a Debian available; I'm running Cygwin with Perlbrew and was able to wind back to v5.32.0 (the closest I have to your v5.32.1). Under that version I have Unicode::UCD 0.75 and Encode 3.06. What do you have? Here are a few tests.

    $ perl -v | head -2 | tail -1
    This is perl 5, version 32, subversion 0 (v5.32.0) built for cygwin-thread-multi

    I saw the three vowels (WITH DIAERESIS) on the web page. They didn't change when I pasted them onto my command line; nor in the uparse output. However, when I pasted the results back here:

    $ uparse äöü
    
    ============================================================
    String: 'äöü'
    ============================================================
    ä       U+E4     LATIN SMALL LETTER A WITH DIAERESIS
    ö       U+F6     LATIN SMALL LETTER O WITH DIAERESIS
    ü       U+FC     LATIN SMALL LETTER U WITH DIAERESIS
    ------------------------------------------------------------
    

    And just so that you know what I'm seeing:

    $ uparse äöü
    
    ============================================================
    String: 'äöü'
    ============================================================
    Ã       U+C3     LATIN CAPITAL LETTER A WITH TILDE
    ¤       U+A4     CURRENCY SIGN
    Ã       U+C3     LATIN CAPITAL LETTER A WITH TILDE
    ¶       U+B6     PILCROW SIGN
    Ã       U+C3     LATIN CAPITAL LETTER A WITH TILDE
    ¼       U+BC     VULGAR FRACTION ONE QUARTER
    ------------------------------------------------------------
    

    There were no surprises with my other tests.

    $ uparse ���
    
    ============================================================
    String: '���'
    ============================================================
    �       U+FFFD   REPLACEMENT CHARACTER
    �       U+FFFD   REPLACEMENT CHARACTER
    �       U+FFFD   REPLACEMENT CHARACTER
    ------------------------------------------------------------
    
    $ uparse 👨‍🦳‍👧‍👦
    
    ============================================================
    String: '👨‍🦳‍👧‍👦'
    ============================================================
    👨      U+1F468  MAN
            U+200D   ZERO WIDTH JOINER
    🦳      U+1F9B3  EMOJI COMPONENT WHITE HAIR
            U+200D   ZERO WIDTH JOINER
    👧      U+1F467  GIRL
            U+200D   ZERO WIDTH JOINER
    👦      U+1F466  BOY
    ------------------------------------------------------------
    
    $ uparse 👨🏽‍✈️
    
    ============================================================
    String: '👨🏽‍✈️'
    ============================================================
    👨      U+1F468  MAN
    🏽      U+1F3FD  EMOJI MODIFIER FITZPATRICK TYPE-4
            U+200D   ZERO WIDTH JOINER
    ✈       U+2708   AIRPLANE
            U+FE0F   VARIATION SELECTOR-16
    ------------------------------------------------------------
    
    $ uparse X🩼X
    
    ============================================================
    String: 'X🩼X'
    ============================================================
    X       U+58     LATIN CAPITAL LETTER X
    �       U+1FA7C  <unknown> Perl v5.32.0 supports Unicode 13.0.0
    X       U+58     LATIN CAPITAL LETTER X
    ------------------------------------------------------------
    
    $ uparse `perl -C -e 'print "X\x{1fa7d}X"'`
    
    ============================================================
    String: 'X🩽X'
    ============================================================
    X       U+58     LATIN CAPITAL LETTER X
    �       U+1FA7D  <unknown> Perl v5.32.0 supports Unicode 13.0.0
    X       U+58     LATIN CAPITAL LETTER X
    ------------------------------------------------------------
    

    You mentioned "locale setup" but didn't say what you have. I have:

    LANG=en_AU.UTF-8
    LC_ALL=en_AU.UTF-8
    LC_COLLATE=en_AU.UTF-8
    LC_CTYPE=en_AU.UTF-8
    LC_MESSAGES=en_AU.UTF-8
    LC_MONETARY=en_AU.UTF-8
    LC_NUMERIC=en_AU.UTF-8
    LC_TIME=en_AU.UTF-8

    That's the best I can do. Perhaps someone with the same O/S and Perl version as you can shed more light on your problem.

    — Ken

Re^2: uparse - Parse Unicode strings [sanity test]
by kcott (Archbishop) on Nov 20, 2023 at 09:30 UTC

    Just as something of a sanity test for me, and perhaps a test you could try for yourself, here's uparse with its argument taken from three different sources not associated with PerlMonks.

    $ perlbrew switch perl-5.32.0
    $ perl -v | head -2 | tail -1
    This is perl 5, version 32, subversion 0 (v5.32.0) built for cygwin-thread-multi

    Copy-pasted from the Unicode PDF code chart "C1 Controls and Latin-1 Supplement (Range: 0080-00FF)":

    $ uparse äöü
    
    ============================================================
    String: 'äöü'
    ============================================================
    ä       U+E4     LATIN SMALL LETTER A WITH DIAERESIS
    ö       U+F6     LATIN SMALL LETTER O WITH DIAERESIS
    ü       U+FC     LATIN SMALL LETTER U WITH DIAERESIS
    ------------------------------------------------------------
    

    Generated directly from a perl command:

    $ uparse `perl -C -e 'print "\x{e4}\x{f6}\x{fc}"'`
    
    ============================================================
    String: 'äöü'
    ============================================================
    ä       U+E4     LATIN SMALL LETTER A WITH DIAERESIS
    ö       U+F6     LATIN SMALL LETTER O WITH DIAERESIS
    ü       U+FC     LATIN SMALL LETTER U WITH DIAERESIS
    ------------------------------------------------------------
    

    Generated separately then copy-pasted as an argument to uparse:

    $ perl -C -e 'print "\N{LATIN SMALL LETTER A WITH DIAERESIS}\N{LATIN SMALL LETTER O WITH DIAERESIS}\N{LATIN SMALL LETTER U WITH DIAERESIS}"'
    äöü
    
    $ uparse äöü
    
    ============================================================
    String: 'äöü'
    ============================================================
    ä       U+E4     LATIN SMALL LETTER A WITH DIAERESIS
    ö       U+F6     LATIN SMALL LETTER O WITH DIAERESIS
    ü       U+FC     LATIN SMALL LETTER U WITH DIAERESIS
    ------------------------------------------------------------
    

    — Ken

      Hi Ken!

      I found the reason for the strange behaviour: I didn't even remember, but I have PERL_UNICODE=SDAL set. Without this variable the script works correctly. More specifically, it's the "A" in it. From perlrun:

          A     32   the @ARGV elements are expected to be strings encoded in UTF-8

      Thank you very much for your investigations!

      Greetings,
      -jo

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
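
The diagnosis above (the "A" in PERL_UNICODE makes perl decode @ARGV itself, and the script's own decode then runs a second time) can be reproduced in isolation. The following is only a minimal sketch and is not taken from uparse: the helper names arg_to_chars and codepoints are hypothetical, and it assumes the script decodes its arguments with Encode's decode('UTF-8', ...). Bit 32 of ${^UNICODE} corresponds to the "A" flag quoted from perlrun.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of uparse): decode an argument only when
# perl has not already decoded @ARGV for us.
sub arg_to_chars {
    my ($arg, $argv_already_decoded) = @_;
    return $argv_already_decoded ? $arg : decode('UTF-8', $arg);
}

# Hypothetical helper: render a string as space-separated U+XXXX code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# What the shell hands to perl for the argument: raw UTF-8 bytes.
my $bytes = encode('UTF-8', "\x{e4}\x{f6}\x{fc}");

# Without the A flag: a single decode yields the three expected characters.
my $once = decode('UTF-8', $bytes);

# With the A flag perl has already decoded @ARGV. Decoding again treats
# the code points E4 F6 FC as bytes, and those are not valid UTF-8, so
# U+FFFD replacement characters appear instead.
my $twice = decode('UTF-8', $once);

print "decoded once:  ", codepoints($once),  "\n";
print "decoded twice: ", codepoints($twice), "\n";

# Guarded variant: skip the decode when the A flag (bit 32) is active.
my @args = map { arg_to_chars($_, ${^UNICODE} & 32) } @ARGV;
print "argument:      ", codepoints($_), "\n" for @args;
```

Run with and without PERL_UNICODE=SDAL (or -CA) and an argument such as äöü; with the guard, the "argument" line should show the same code points in both cases. Whether such a guard belongs in the script, or whether the environment should simply be left unset, is a design choice for the script's author.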
Re^2: uparse - Parse Unicode strings
by ikegami (Patriarch) on Nov 20, 2023 at 14:45 UTC

    The script assumes your terminal uses UTF-8. However, you are not using a UTF-8 locale. You should look into switching to a UTF-8 locale.

    I didn't notice there were other comments already.
