PerlMonks  

Re^3: german Alphabet

by haukex (Archbishop)
on Dec 04, 2018 at 21:02 UTC ( [id://1226742]=note )


in reply to Re^2: german Alphabet
in thread german Alphabet

I don't see in ikegami's script the need for use utf8;.

The OP as well as ikegami's script contain the string 'Fräsen und ndk (Kamera - Fräsaufnahme)'. From utf8: "The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope. ... Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. ... Because it is not possible to reliably tell UTF-8 from native 8 bit encodings, you need either a Byte Order Mark at the beginning of your source code, or use utf8;, to instruct perl."

Although the "ä" may happen to appear to work because it's part of the Latin1 character set, which Perl typically uses internally, it will most likely not do what you want on any Unicode characters outside of that set. As you can see below, the only version of the code in which the UTF8 flag is properly set on the string is the one where the source is encoded as UTF-8 and use utf8; is used. The rule of thumb I always use is to either work entirely in ASCII (using escapes such as \N{} to specify Unicode characters), or otherwise use a UTF-8 encoding on the source code and use utf8;. See also perluniintro and perlunicode.

$ cat with_utf8.pl
use warnings;
use strict;
use utf8;
use Devel::Peek;
my $string = 'Fräsen und ndk (Kamera - Fräsaufnahme)';
Dump($string);
$ perl -pe 's/^(?=.*utf8)/#/' with_utf8.pl | tee without_utf8.pl
use warnings;
use strict;
#use utf8;
use Devel::Peek;
my $string = 'Fräsen und ndk (Kamera - Fräsaufnahme)';
Dump($string);
$ iconv -f UTF-8 -t Latin1 without_utf8.pl -o latin1.pl
$ file -i *.pl
latin1.pl: text/plain; charset=iso-8859-1
without_utf8.pl: text/plain; charset=utf-8
with_utf8.pl: text/plain; charset=utf-8
$ perl latin1.pl
SV = PV(0x1365d70) at 0x13855c0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x13d7160 "Fr\344sen und ndk (Kamera - Fr\344saufnahme)"\0
  CUR = 38
  LEN = 40
  COW_REFCNT = 1
$ perl without_utf8.pl
SV = PV(0xa15d70) at 0xa355c0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0xa87190 "Fr\303\244sen und ndk (Kamera - Fr\303\244saufnahme)"\0
  CUR = 40
  LEN = 42
  COW_REFCNT = 1
$ perl with_utf8.pl
SV = PV(0x18d5d70) at 0x18f55d8
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x19384a0 "Fr\303\244sen und ndk (Kamera - Fr\303\244saufnahme)"\0 [UTF8 "Fr\x{e4}sen und ndk (Kamera - Fr\x{e4}saufnahme)"]
  CUR = 40
  LEN = 42
  COW_REFCNT = 1
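For the other half of the rule of thumb (staying entirely in ASCII), here is a minimal sketch. It is not from the scripts above: the shortened string "Fräsen" and the variable names are only for illustration. It writes the umlaut as a \N{U+...} escape and compares the result with the raw UTF-8 bytes that a UTF-8 source file yields without use utf8;.

```perl
use strict;
use warnings;

# ASCII-only source: the umlaut is spelled as an escape, so the encoding
# of the file itself no longer matters.
my $escaped = "Fr\N{U+E4}sen";

# What a UTF-8 encoded file yields without "use utf8": raw UTF-8 bytes.
my $bytes = "Fr\xC3\xA4sen";

# Decoding those bytes gives the same characters as the escaped form.
my $decoded = $bytes;
utf8::decode($decoded) or die "not valid UTF-8";

printf "escaped: length %d, UTF8 flag %s\n",
    length($escaped), utf8::is_utf8($escaped) ? 'on' : 'off';
printf "bytes:   length %d, UTF8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "decoded eq escaped: %s\n", $decoded eq $escaped ? 'yes' : 'no';
```

The escaped string has 6 characters while the undecoded one has 7, which is the same CUR difference the Dump output shows between latin1.pl and without_utf8.pl. utf8::is_utf8 reports only the flag; Devel::Peek shows the full picture.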

Updated as per ikegami's reply.

Replies are listed 'Best First'.
Re^4: german Alphabet
by ikegami (Patriarch) on Dec 05, 2018 at 09:19 UTC

    Perl assumes ASCII, not latin-1.

    $ perl -Mutf8 -MEncode -e'print encode("latin-1", "sub fête {}\n");' \
       | perl
    Illegal declaration of subroutine main::f at - line 1.

    If you happen to use an 8-bit byte in a string literal, a character with the value of the byte will be created rather than throwing an error.

      It might be important to note that when one tries to print a wide string that happens to be representable in latin-1, Perl uses latin-1 with no warnings:
      $ perl -w -Mutf8 -E'print "ê"' | hd
      00000000  ea                                                |.|
      00000001
      "ê" is decoded into characters but then printed to a handle that doesn't have an :encoding(...) or :utf8 I/O layer. Since it's representable in latin-1, the single-byte encoding is used and no warning is shown.
      $ perl -w -Mutf8 -E'print "ы"' | hd
      Wide character in print at -e line 1.
      00000000  d1 8b                                             |..|
      00000002
      
      Similar situation, but "ы" cannot be represented in latin-1, so we get a warning and UTF-8 bytes instead.
      $ perl -w -E'print "ê"' | hd
      00000000  c3 aa                                             |..|
      00000002
      (My terminal is UTF-8. No decoding or encoding is done in this case, Perl operates on bytes.)

        No. Perl never uses latin-1.

        In the first case (print "\xEA";), Perl is expecting bytes, and you provided a string of bytes, so it printed the bytes (as-is). It didn't warn because you provided what was expected.

        In the second case (print "\x{44B}";), Perl is expecting bytes, and you didn't provide a string of bytes, so it guesses that you meant to encode them using UTF-8, does so, and warns.

        In the third case (print "\xC3\xAA";), Perl is expecting bytes, and you provided a string of bytes, so it printed the bytes (as-is). It didn't warn because you provided what was expected.

        (A string of bytes is a string consisting entirely of characters with a value less than 256.)
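To make the three cases concrete, here is a small sketch. The helper names is_byte_string and bytes_via_utf8_layer are made up for this example. The first applies the "less than 256" test from the definition above; the second prints through an explicit :encoding(UTF-8) layer on an in-memory handle, so Perl has nothing to guess and nothing to warn about.

```perl
use strict;
use warnings;

# True if every character is below 256, i.e. a "string of bytes"
# in the sense used above.
sub is_byte_string {
    my ($s) = @_;
    return $s !~ /[^\x00-\xFF]/ ? 1 : 0;
}

# Print $text to an in-memory handle that has an explicit encoding
# layer, and return the bytes that arrived as space-separated hex.
sub bytes_via_utf8_layer {
    my ($text) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return join ' ', map { sprintf '%02x', ord } split //, $buffer;
}

# The three strings from the cases above.
for my $s ("\xEA", "\x{44B}", "\xC3\xAA") {
    printf "%-14s byte string: %-3s through :encoding(UTF-8): %s\n",
        join('', map { sprintf '\\x{%X}', ord } split //, $s),
        is_byte_string($s) ? 'yes' : 'no',
        bytes_via_utf8_layer($s);
}
```

With the layer in place, "\xEA" comes out as c3 aa and "\x{44B}" as d1 8b, without a warning. The third string is already UTF-8 bytes, so sending it through the layer encodes it a second time, which is why decoded text and an encoding layer belong together.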

Re^4: german Alphabet
by Aldebaran (Curate) on Dec 07, 2018 at 23:44 UTC
    ...use a UTF-8 encoding on the source code and use utf8;

    I have found this to be a very informative thread and ikegami's comments illuminating. Some issues require comment with attending source, so that others can replicate. I have enjoyed replicating haukex's source and wicked use of the command line to clone a script with a use statement commented out. That said, I don't understand the current output.

    I use my clone tool on haukex's script to get a filename in my nomenclature:

    $ ./2.create.bash with_utf8.pl
    The shebang is specifying bash
    Using bash 4.4.19(1)-release
    1 1.pl
    -rwxr-xr-x 1 bob bob 125 Dec 6 11:54 1.pl
    $ file -i *.pl
    ...
    1.pl: text/x-perl; charset=utf-8
    2.excel.pl: text/x-perl; charset=us-ascii
    ...
    5.ping4.pl: text/x-perl; charset=us-ascii
    6.excel.pl: text/x-perl; charset=us-ascii
    latin1.pl: text/plain; charset=iso-8859-1
    without_utf8.pl: text/plain; charset=utf-8
    with_utf8.pl: text/plain; charset=utf-8
    $

    I then use his nifty shell command:

    $ perl -pe 's/^(?=.*utf8)/#/' 1.pl | tee 1.without_utf8.pl
    #!/usr/bin/perl -w
    use 5.011;
    #use utf8;
    use Devel::Peek;
    my $string = 'Gödel';
    Dump($string);
    $string = 'über';
    Dump($string);
    $string = 'alleß';
    Dump($string);
    $

    The original is unable to render the special characters in STDOUT. Uncertain what happens in code tags:

    $ ./1.pl
    string is G�del
    SV = PV(0x556af89dcda0) at 0x556af8a06fa0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556af8a2d730 "G\303\266del"\0 [UTF8 "G\x{f6}del"]
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    string is �ber
    SV = PV(0x556af89dcda0) at 0x556af8a06fa0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556af8a124e0 "\303\274ber"\0 [UTF8 "\x{fc}ber"]
      CUR = 5
      LEN = 10
      COW_REFCNT = 1
    string is alle�
    SV = PV(0x556af89dcda0) at 0x556af8a06fa0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556af89f9eb0 "alle\303\237"\0 [UTF8 "alle\x{df}"]
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    $

    BUT, (this part is surprising to me), the umlauts are legible in STDOUT for the version with use utf8 commented out. They will probably get shredded in code tags:

    $ ./1.without_utf8.pl
    string is Gödel
    SV = PV(0x5584c04dbda0) at 0x5584c0505a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5584c052d400 "G\303\266del"\0
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    string is über
    SV = PV(0x5584c04dbda0) at 0x5584c0505a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5584c0515040 "\303\274ber"\0
      CUR = 5
      LEN = 10
      COW_REFCNT = 1
    string is alleß
    SV = PV(0x5584c04dbda0) at 0x5584c0505a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5584c050d0d0 "alle\303\237"\0
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    $

    I tried to switch the encoding to us-ascii using a command similar to what you used but fail to find the correct syntax:

    $ iconv -f UTF-8 -t us-ascii 1.pl -o 1.us-ascii.pl
    iconv: illegal input sequence at position 72
    $

    Also, I'm not sure what I'm to be gleaning from Devel::Peek. Is the idea that you get to see what perl's internal representation of a string is?

      You get to see Perl's internal representations of scalars and its "subclasses" (arrays, hashes, globs, etc). See illguts for documentation on these. (Grab the tarball and look at the files named index-*.html or illguts-*.pdf.)

      The transcoding failure is the result of "ö", "ü" and "ß" not being in the US-ASCII character set.

      I tried to switch the encoding to us-ascii using a command similar to what you used but fail to find the correct syntax:
      $ iconv -f UTF-8 -t us-ascii 1.pl -o 1.us-ascii.pl
      iconv: illegal input sequence at position 72
      $

      I finally realized what that was when I took a look inside the unexpected partial file that resulted from this command: 1.us-ascii.pl:

      $ cat 1.ascii.pl
      #!/usr/bin/perl -w
      use 5.011;
      #use utf8;
      use Devel::Peek;
      my $string = 'G$

      So, position 72 was where the first umlaut occurred, and now I at least understand the error.
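The same position can be found from Perl without iconv. This is only a sketch: first_non_ascii_offset is a made-up helper, and $source is an assumed reconstruction of the first lines of 1.pl (the listing above with use utf8; not commented out), written as bytes so the example itself stays pure ASCII.

```perl
use strict;
use warnings;

# Return the 0-based byte offset of the first byte outside US-ASCII,
# or undef if the whole string is plain ASCII.
sub first_non_ascii_offset {
    my ($bytes) = @_;
    return $bytes =~ /[^\x00-\x7F]/ ? $-[0] : undef;
}

# Assumed start of 1.pl as raw bytes; "\xC3\xB6" is the UTF-8 encoding of "ö".
my $source = join '',
    "#!/usr/bin/perl -w\n",
    "use 5.011;\n",
    "use utf8;\n",
    "use Devel::Peek;\n",
    "my \$string = 'G\xC3\xB6del';\n";

my $offset = first_non_ascii_offset($source);
print defined $offset
    ? "first non-ASCII byte at offset $offset\n"    # prints 72 for this input
    : "pure ASCII\n";
```

For a real file, read it without any decoding layer (for example with open my $fh, '<:raw', $path) so that offsets are counted in bytes, as iconv counts them.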
