Re: create clone script for utf8 encoding

by Anonymous Monk
on Dec 15, 2018 at 08:17 UTC


in reply to create clone script for utf8 encoding

I'm running ubuntu with bash, and when I touch a file into existence, it is us-ascii. Likewise, files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform. Where is this determined on POSIX systems?

Strictly speaking, that depends both on the program that created the file and on your interpretation of it. For example, printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > file would create a file filled with bytes that, when interpreted as KOI8-R (iconv -f koi8-r file), translate to a greeting in Russian.
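
To make the "interpretation" point concrete, here is a minimal sketch that decodes the same six bytes twice, once as KOI8-R and once as ISO-8859-1, and prints the resulting UTF-8 as hex. It assumes an iconv that knows both encodings. The helper name hex and the variable names are mine, not from the thread, and the bytes are written as octal escapes only because that is the form POSIX printf guarantees (they are the same bytes as the \x escapes above).

```shell
# The same six bytes as \xf0\xd2\xc9\xd7\xc5\xd4, written as portable octal escapes.
bytes='\360\322\311\327\305\324'

# Hypothetical helper: hex-dump stdin as one unbroken lowercase string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Decode the bytes as KOI8-R and re-encode the characters as UTF-8.
as_koi8r=$(printf "$bytes" | iconv -f KOI8-R -t UTF-8 | hex)

# Decode the very same bytes as ISO-8859-1 instead.
as_latin1=$(printf "$bytes" | iconv -f ISO-8859-1 -t UTF-8 | hex)

echo "KOI8-R     -> $as_koi8r"
echo "ISO-8859-1 -> $as_latin1"
```

The two hex strings differ because each decoder maps the same input bytes to different characters; nothing in the file itself says which reading is the intended one.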

It just so happens that when you type text on your English keyboard and it's encoded into bytes according to the rules defined by your locale, its UTF-8-encoded bytes (Ubuntu has been UTF-8 by default for years) have the same meaning if you decode them as ASCII. UTF-8 was designed to be backwards compatible with ASCII for the first 128 code points.
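
A quick way to see that compatibility, as a sketch assuming iconv is available (the helper name hex is mine): push plain English text through an ASCII-to-UTF-8 conversion and compare the bytes before and after.

```shell
# Hypothetical helper: hex-dump stdin as one unbroken lowercase string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Bytes of the text exactly as typed.
raw=$(printf 'Hello' | hex)

# Bytes after decoding as US-ASCII and re-encoding as UTF-8.
via_utf8=$(printf 'Hello' | iconv -f US-ASCII -t UTF-8 | hex)

echo "raw:      $raw"
echo "as UTF-8: $via_utf8"
```

For text that stays within the first 128 code points the conversion changes nothing, which is why such a file can be reported as us-ascii even though the locale is UTF-8.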

$ iconv -f us-ascii -t UTF-8 2.ascii.de.txt -o 2.de.utf8.txt
iconv: illegal input sequence at position 0
Does ascii have a representation for Ü?

No. If you consult the ASCII table, you will see that it only defines characters for the byte values 0..127. With 26*2 letters + 10 digits + 33 control characters (0..31 and 127) to be interpreted by teletypes (or terminal emulators), there is only enough room left for the space character and 32 punctuation marks and symbols, but no accented characters. Single-byte encodings like ISO-8859-1 or KOI8-R use the byte values 128..255 for that.
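
The iconv error quoted above can be reproduced on a single character. This is only a sketch, assuming iconv with US-ASCII and ISO-8859-1 support; the helper and variable names are mine. "Ü" is U+00DC, which UTF-8 encodes as the two bytes C3 9C and ISO-8859-1 as the single byte DC.

```shell
# Hypothetical helper: hex-dump stdin as one unbroken lowercase string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Feed the UTF-8 bytes of "Ü" (C3 9C, octal 303 234) to an ASCII decoder.
# Both bytes are above 127, so the conversion is expected to fail.
if printf '\303\234' | iconv -f US-ASCII -t UTF-8 >/dev/null 2>&1; then
    ascii_ok=yes
else
    ascii_ok=no
fi

# The ISO-8859-1 byte for "Ü" (DC, octal 334) converts cleanly to UTF-8.
latin1_to_utf8=$(printf '\334' | iconv -f ISO-8859-1 -t UTF-8 | hex)

echo "accepted as US-ASCII: $ascii_ok"
echo "ISO-8859-1 DC as UTF-8: $latin1_to_utf8"
```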

If you run file 2.ascii.de.txt, you will see that it's actually UTF-8. file can also recognize pure ASCII files, because they don't have any bytes above 127, but it cannot tell different single-byte non-ASCII encodings apart. Those can contain any byte values, and you have to know statistics about the languages used with those encodings to guess, never with 100% certainty, which language and which encoding it is. UTF-8 also uses bytes above 127, but they always follow specific rules (a lead byte announces how many continuation bytes follow), which can be easily checked.
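
Both checks can be approximated by hand. The sketch below is not how file does it internally; it just counts bytes above 127 with tr, and uses iconv as a validator by converting UTF-8 to UTF-8 and looking at the exit status. The function names high_bytes and is_utf8 are made up for this example.

```shell
# Count the bytes above 127 on stdin; zero means the input is pure ASCII.
high_bytes() { LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '; }

# Succeed only if stdin is well-formed UTF-8 (iconv rejects broken sequences).
is_utf8() { iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

ascii_count=$(printf 'Hello\n' | high_bytes)
umlaut_count=$(printf '\303\234\n' | high_bytes)

# The UTF-8 encoding of the Russian greeting: lead bytes followed by continuation bytes.
if printf '\320\237\321\200\320\270\320\262\320\265\321\202\n' | is_utf8; then
    greeting_utf8=valid
else
    greeting_utf8=invalid
fi

# The KOI8-R bytes of the same greeting break the UTF-8 rules immediately.
if printf '\360\322\311\327\305\324\n' | is_utf8; then
    greeting_koi8r=valid
else
    greeting_koi8r=invalid
fi

echo "high bytes in 'Hello': $ascii_count, in 'Ü': $umlaut_count"
echo "UTF-8 bytes: $greeting_utf8, KOI8-R bytes: $greeting_koi8r"
```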

Finally, what makes any of these en_**.utf8 encodings different from another?

Those are locales, not encodings. The encoding specified by most of the locales is UTF-8, but the underlying locale settings differ: number format (decimal dot or comma?), date-time format (Y-m-d or m/d/y? 12 hours or 24 hours?), string collation rules (yes, the way we sort strings depends on the language they are in), etc.

Удачи (good luck),

Replies are listed 'Best First'.
Re^2: create clone script for utf8 encoding
by Aldebaran (Curate) on Dec 16, 2018 at 22:30 UTC
    Strictly speaking, that depends both on the program that created the file and on your interpretation of it. For example, printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > file would create a file filled with bytes that, when interpreted as KOI8-R (iconv -f koi8-r file), translate to a greeting in Russian.

    Спасибо, анонимный монах (thank you, Anonymous Monk). I try to run all the posted source on threads where I'm the OP, and I was very pleased to run yours and have an iconv command that worked 100 percent. The command gave me a lot of partial credit for failed attempts, which helped me diagnose the problem. I sense that you are experienced with Cyrillic encodings, so I'm very happy to have your attention on my issues, which must seem parochial by your standards.

    $ printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > 1.file
    $ iconv -f koi8-r 1.file
    Привет
    $ iconv -f koi8-r 1.file -o 1.prubyet
    $ file 1.prubyet
    1.prubyet: UTF-8 Unicode text
    $ cat 1.prubyet
    Привет
    $ cat 1.file
    ������
    $ file 1.file
    1.file: ISO-8859 text

    I know how these look in the terminal and in my editor. 3.file (created below) shows the Cyrillic greeting. 1.file shows six diamonds with question marks in the middle.

    I wondered what diff would think of them:

    $ echo &#1055;&#1088;&#1080;&#1074;&#1077;&#1090; >3.file
    $ diff 1.file 3.file
    1c1
    < &#65533;&#65533;&#65533;&#65533;&#65533;&#65533;
    ---
    > &#1055;&#1088;&#1080;&#1074;&#1077;&#1090;
    $

    I'm looking at 1.file and 3.file in the hex editor. 1.file was exactly what I expected, but 3.file has one value more than the 12 I expected. (?)

    D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82 0A

    I'd hoped this renders faithfully with monastery code tags. Do I gather that code tags unravel things that aren't us-ascii? Has anyone ever suggested having a form of code tag that did not do this?

      use <pre> instead of <code> for unicode
        use pre instead of code for unicode

        Sure enough...this is a repeat of the diff command with pre tags:

        $ echo Привет >3.file
        $ diff 1.file 3.file
        1c1
        < ������
        ---
        > Привет
        $ 
        
        

        Hmmm, well there it is. I tried pre tags in the writeup but must not have pasted them in and previewed correctly. There is something to learn from seeing the numerical representations of these characters. Indeed, I was surprised that diff showed 1.file as six copies of 65533. That is the Unicode replacement character, U+FFFD. Further reading and clarification here: unicode specials

        How did you get single code and pre tags to display (surrounded by <>) and not foul the legibility?

        Also, is there a way to employ the diff command so that the equality in these files could be established? (not essential or vital to this coding task)
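
One possible approach, offered only as a sketch: diff compares bytes, so the files have to be brought to a common encoding first. Converting the KOI8-R file to UTF-8 on the fly and letting diff read it from standard input (the - argument) does that. The temporary directory and the variable names are my own, and the bytes are written as octal escapes so the example does not depend on the shell's printf understanding \x.

```shell
# Work in a scratch directory so no existing files are touched.
dir=$(mktemp -d)

# 1.file: the greeting encoded as KOI8-R, plus a newline.
printf '\360\322\311\327\305\324\n' > "$dir/1.file"

# 3.file: the same greeting encoded as UTF-8, plus a newline.
printf '\320\237\321\200\320\270\320\262\320\265\321\202\n' > "$dir/3.file"

# A plain byte-wise diff sees two different files.
if diff "$dir/1.file" "$dir/3.file" >/dev/null; then raw_same=yes; else raw_same=no; fi

# Convert 1.file to UTF-8 first, then compare the result with 3.file.
if iconv -f KOI8-R -t UTF-8 "$dir/1.file" | diff - "$dir/3.file" >/dev/null; then
    text_same=yes
else
    text_same=no
fi

echo "same bytes: $raw_same, same text: $text_same"
rm -rf "$dir"
```

This only works when you already know the source encoding; diff itself has no notion of encodings.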

      3.file has one value more than the 12 I expected. (?)
      The 0A at the end is the newline, "\n". If you omit it, the shell prompt will be printed on the same line as the text:
      username@localhost:~$ printf '\xf0\xd2\xc9\xd7\xc5\xd4' | iconv -f koi8-r
      Приветusername@localhost:~$
      Together with carriage return "\r", this can be used to produce various effects on the console. For example, the following program prints two different strings, but after it's finished the terminal will look like it didn't print anything:
      perl -e '$|=1; print "Now you see me!"; sleep 1; print "\r"; print "Now you don\x27t! "; sleep 1; printf "\r"'
      (Actually, you may see part of its output if your shell prompt is short enough. For a more honest but less portable version, see man console_codes.)
