Just another Perl shrine | |
PerlMonks |
Re: create clone script for utf8 encodingby Anonymous Monk |
on Dec 15, 2018 at 08:17 UTC ( [id://1227292]=note: print w/replies, xml ) | Need Help?? |
I'm running ubuntu with bash, and when I touch a file into existence, it is us-ascii. Likewise, files that are formed from redirecting STDOUT begin their lives as us-ascii on this platform. Where is this determined on POSIX systems? Strictly speaking, that depends both on the program that created the file and your interpretation of it. For example, printf '\xf0\xd2\xc9\xd7\xc5\xd4\n' > file would create a file filled with bytes, that, when interpreted as KOI8-R (iconv -f koi8-r file), would translate to a greeting in Russian. It just so happens that when you type text on your English keyboard and it's encoded into bytes according to the rules defined by your locale, its UTF8-encoded bytes (Ubuntu has been UTF-8 by default for years) have the same meaning if you decode them as ASCII. UTF-8 has been designed to be "backwards compatible" to ASCII when it comes to the first 128 code points. Does ascii have a representation for Ü? No. If you consult the ASCII table, you will see that it only defines glyphs corresponding to byte values 0..127. With 26*2 letters + 10 digits + 32 control characters to be interpreted by teletypes (or terminal emulators) there is only enough space for some punctuation marks, but no accented characters. Single-byte encodings like ISO-8859-1 or KOI8-R use the byte values 128..255 for that. If you run file 2.ascii.de.txt, you will see that it's actually UTF-8. file can also discern pure ASCII files - because they don't have any bytes above 127 - but cannot discern different single-byte non-ASCII encodings. Those can contain any byte values, and you have to know statistics about the languages used for those encodings to guess - not 100% right - which language and which encoding it is. UTF-8 can also contain any byte values, but the bytes always follow specific rules which can be easily checked. Finally, what makes any of these en_**.utf8 encodings different from another? Those are locales, not encodings. The encoding specified by most of the locales is UTF-8, but the underlying locale settings like number format (decimal dot or comma?), date-time format (Y-m-d or m/d/y? 12 hours or 24 hours?), string collation rules (yes, the way we sort strings depends on the language they are in), etc) are different. Удачи,
In Section
Seekers of Perl Wisdom
|
|