in reply to Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?

This looks to me like a fundamental misunderstanding of what an encoding is and which encodings exist, and maybe of more on this topic as well.

An encoding is just a way to map numbers (whether one byte or more) to glyphs, such as mapping the number 97 to the glyph "a".

Different encodings have different mappings. Setting aside the Unicode encodings (UTF-8, UTF-16, UTF-32, etc., and, yes, there are more), some glyphs appear in more than one encoding. Of those, some sit in different places in different encodings, some sit in the same place in some encodings but not in others, and some sit in the same place in every encoding they appear in. A few appear, in the same place, in practically every encoding you are likely to meet.

And some glyphs appear in the same place in all of those encodings and in the same place in the Unicode encodings too (possibly with the exception of UTF-7). That is likely where we are right here.
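Here is a rough sketch of both cases, in Python only because it is short to write; the same experiment works with Perl's Encode module. The helper name `decode_byte` is made up for this post, and the particular encodings are just ones I picked as examples. It reads one byte value under several encodings and reports the Unicode name of the resulting character.

```python
import unicodedata


def decode_byte(value: int, encoding: str) -> str:
    """Interpret a single byte value under the named encoding."""
    return bytes([value]).decode(encoding)


# Byte 0x61 (97) is "a" in every one of these encodings.
for enc in ("iso-8859-1", "iso-8859-15", "cp1252", "utf-8"):
    print(enc, unicodedata.name(decode_byte(0x61, enc)))

# Byte 0xA4 is a different character depending on the encoding.
print(unicodedata.name(decode_byte(0xA4, "iso-8859-1")))   # CURRENCY SIGN
print(unicodedata.name(decode_byte(0xA4, "iso-8859-15")))  # EURO SIGN
```

The loop prints LATIN SMALL LETTER A four times, while the last two lines disagree about what byte 0xA4 means.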

If you compare the glyphs and their code points for all ordinals under 128 in ISO-8859-1 against those same code points in UTF-8, you will find that they are bit-for-bit identical. That is, there is no actual way to tell that a UTF-8 file that only uses the code points under 128 as found in ISO-8859-1 is not actually ISO-8859-1. Whether you treat it as ISO-8859-1 or as UTF-8, it doesn't change anything. (This only holds below 128: from 128 up, ISO-8859-1 still uses one byte per character while UTF-8 uses two or more, so the bytes differ.)

So, when you convert a file like that from one to the other, you can do so with the "copy" ("cp") command.
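If you want to check this for yourself, here is a minimal sketch, again in Python for brevity. The function name `same_bytes_in_both` and the sample strings are invented for illustration. It encodes the same text both ways and compares the resulting bytes.

```python
def same_bytes_in_both(text: str) -> bool:
    """True if text encodes to identical bytes in ISO-8859-1 and UTF-8."""
    try:
        return text.encode("iso-8859-1") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text has characters that ISO-8859-1 cannot represent at all.
        return False


ascii_only = "<p>Hello, world</p>"
with_accent = "<p>caf\u00e9</p>"  # contains e-acute, code point 233

print(same_bytes_in_both(ascii_only))   # True
print(same_bytes_in_both(with_accent))  # False
print("\u00e9".encode("iso-8859-1"))    # b'\xe9'
print("\u00e9".encode("utf-8"))         # b'\xc3\xa9'
```

For the first string a plain copy is a perfectly good "conversion". For the second it is not, because the one character above 127 becomes two bytes in UTF-8.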

(See the conversation in one of my recent threads for another example along the same confusion.)

Your starting file already is UTF-8. If the "file" command can't tell them apart, that's because there is no telling them apart. However, for HTML, the file command may also use extra heuristics, such as looking for meta tags. So when you change the meta tags, you change the output of file. I don't know whether anyone would complain if the meta tag disagreed with the actual encoding, other than your users.
