PerlMonks  

Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?

by taint (Chaplain)
on Jun 05, 2013 at 16:29 UTC
taint has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monksters,
I've been struggling for some time attempting to convert ISO-8859-1 web pages (html files) to utf8 encoded html files.
I've tried iconv(1), which failed, so I tried piconv(1), which also failed.
It's not that either can't accomplish the task -- it's that they refuse to perform the task.
Example: I have roughly 1,000 .html files I want to convert from Latin1 => utf8:
#!/bin/sh -
# 2utf8 (via iconv(1)
for i in *.html; do
  iconv -f iso-8859-1 -t utf8 $i > $i.tmp
  rm $i
  mv $i.tmp $i
done
The resulting files remain ISO-8859-1. This time with piconv(1):
#!/bin/sh -
# 2utf8 (via Piconv(1)
for i in *.html; do
  piconv -f iso-8859-1 -t utf8 $i > $i.tmp
  rm $i
  mv $i.tmp $i
done
Again, the resultant files remain ISO-8859-1 (Latin1).
All of these files contain the following line within the <head> tags:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Strangely, if I change that line to:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
and perform:
piconv -f ISO-8859-1 -t utf8 index.html > index.htm
The resultant file will be utf8 encoded.
Is this a bug in Perl?

Any and all help with this is greatly appreciated.

Thank you for all your time and consideration.

--chris

#!/usr/bin/perl -Tw
use perl::always;
my $perl_version = "5.12.4";
print $perl_version;

Re: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?
by space_monk (Chaplain) on Jun 05, 2013 at 16:45 UTC
    How are you determining what encoding is used by the end file? If you're loading them into a web browser then the content type will be determined by the meta information. AFAIK the content type of a text file is what you say it is, not detected.
    If you spot any bugs in my solutions, it's because I've deliberately left them in as an exercise for the reader! :-)
Re: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?
by Anonymous Monk on Jun 05, 2013 at 17:12 UTC

    ... also failed .. refuse to perform the task...

    How do you determine this?

    meta ..

    iconv doesn't consult meta at all, meta is for the browsers

    Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?

    A poor workman blames his tools

      A poor workman blames his tools
      An ignorant person makes assumptions.

      #!/usr/bin/perl -Tw
      use perl::always;
      my $perl_version = "5.12.4";
      print $perl_version;

        A poor workman blames his tools
        An ignorant person makes assumptions.

        Never heard of that proverb :) and FWIW, I don't see where I made any assumptions, you ask "why is perl to blame" but you don't show that it is, you simply say that it is :)

        OTOH, every time you post a question or response you get a link to How do I post a question effectively?, which says to show some data that demonstrates the problem. You could use  perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; "  tenlinefile > tenlinefileasperl.pl

Re: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?
by kcott (Abbot) on Jun 05, 2013 at 18:09 UTC

    G'day taint,

    piconv converts character encodings. Here's an example of ISO-8859-1 to UTF-8 and back again (using the copyright sign):

    $ piconv -f ISO-8859-1 -t utf8 -s ''
    ©
    $ piconv -t ISO-8859-1 -f utf8 -s '©'

    piconv does not look for keys such as "charset" or "encoding" and attempt to change their values.

    Also, all the characters in the string "iso-8859-1" are ASCII; their values are identical to the Unicode code points of the corresponding characters. Had that meta element contained non-ASCII characters, you would have seen some conversion.

    $ piconv -f ISO-8859-1 -t utf8 \
    > -s '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    $ piconv -f ISO-8859-1 -t utf8 \
    > -s '<meta name="registered sign" content="">'
    <meta name="registered sign" content="®">

    To convert your HTML files, you'll need to run piconv and also change "iso-8859-1" references to "utf-8". Be aware that there are several places in which encodings might be specified: for instance, meta and script elements may contain a charset attribute and XHTML documents may include encoding attributes.

    -- Ken

      G'day to you too, mate!
      I'd like to preface my response, by letting you know that I really appreciate all the time, and effort you put into all your responses -- +2 to you.
      As to the iso-8859-1 => utf8 issue I'm having, and your reply...
      I'm keen on the points you've made. I do recognize that the action(s) that iconv(1) && piconv(1) perform upon their subject files do not read such html tags as <meta http-equiv="content-type" content="application/xhtml+xml; charset=iso-8859-1 || utf-8" />
      Which is why I was so puzzled as to why the file(s) would take on the requested "MimeType" after changing that line.
      I'm wondering if it wouldn't make more sense for me to attempt to create my own "converter" utilizing Encode.pm -- which I believe piconv(1) uses anyway.

      Thanks again, for taking the time to respond.

      --chris

      #!/usr/bin/perl -Tw
      use perl::always;
      my $perl_version = "5.12.4";
      print $perl_version;
        Which is why I was so puzzled as to why the file(s) would take on the requested "MimeType" after changing that line.

        From my experiments (and the documentation), file can only guess at the encoding by the presence or absence of specific codepoints. If your file has only those UTF-8 codepoints which also fit the Latin-1 encoding, perhaps it's all too happy to report the file as Latin-1. In my experience, iconv doesn't add a UTF-8 BOM to the start of the file. When I did that with Vim, file was a lot more specific.
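The ambiguity described above can be checked without file(1). A rough sketch, assuming a POSIX shell: data made only of bytes below 0x80 is byte-identical in ASCII, ISO-8859-1 and UTF-8, so counting the bytes at or above 0x80 shows whether a guesser has anything to go on. The helper name count_high_bytes is made up for illustration.

```shell
#!/bin/sh
# count_high_bytes (hypothetical name): count the bytes >= 0x80 on stdin.
# Zero means the data is pure ASCII, which reads the same as Latin-1 or UTF-8.
count_high_bytes() {
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

printf 'plain ascii\n' | count_high_bytes    # pure ASCII: no high bytes
printf 'caf\351\n' | count_high_bytes        # Latin-1 e-acute: one high byte
printf 'caf\303\251\n' | count_high_bytes    # UTF-8 e-acute: two high bytes
```

For a real file, something like `count_high_bytes < index.html` would tell you whether the conversion could have changed anything at all.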

        Thanks for your complimentary remarks — they are appreciated.

        piconv does use Encode. It's also relatively short: if you ignore the option handling, POD, etc., you're left with probably less than 100 lines of code. So, if you wanted to use that as a starting point to roll your own version, I don't imagine it would be an overwhelmingly difficult task. However, having said that, if this is just a one-off exercise, perhaps something along these lines would suffice:

        $ for i in latin/*.html; do
        > piconv -f ISO-8859-1 -t utf8 $i | \
        > perl -pe 's/((?>charset|encoding)=)iso-8859-1/${1}utf-8/gi' - \
        > > utf8/`basename $i`
        > done

        -- Ken

Re: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?
by moritz (Cardinal) on Jun 05, 2013 at 19:02 UTC

    When looking at your file, use hexdump. hexdump never lies to you, contrary to many terminals, browsers, text editors and other tools.
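A minimal sketch of that advice, using od(1) as a POSIX stand-in for hexdump: it prints the raw bytes of the copyright sign in each encoding. The helper name show_bytes is an assumption made up for this example.

```shell
#!/bin/sh
# show_bytes (hypothetical name): print stdin as space-separated hex bytes.
show_bytes() {
    od -An -v -tx1 | xargs
}

# The copyright sign is the single byte 0xA9 in ISO-8859-1 ...
printf '\251' | show_bytes
# ... and the two-byte sequence 0xC2 0xA9 in UTF-8.
printf '\302\251' | show_bytes
```

Running the same helper over the first few hundred bytes of a converted file (for instance `head -c 300 index.html | show_bytes`) shows what is really on disk, whatever an editor or browser claims.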

    Chances are that the conversion works, but whatever you use to identify the encoding lies to you.

      @space_monk, @Anonymous Monk, & @moritz
      Thanks for taking the time to respond.
      I use my Editor, and file(1); as in:
      file ./index.html
      file -i ./index.html
      file --mime-type ./index.html
      file --mime-encoding ./index.html
      As to my Editor; it has always correctly reported the loaded file(s) properties in the past,
      and I have no reason to think it suddenly decided to stop. :)

      --chris

      #!/usr/bin/perl -Tw
      use perl::always;
      my $perl_version = "5.12.4";
      print $perl_version;
Re: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?
by Tanktalus (Canon) on Jun 08, 2013 at 22:30 UTC

    This looks to me like a fundamental misunderstanding of what encoding is, and what encodings exist, and maybe more on this topic as well.

    An encoding is just a way to map numbers (whether one byte or more) to glyphs, such as mapping the number 97 to the glyph "a".

    Different encodings have different mappings. Not counting Unicode encodings (UTF-8, UTF-16, UTF-32, etc., and, yes, there are more) some glyphs appear in more than one encoding, some glyphs appear in different places in different encodings, some glyphs occur in the same place in some encodings (but different in others), some glyphs occur in the same place in every encoding they appear in, and some glyphs appear in the same place in all encodings.

    And some glyphs appear in the same place in all encodings and the same place in Unicode encodings (possibly with the exception of UTF-7). And that is likely where we are right here.

    If you compare the glyphs and their code points for all ordinals under 128 in ISO-8859-1 against those same code points in UTF-8, you will find that they are bit-for-bit identical. That is, there is no actual way to tell that a UTF-8 file that only uses the code points under 128 as found in ISO-8859-1 is not actually ISO-8859-1. Whether you treat it as ISO-8859-1 or as UTF-8, it doesn't change anything.

    So, when you convert from one to the other, you can do so with the "copy" ("cp") command.
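The claim can be sketched as follows, assuming iconv(1), cmp(1) and mktemp(1) are available. The helper name conversion_effect is hypothetical: it converts its input from ISO-8859-1 to UTF-8 and reports whether any byte changed.

```shell
#!/bin/sh
# conversion_effect (hypothetical name): report whether converting stdin
# from ISO-8859-1 to UTF-8 alters the bytes.
conversion_effect() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    # cmp -s compares the converted stream (stdin) against the original copy.
    if iconv -f ISO-8859-1 -t UTF-8 "$tmp" | cmp -s - "$tmp"; then
        echo unchanged
    else
        echo changed
    fi
    rm -f "$tmp"
}

# Only bytes under 128: the "conversion" is equivalent to a plain copy.
printf '<p>plain ASCII</p>\n' | conversion_effect
# One Latin-1 byte (0xE9, e-acute): the output now differs from the input.
printf '<p>caf\351</p>\n' | conversion_effect
```

If every one of the HTML files reports "unchanged", the files were already valid UTF-8 and only the declared charset needs editing.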

    (See the conversation in one of my recent threads for another example along the same confusion.)

    If your starting file uses only those code points, it already is UTF-8. If the "file" command can't tell them apart, that's because there is no telling them apart. However, as HTML, the file command may also use extra heuristics, such as looking for meta tags. So when you change the meta tags, you change the output of file. If the meta tag disagreed with the actual encoding, I don't know whether anyone would complain, other than your users.
