http://www.perlmonks.org?node_id=1037466


in reply to Re: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?
in thread Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?

G'day to you too, mate!
I'd like to preface my response by letting you know that I really appreciate all the time and effort you put into your responses -- +2 to you.
As to the iso-8859-1 => utf8 issue I'm having, and your reply...
I'm keen on the points you've made. I do recognize that iconv(1) && piconv(1) do not read HTML tags in the files they convert, such as <meta http-equiv="content-type" content="application/xhtml+xml; charset=iso-8859-1 || utf-8" />
Which is why I was so puzzled as to why the file(s) would take on the requested "MimeType" after changing that line.
I'm wondering if it wouldn't make more sense for me to attempt to create my own "converter" utilizing Encode.pm -- which I believe piconv(1) uses anyway.

Thanks again, for taking the time to respond.

--chris

#!/usr/bin/perl -Tw
use perl::always;
my $perl_version = "5.12.4";
print $perl_version;

Re^3: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?
by chromatic (Archbishop) on Jun 06, 2013 at 16:24 UTC
    Which is why I was so puzzled as to why the file(s) would take on the requested "MimeType" after changing that line.

    From my experiments (and the documentation), file can only guess at the encoding by the presence or absence of specific codepoints. If your file has only those UTF-8 codepoints which also fit the Latin-1 encoding, perhaps it's all too happy to report the file as Latin-1. In my experience, iconv doesn't add a UTF-8 BOM to the start of the file. When I did that with Vim, file was a lot more specific.
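A rough sketch of that kind of guessing, in Perl, may make the ambiguity clearer. It is only an illustration of the idea, not how file(1) is actually implemented, and classify_bytes() is a hypothetical helper name. The point it shows: a file containing only bytes 0x00-0x7F is byte-for-byte identical in ASCII, Latin-1 and UTF-8, so nothing in its content can tell them apart, whereas a leading BOM (EF BB BF) is an unambiguous hint.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: classify a byte string by content alone,
# roughly in the spirit of a content-sniffing tool.
sub classify_bytes {
    my ($bytes) = @_;

    # A UTF-8 BOM is the three bytes EF BB BF at the very start.
    return 'utf-8 (with BOM)' if $bytes =~ /\A\xEF\xBB\xBF/;

    # Only bytes 0x00-0x7F: the same in ASCII, Latin-1 and UTF-8.
    return 'ascii' if $bytes !~ /[\x80-\xFF]/;

    # High bytes present: check whether they form valid UTF-8.
    # Work on a copy, because decode() with a CHECK value may
    # modify the string it is given.
    my $copy = $bytes;
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };

    # Every byte sequence is valid Latin-1, so Latin-1 can only ever
    # be a fallback guess, never a positive identification.
    return $ok ? 'utf-8' : 'unknown 8-bit (possibly iso-8859-1)';
}

print classify_bytes("cafe"), "\n";          # no high bytes at all
print classify_bytes("caf\xE9"), "\n";       # e-acute as one Latin-1 byte
print classify_bytes("caf\xC3\xA9"), "\n";   # e-acute as two UTF-8 bytes
```

So a converted file that happens to contain no characters above 0x7F looks exactly the same before and after conversion, whatever its meta tag says.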

      Greetings chromatic, and thank you for your reply.

      Indeed. file(1) depends on its "Magic" file||dir for its response(s), and they aren't extremely concise. But it was only after noticing that my editor wasn't reporting a change from iso-8859-1 to utf8 that I used file(1) to help me confirm that my editor had not stopped functioning reliably after all these years. I might also add that my editor recognizes when a file it has opened has been "touched" -- as do all modern editors.

      I'll probably have to try and create a filter using Encode.pm to pipe these files through.

      I was just hoping that those utilities that were already designed to perform just such tasks, would | could accomplish this.


      Thanks again chromatic, for taking the time to respond.

      --chris

Re^3: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?
by kcott (Archbishop) on Jun 06, 2013 at 23:35 UTC

    Thanks for your complimentary remarks — they are appreciated.

    piconv does use Encode. It's also relatively short: if you ignore the option handling, POD, etc., you're left with probably less than 100 lines of code. So, if you wanted to use that as a starting point to roll your own version, I don't imagine it would be an overwhelmingly difficult task. However, having said that, if this is just a one-off exercise, perhaps something along these lines would suffice:

    $ for i in latin/*.html; do
    > piconv -f ISO-8859-1 -t utf8 $i | \
    > perl -pe 's/((?>charset|encoding)=)iso-8859-1/${1}utf-8/gi' - \
    > > utf8/`basename $i`
    > done
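If you did want to roll your own with Encode, a minimal sketch might look something like the following. It is not piconv's actual code. The helper name latin1_html_to_utf8(), the latin/ and utf8/ directory layout (borrowed from the loop above), and the optional quote after charset=/encoding= are all assumptions made for illustration. It decodes the bytes as ISO-8859-1, rewrites the declared charset, and encodes the result as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Basename qw(basename);

# Hypothetical helper: take ISO-8859-1 bytes, return UTF-8 bytes with
# any charset=/encoding= declaration updated to match.
sub latin1_html_to_utf8 {
    my ($bytes) = @_;

    # Bytes -> Perl characters. Every byte is valid Latin-1, so this
    # step cannot fail on malformed input.
    my $text = decode('ISO-8859-1', $bytes);

    # Update the declaration; an optional quote after '=' is allowed
    # (an assumption, to cover forms like encoding="iso-8859-1").
    $text =~ s/((?:charset|encoding)=["']?)iso-8859-1/${1}utf-8/gi;

    # Perl characters -> UTF-8 bytes.
    return encode('UTF-8', $text);
}

# Convert each file named on the command line into utf8/<same name>.
# Assumes the utf8/ directory already exists.
for my $path (@ARGV) {
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    my $out_path = 'utf8/' . basename($path);
    open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
    print {$out} latin1_html_to_utf8($bytes);
    close $out or die "Cannot close $out_path: $!";
}
```

Saved under a name of your choosing (say convert.pl, also hypothetical), it would be run as: perl convert.pl latin/*.html. Reading and writing with :raw keeps Perl's I/O layers out of the way, so the only transcoding that happens is the explicit decode/encode pair.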

    -- Ken

      Greetings kcott, and thanks for your reply.

      Indeed. That does do the trick!
      Thanks!
      I sure wish I was as well versed in Perl as you are. But I'm afraid I've got a ways to go yet. :(

      As always, very grateful for all your time, and consideration.

      --chris
