
Re: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?

by kcott (Abbot)
on Jun 05, 2013 at 18:09 UTC ( #1037270=note )


in reply to Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?

G'day taint,

piconv converts character encodings. Here's an example of ISO-8859-1 to UTF-8 and back again (using the copyright sign):

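$ piconv -f ISO-8859-1 -t utf8 -s ''
©
$ piconv -t ISO-8859-1 -f utf8 -s '©'

(The argument to the first command, and the output of the second, is the single ISO-8859-1 byte 0xA9, which cannot be shown in UTF-8 text.)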

piconv does not look for keys such as "charset" or "encoding" and attempt to change their values.

Also, all the characters in the string "iso-8859-1" are ASCII; their values are identical to the Unicode code points of the corresponding characters. Had that meta element contained non-ASCII characters, you would have seen some conversion.

$ piconv -f ISO-8859-1 -t utf8 \
> -s '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
$ piconv -f ISO-8859-1 -t utf8 \
> -s '<meta name="registered sign" content="">'
<meta name="registered sign" content="®">
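
The same point can be seen at the byte level. This is only a rough sketch: it uses iconv(1) and od(1) in place of piconv (assuming both are installed), and `hex_bytes` is a helper name made up for the example.

```shell
# hex_bytes: hypothetical helper; prints stdin as space-separated hex bytes.
hex_bytes() {
    od -An -tx1 -v | xargs
}

# Pure ASCII: every byte survives the conversion unchanged,
# so these two commands print the same hex string.
printf 'iso-8859-1' | hex_bytes
printf 'iso-8859-1' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes

# The copyright sign is the single byte 0xA9 (octal 251) in ISO-8859-1;
# in UTF-8 it becomes the two-byte sequence 0xC2 0xA9.
printf '\251' | hex_bytes
printf '\251' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes
```

The octal escape is used because `\x` escapes are not supported by every printf.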

To convert your HTML files, you'll need to run piconv and also change "iso-8859-1" references to "utf-8". Be aware that there are several places in which encodings might be specified: for instance, meta and script elements may contain a charset attribute and XHTML documents may include encoding attributes.
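
One way to find declarations that still need changing is to grep for them. A minimal sketch, assuming the declarations look like `charset=...` or `encoding=...` with an optional quote; `list_latin1_refs` is a hypothetical helper name, and other forms of declaration would need their own patterns.

```shell
# list_latin1_refs: hypothetical helper; prints file:line:text for every
# line that still declares iso-8859-1 via charset= or encoding=.
# The trailing /dev/null makes grep print the filename even for one file.
list_latin1_refs() {
    grep -inE "(charset|encoding)=[\"']?iso-8859-1" "$@" /dev/null
}
```

Something like `list_latin1_refs utf8/*.html` should then print nothing once every reference has been updated.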

-- Ken


Re^2: Why won't Perl convert (Latin1 | ISO-8859-1) to (UTF-8 | utf8)?
by taint (Chaplain) on Jun 06, 2013 at 14:39 UTC
    G'day to you too, mate!
    I'd like to preface my response by letting you know that I really appreciate all the time and effort you put into all your responses -- +2 to you.
    As to the iso-8859-1 => utf8 issue I'm having, and your reply...
    I'm keen on the points you've made. I do recognize that iconv(1) && piconv(1) do not read HTML tags such as <meta http-equiv="content-type" content="application/xhtml+xml; charset=iso-8859-1 || utf-8" /> when they perform their action(s) upon their subject files.
    Which is why I was so puzzled as to why the file(s) would take on the requested "MimeType" after changing that line.
    I'm wondering if it wouldn't make more sense for me to attempt to create my own "converter" utilizing Encode.pm -- which I believe piconv(1) uses anyway.

    Thanks again, for taking the time to respond.

    --chris

    #!/usr/bin/perl -Tw
    use perl::always;
    my $perl_version = "5.12.4";
    print $perl_version;
      Which is why I was so puzzled as to why the file(s) would take on the requested "MimeType" after changing that line.

      From my experiments (and the documentation), file can only guess at the encoding by the presence or absence of specific codepoints. If your file has only those UTF-8 codepoints which also fit the Latin-1 encoding, perhaps it's all too happy to report the file as Latin-1. In my experience, iconv doesn't add a UTF-8 BOM to the start of the file. When I did that with Vim, file was a lot more specific.
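
A rough sketch of that behaviour, assuming an iconv(1) that writes no BOM for UTF-8 output. The helper names `add_utf8_bom` and `has_utf8_bom` are invented for the example.

```shell
# add_utf8_bom: hypothetical helper; copies stdin to stdout with the
# UTF-8 byte order mark (EF BB BF, written in octal) in front.
add_utf8_bom() {
    printf '\357\273\277'
    cat
}

# has_utf8_bom: hypothetical helper; succeeds if file $1 starts with a BOM.
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | xargs)" = "ef bb bf" ]
}

tmp=$(mktemp -d)
printf 'plain ASCII only\n' > "$tmp/ascii.txt"

# ASCII-only input is byte-for-byte identical in Latin-1 and UTF-8, so the
# converted file gives a guesser nothing to tell the two encodings apart.
iconv -f ISO-8859-1 -t UTF-8 "$tmp/ascii.txt" > "$tmp/converted.txt"
cmp -s "$tmp/ascii.txt" "$tmp/converted.txt" && echo "identical bytes"

# Prepending a BOM makes the encoding explicit in the file itself.
add_utf8_bom < "$tmp/converted.txt" > "$tmp/with_bom.txt"
has_utf8_bom "$tmp/converted.txt" || echo "no BOM after iconv"
has_utf8_bom "$tmp/with_bom.txt" && echo "BOM present"
```

Whether a BOM is wanted is a separate question; it is used here only to show why file became more specific once one was present.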

        Greetings chromatic, and thank you for your reply.

        Indeed. file(1) depends on its "Magic" file||dir for its response(s), and they aren't extremely concise. But it was only after noticing that my editor wasn't reporting a change from iso-8859-1 to utf8 that I used file(1) to help me confirm that my editor had not stopped functioning reliably after all these years. I might also add, my editor also recognizes when a file that it has opened has been "touched" -- as do all modern editors.

        I'll probably have to try and create a filter using Encode.pm to pipe these files through.

        I was just hoping that those utilities that were already designed to perform just such tasks, would | could accomplish this.


        Thanks again chromatic, for taking the time to respond.

        --chris

        #!/usr/bin/perl -Tw
        use perl::always;
        my $perl_version = "5.12.4";
        print $perl_version;

      Thanks for your complimentary remarks — they are appreciated.

      piconv does use Encode. It's also relatively short: if you ignore the option handling, POD, etc., you're left with probably less than 100 lines of code. So, if you wanted to use that as a starting point to roll your own version, I don't imagine it would be an overwhelmingly difficult task. However, having said that, if this is just a one-off exercise, perhaps something along these lines would suffice:

      $ for i in latin/*.html; do
      > piconv -f ISO-8859-1 -t utf8 $i | \
      > perl -pe 's/((?>charset|encoding)=)iso-8859-1/${1}utf-8/gi' - \
      > > utf8/`basename $i`
      > done

      -- Ken

        Greetings kcott, and thanks for your reply.

        Indeed. That does do the trick!
        Thanks!
        I sure wish I was as well versed in Perl as you are. But I'm afraid I've got a ways to go yet. :(

        As always, very grateful for all your time, and consideration.

        --chris

        #!/usr/bin/perl -Tw
        use perl::always;
        my $perl_version = "5.12.4";
        print $perl_version;
