PerlMonks  

Re^6: Encoding/decoding question

by slugger415 (Beadle)
on Sep 12, 2011 at 20:30 UTC (#925566)


in reply to Re^5: Encoding/decoding question
in thread Encoding/decoding question

If I set encoding to ASCII, it converts it to this:

&#195;&#169;

is that correct for the accented e? It still comes out wonky after my Perl script gets ahold of it:

réserve


Re^7: Encoding/decoding question
by tchrist (Pilgrim) on Sep 12, 2011 at 21:23 UTC
    The input is UTF-8, but you are treating it as Latin-1. You can't do that. That is why you are getting that sort of output.
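To make "treating UTF-8 as Latin-1" concrete, here is a minimal sketch using the core Encode module. It is not taken from the poster's script; the variable names are made up for illustration. It takes the two bytes a UTF-8 file stores for é and decodes them both ways.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The character U+00E9 (e with acute) as the bytes a UTF-8 file holds.
my $bytes = encode('UTF-8', "\x{E9}");

# Wrong: read each byte as one Latin-1 character.
my $as_latin1 = decode('ISO-8859-1', $bytes);
# Right: read the byte pair as one UTF-8 character.
my $as_utf8   = decode('UTF-8', $bytes);

printf "bytes:   %s\n", join ' ', map { sprintf '%02X',    ord } split //, $bytes;
printf "Latin-1: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $as_latin1;
printf "UTF-8:   %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $as_utf8;
```

The Latin-1 reading yields two characters (U+00C3 and U+00A9), which is the two-character junk that shows up where a single é was expected.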
      Sorry to be dumb here, but where am I treating it as Latin-1? How do I change it? -- Scott
        I am processing some XHTML pages (using XML::Twig) that contain numerous character entities, such as:
        &#195;&#169;
        Sorry to be dumb here, but where am I treating it as Latin-1? How do I change it?
        The problem is in your original XHTML file, because it has the literal 12‑byte ASCII sequence
        1. &
        2. #
        3. 1
        4. 9
        5. 5
        6. ;
        7. &
        8. #
        9. 1
        10. 6
        11. 9
        12. ;
        So something somewhere somewhen took a UTF‑8 file and replaced not each complete multibyte character with its single entity, but rather each individual component byte as the Latin‑1 code point number.
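A small sketch of that claim, assuming only the core Encode module (the variable names are illustrative): turn the two entities from the 12-byte sequence back into the bytes they number, then check that those bytes decode as a single UTF-8 character.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $mangled = '&#195;&#169;';    # the 12 ASCII bytes listed above

# Each numeric entity stands for one byte of the original file.
(my $bytes = $mangled) =~ s/&#(\d+);/chr($1)/ge;

# LEAVE_SRC keeps decode() from emptying $bytes when a CHECK value is given.
my $char = decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "%d bytes -> %d character, U+%04X\n",
    length($bytes), length($char), ord($char);
```

This prints `2 bytes -> 1 character, U+00E9`: the pair 195, 169 is 0xC3 0xA9, the UTF-8 encoding of é.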

        This may have happened because some program read an undecoded binary byte stream and never decoded it before trying to convert non‑ASCII into numeric entities. For example, here I use -CS in the first process to say it's UTF‑8 but then lie to the second one by using -C0 to say that it isn't. That would produce the sort of thing that you saw:

        $ perl -CS -le 'print "na\x{EF}vet\x{E9}"'
        naïveté
        $ perl -CS -le 'print "na\x{EF}vet\x{E9}"' | perl -C0 -pe 's/(\P{ASCII})/"&#".ord($1).";"/ge'
        na&#195;&#175;vet&#195;&#169;
        $ perl -CS -le 'print "na\x{EF}vet\x{E9}"' | perl -C0 -pe 's/(\P{ASCII})/sprintf "&#x%02X;", ord($1)/ge'
        na&#xC3;&#xAF;vet&#xC3;&#xA9;
        Compare with the right answers:
        $ perl -CS -le 'print "na\x{EF}vet\x{E9}"' | perl -CS -pe 's/(\P{ASCII})/"&#".ord($1).";"/ge'
        na&#239;vet&#233;
        $ perl -CS -le 'print "na\x{EF}vet\x{E9}"' | perl -CS -pe 's/(\P{ASCII})/sprintf "&#x%02X;", ord($1)/ge'
        na&#xEF;vet&#xE9;
        So what you really need to do is track down whatever errant procedure caused this mess in the first place, and fix that, since it will never work right that way.
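For whoever ends up fixing the errant procedure, here is a hedged sketch of what the bug and its fix might look like inside a program rather than on the command line. The helper `entitize` is a hypothetical name, not something from the poster's code, and the real culprit may look quite different.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: replace every non-ASCII character with a decimal entity.
sub entitize {
    my ($s) = @_;
    $s =~ s/(\P{ASCII})/'&#' . ord($1) . ';'/ge;
    return $s;
}

# What a UTF-8 file contains for the word, as raw bytes.
my $octets = encode('UTF-8', "na\x{EF}vet\x{E9}");

my $wrong = entitize($octets);                     # bytes never decoded
my $right = entitize(decode('UTF-8', $octets));    # decoded first

print "$wrong\n";    # na&#195;&#175;vet&#195;&#169;
print "$right\n";    # na&#239;vet&#233;
```

When the data comes from a file, opening it with an encoding layer such as `open my $fh, '<:encoding(UTF-8)', $path` does the decoding step for you.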

        This demonstrates putting it back to UTF-8:

        $ perl -CS -le 'print "na\x{EF}vet\x{E9}"' | perl -C0 -pe 's/(\P{ASCII})/"&#".ord($1).";"/ge' | perl -C0 -pe 's/&#(\d+);/chr($1)/ge'
        naïveté
        And this, heaven help you, demonstrates doing that and then doing the entities the right way around in the first place:
        $ perl -CS -le 'print "na\x{EF}vet\x{E9}"' | perl -C0 -pe 's/(\P{ASCII})/"&#".ord($1).";"/ge' | perl -MEncode -C0 -pe 's/&#(\d+);/chr($1)/ge; $_ = decode_utf8($_, 1); s/(\P{ASCII})/"&#".ord($1).";"/ge'
        na&#239;vet&#233;
        That means that if you were courageous enough, you could just do this:
        $ perl -i.unmangled.by.$$ -MEncode -C0 -pe 's/&#(\d+);/chr($1)/ge; $_ = decode_utf8($_, 1); s/(\P{ASCII})/"&#".ord($1).";"/ge' all*your*broken*files.xhtml
        Here's a version that runs as a script instead of from the command line:
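        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Encode;

        die "gimme args" unless @ARGV;

        # Edit the files in place, keeping a backup with this suffix.
        $^I = ".unmangled.by.$$";

        while (<>) {
            s/&#(\d+);/chr($1)/ge;                     # entity -> chr of its value
            $_ = decode_utf8($_, 1);                   # bytes -> characters; dies on bad UTF-8
            s/(\P{ASCII})/ "&#" . ord($1) . ";" /ge;   # character -> one entity
            print;
        }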
        No warranties, though. Make sure you thoroughly understand all this before you further mangle your files.
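One caveat worth checking before running the script above on real files: as far as I can tell, `decode_utf8($_, 1)` will die on any line that already holds a correct entity such as `&#233;` (a lone 0xE9 byte is not valid UTF-8) or an entity above 255. Below is a more cautious sketch, under the assumption that mangled characters always appear as unbroken runs of decimal entities in the range 128 to 255. The names `repair_line` and `repair_run` are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: repair only runs of entities numbered 128..255,
# leaving ASCII entities such as &#38; and anything above 255 alone.
sub repair_line {
    my ($line) = @_;
    $line =~ s{((?:&#(?:12[89]|1[3-9]\d|2[0-4]\d|25[0-5]);)+)}{ repair_run($1) }ge;
    return $line;
}

sub repair_run {
    my ($run) = @_;
    my @codes = $run =~ /&#(\d+);/g;
    my $bytes = pack 'C*', @codes;    # one byte per entity

    # If the bytes are not valid UTF-8, assume the run was correct as written.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $run unless defined $text;

    # Re-emit one entity per real character.
    $text =~ s/(\P{ASCII})/'&#' . ord($1) . ';'/ge;
    return $text;
}

print repair_line('r&#195;&#169;serve caf&#233; &#38; th&#8217;'), "\n";
```

This still cannot tell a mangled `&#195;&#169;` from a text that really meant those two Latin-1 characters, so the backup files remain a good idea.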
