PerlMonks  

Fixing broken character encoding

by pfaut (Priest)
on Jul 26, 2012 at 01:14 UTC (#983757)
pfaut has asked for the wisdom of the Perl Monks concerning the following question:

Is it possible to use Perl to fix broken HTML character encoding?

I am downloading RSS data from a site, and the feed appears to have been created by a broken program. It claims to be UTF-8, but I believe it should have been ISO-8859-1. I see things in the text stream that look like &acirc;&#128;&#153; which should translate to an apostrophe. I think something grabbed the bytes, converted them to HTML entities, and then claimed the result was UTF-8. I don't know enough about character encoding, or about the Perl modules for manipulating encodings, to figure out how I might convert this back to something that displays correctly in a browser.

I've already complained to the site admins but they haven't fixed the RSS generator yet and I don't suppose they will any time soon.

90% of every Perl application is already written.
dragonchild

Re: Fixing broken character encoding
by Anonymous Monk on Jul 26, 2012 at 03:02 UTC

    It is possible if it's been malformed only once (a single step); undoing multiple iterations can be impossible.
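A minimal sketch of what one reversible step looks like, using only the core Encode module. The sample string and variable names are made up for illustration. It assumes the damage was exactly one round of UTF-8 bytes being misread as ISO-8859-1; text that went through several rounds, or through a lossy conversion, may not come back this way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# A correct character string containing a right single quote (U+2019).
my $good = "it\x{2019}s";

# Simulate one round of damage: the UTF-8 bytes are misread as
# ISO-8859-1, so every byte becomes a character of its own.
my $damaged = decode('ISO-8859-1', encode('UTF-8', $good));

# Undo that single step: map the characters back to bytes, then
# decode those bytes strictly as UTF-8 (croaks on malformed input).
my $bytes    = encode('ISO-8859-1', $damaged);
my $repaired = decode('UTF-8', $bytes, Encode::FB_CROAK);

printf "good: %d chars, damaged: %d chars\n", length $good, length $damaged;
print  "repaired ", ($repaired eq $good ? "matches" : "differs from"), " the original\n";
```

The strict decode is a deliberate choice: if the bytes are not valid UTF-8, the guess about what went wrong is probably mistaken, and it is better to find out than to get silently mangled text.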

    IIRC, http://validator.w3.org/ can help (Bundle::W3C::Validator).

    So can these modules:
    HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
    Encode::Detective - detect a data encoding
    Encoding::FixLatin - takes mixed encoding input and produces UTF-8 output
    Encode::DoubleEncodedUTF8 - Fix double encoded UTF-8 bytes to the correct one

    But you ought to post a minimal HTML sample.

    I figure it ought to be as simple as parsing the file, decoding the entities, treating the resulting string as octets, and deciding what charset those octets are in.

      This could work:

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Data::Dump qw' dd ';
      use HTML::Entities qw' encode_entities decode_entities ';
      use Encode qw' encode decode ';
      use Encode::Detective qw' detect ';

      my $odata = my $str = '&acirc;&#128;&#153;';
      decode_entities($str);
      dd $str;
      dd encode_entities($str);
      my $encoding = detect($str);
      dd $encoding;
      $str = decode( 'UTF-8', $str );
      dd $str;
      dd encode_entities($str);
      __END__
      "\xE2\x80\x99"
      "&acirc;&#x80;&#x99;"
      "UTF-8"
      "\x{2019}"
      "&rsquo;"

        That's showing some promise. Thank you.
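The same steps could be wrapped up for a whole string of feed text, roughly as sketched below. `fix_entities` is a hypothetical helper name, and the sample inputs are invented; the sketch assumes HTML::Entities is installed and that the only damage is UTF-8 bytes written out as one entity per byte. Text that does not fit that pattern is returned with its entities decoded and nothing else changed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(decode_entities);

# fix_entities() is a hypothetical helper, not part of any module.
sub fix_entities {
    my ($text) = @_;

    # Step 1: turn the entities back into characters.
    my $octets = decode_entities($text);

    # A character above 0xFF means this cannot be a plain byte sequence,
    # so keep the entity-decoded text as it is.
    return $octets if $octets =~ /[^\x00-\xFF]/;

    # Step 2: treat the characters as octets and try a strict UTF-8 decode.
    my $chars = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };

    # Step 3: fall back to the entity-decoded text if that was not valid UTF-8.
    return defined $chars ? $chars : $octets;
}

my $fixed = fix_entities('It&acirc;&#128;&#153;s fine');
printf "third character is U+%04X\n", ord substr($fixed, 2, 1);
```

Falling back instead of dying keeps genuinely Latin-1 content such as `caf&eacute;` intact, at the cost of not reporting which inputs were left unrepaired.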

Re: Fixing broken character encoding
by moritz (Cardinal) on Jul 26, 2012 at 04:21 UTC
