
Fixing broken character encoding

by pfaut (Priest)
on Jul 26, 2012 at 01:14 UTC (#983757, perlquestion)
pfaut has asked for the wisdom of the Perl Monks concerning the following question:

Is it possible to use perl to fix broken HTML character encoding?

I am downloading RSS data from a site and it appears that it was created with a broken program. It claims to be UTF-8 but I believe it should have been ISO-8859-1. I see things in the text stream that look like &acirc;&#128;&#153; which should translate to an apostrophe. I think something grabbed the bytes, converted them to HTML entities and then claimed the result was UTF-8. I don't know enough about character encoding, or about the Perl modules that manipulate encodings, to figure out how I might convert this back to something that displays correctly in a browser.

I've already complained to the site admins but they haven't fixed the RSS generator yet and I don't suppose they will any time soon.
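
To make the suspected failure concrete, here is a minimal sketch of how such output could arise. The `mangle` helper is hypothetical and is only a guess at what the feed generator does: it takes the UTF-8 bytes of the text and writes each non-ASCII byte as its own numeric character reference. A real generator might emit named entities such as &acirc; for some bytes instead.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical reproduction of the suspected bug: each UTF-8 byte of a
# character is treated as a character of its own and written as an entity.
sub mangle {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);    # U+2019 becomes bytes E2 80 99
    # Replace every non-ASCII byte with a decimal character reference.
    $octets =~ s/([\x80-\xFF])/sprintf('&#%d;', ord($1))/ge;
    return $octets;
}

print mangle("It\x{2019}s"), "\n";    # It&#226;&#128;&#153;s
```

If the feed really was produced this way, the damage is a single, reversible step: decoding the entities gives back the original UTF-8 bytes.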

90% of every Perl application is already written.

Re: Fixing broken character encoding
by moritz (Cardinal) on Jul 26, 2012 at 04:21 UTC
Re: Fixing broken character encoding
by Anonymous Monk on Jul 26, 2012 at 03:02 UTC

    It is possible if it's been mangled only once (a single step); undoing multiple rounds of mangling can be impossible.

    IIRC the W3C validator can help ( Bundle::W3C::Validator )

    As can these
    HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
    Encode::Detective - detect a data encoding
    Encoding::FixLatin - takes mixed encoding input and produces UTF-8 output
    Encode::DoubleEncodedUTF8 - Fix double encoded UTF-8 bytes to the correct one

    But you ought to post a minimal HTML sample.

    I figure it ought to be as simple as parsing the file, decoding the entities, treating the resulting string as octets, and deciding what charset it is.

      This could work

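      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Data::Dump qw' dd ';
      use HTML::Entities qw' encode_entities decode_entities ';
      use Encode qw' encode decode ';
      use Encode::Detective qw' detect ';

      # The entity sequence as it appears in the feed
      my $odata = my $str = '&acirc;&#128;&#153;';

      # Entities become one character per original byte
      decode_entities($str);
      dd $str;
      dd encode_entities($str);

      # Guess what charset those octets are in
      my $encoding = detect($str);
      dd $encoding;

      # Decode the octets to get the intended character
      $str = decode( 'UTF-8', $str );
      dd $str;
      dd encode_entities($str);
      __END__
      "\xE2\x80\x99"
      "&acirc;&#128;&#153;"
      "UTF-8"
      "\x{2019}"
      "&rsquo;"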
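
      As a rough sketch of how the same steps might be applied defensively to arbitrary text from the feed: the `fix_once` helper below is hypothetical (it is not from this thread) and uses only the core Encode module. It assumes the entities have already been decoded, for example with decode_entities as above, and it undoes a single round of damage only when the result is strictly valid UTF-8; otherwise it returns the text unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: undo one round of "UTF-8 bytes read as single
# characters" damage in a string whose entities are already decoded.
sub fix_once {
    my ($text) = @_;

    # Characters above 0xFF cannot be mis-read single bytes; leave as is.
    return $text if $text =~ /[^\x00-\xFF]/;

    # Turn each character back into the byte it was read from.
    my $octets = encode('ISO-8859-1', $text);

    # Strict UTF-8 decode; croaks on malformed input, which eval catches.
    my $fixed = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };

    # Fall back to the original text if the octets were not valid UTF-8.
    return defined $fixed ? $fixed : $text;
}

my $broken = "It\xE2\x80\x99s";    # what decode_entities leaves behind
my $fixed  = fix_once($broken);
printf "U+%04X\n", ord( substr( $fixed, 2, 1 ) );    # U+2019
```

      The strict decode is the design choice that matters here: genuine Latin-1 text such as "caf\xE9" is not valid UTF-8, so it falls through untouched rather than being corrupted further.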

        That's showing some promise. Thank you.


Node Type: perlquestion [id://983757]
Approved by ww