PerlMonks  

Fixing broken character encoding

by pfaut (Priest)
on Jul 26, 2012 at 01:14 UTC
pfaut has asked for the wisdom of the Perl Monks concerning the following question:

Is it possible to use Perl to fix broken HTML character encoding?

I am downloading RSS data from a site, and it appears to have been created by a broken program. It claims to be UTF-8, but I believe it should have been ISO-8859-1. I see things in the text stream that look like &acirc;&#128;&#153; which should translate to an apostrophe. I think something grabbed the bytes, converted each one to an HTML entity, and then claimed the result was UTF-8. I don't know enough about character encoding, or about the Perl modules that manipulate encodings, to figure out how to convert this back into something that displays correctly in a browser.

I've already complained to the site admins but they haven't fixed the RSS generator yet and I don't suppose they will any time soon.

90% of every Perl application is already written.
dragonchild

Re: Fixing broken character encoding
by Anonymous Monk on Jul 26, 2012 at 03:02 UTC

    It is possible if the text has been mangled only once (a single step); after multiple rounds of mis-encoding, recovery can be impossible.
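To make the single-step case concrete, here is a minimal sketch that uses only the core Encode module. It assumes the damage was "UTF-8 bytes misread as ISO-8859-1", which is one common way this happens and not necessarily what the site in question did; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# The intended character: U+2019 RIGHT SINGLE QUOTATION MARK.
my $intended = "\x{2019}";

# One round of damage (assumed): the UTF-8 bytes are misread as
# ISO-8859-1, so every byte turns into a character of its own.
my $bytes   = encode('UTF-8', $intended);     # bytes E2 80 99
my $mangled = decode('ISO-8859-1', $bytes);   # three characters

# Undoing that single step: map the characters back to bytes,
# then decode those bytes as the UTF-8 they really were.
my $repaired = decode('UTF-8', encode('ISO-8859-1', $mangled));

# Print code points rather than the characters themselves,
# so the result does not depend on the terminal's encoding.
for my $pair ([mangled => $mangled], [repaired => $repaired]) {
    my ($label, $text) = @$pair;
    printf "%-9s %s\n", "$label:",
        join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}
```

Each extra round of damage needs its own matching reverse step, and any round that dropped or substituted characters cannot be reversed at all.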

    IIRC, http://validator.w3.org/ can help (Bundle::W3C::Validator).

    So can these modules:
    HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
    Encode::Detective - detect a data encoding
    Encoding::FixLatin - takes mixed encoding input and produces UTF-8 output
    Encode::DoubleEncodedUTF8 - Fix double encoded UTF-8 bytes to the correct one

    But you ought to post some minimal HTML.

    I figure it ought to be as simple as parsing the file, decoding the entities, treating the resulting string as octets, and deciding what charset those octets are in.

      This could work

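      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Data::Dump qw' dd ';
      use HTML::Entities qw' encode_entities decode_entities ';
      use Encode qw' encode decode ';
      use Encode::Detective qw' detect ';

      # Sample of the broken text: one entity per UTF-8 byte
      my $odata = my $str = '&acirc;&#128;&#153;';

      # Entities -> characters; each character is really one octet
      decode_entities($str);
      dd $str;
      dd encode_entities($str);

      # Guess the charset of those octets
      my $encoding = detect($str);
      dd $encoding;

      # Decode the octets as UTF-8 to get the intended character
      $str = decode( 'UTF-8', $str );
      dd $str;
      dd encode_entities($str);
      __END__
      "\xE2\x80\x99"
      "&acirc;&#128;&#153;"
      "UTF-8"
      "\x{2019}"
      "&rsquo;"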
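The script above decodes as UTF-8 unconditionally. In a feed where only some items are broken, a guarded version may be safer. The following is a rough sketch using only the core Encode module; `fix_mojibake` and `code_points` are hypothetical helper names, and the input is assumed to have already been through `decode_entities`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: repair text whose characters are really UTF-8
# octets, and return anything else unchanged.
sub fix_mojibake {
    my ($str) = @_;

    # A character above 0xFF cannot be a single octet: leave as is.
    return $str if $str =~ /[^\x00-\xFF]/;

    # Pure ASCII reads the same in both encodings: nothing to fix.
    return $str unless $str =~ /[\x80-\xFF]/;

    # Map characters 0x00-0xFF back to octets, then try a strict
    # UTF-8 decode. FB_CROAK makes decode() die on malformed input,
    # which is how genuine ISO-8859-1 text gets rejected here.
    my $bytes = encode('ISO-8859-1', $str);
    my $fixed = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };

    return defined $fixed ? $fixed : $str;
}

# Show a string as code points so output is terminal-independent.
sub code_points {
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $_[0];
}

for my $sample ("\xE2\x80\x99", "caf\xE9", "plain text") {
    printf "%s  ->  %s\n", code_points($sample),
        code_points(fix_mojibake($sample));
}
```

The strict decode is only a heuristic: a short ISO-8859-1 string can happen to be valid UTF-8, so it is worth spot-checking the results on real feed data.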

        That's showing some promise. Thank you.

        90% of every Perl application is already written.
        dragonchild
Re: Fixing broken character encoding
by moritz (Cardinal) on Jul 26, 2012 at 04:21 UTC
