
Re: Fixing broken character encoding

by Anonymous Monk
on Jul 26, 2012 at 03:02 UTC ( #983760=note )


in reply to Fixing broken character encoding

It is possible if it's been malformed only once (a single step); after multiple iterations, recovery can be impossible.
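To illustrate the single-step case, here is a minimal sketch using only the core Encode module. It assumes one particular kind of damage: UTF-8 octets that were misread as ISO-8859-1 and then written out again as UTF-8. The helper name `undo_one_round` is made up for this example, and your data may have been broken in some other way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK);

# Hypothetical helper: undo ONE round of "UTF-8 octets misread as
# ISO-8859-1 and re-encoded as UTF-8". Returns the repaired octets,
# or undef when the input does not fit that assumed kind of damage.
sub undo_one_round {
    my ($bytes) = @_;
    my $fixed = eval {
        my $chars = decode('UTF-8', $bytes, FB_CROAK);   # mojibake characters
        encode('iso-8859-1', $chars, FB_CROAK);          # back to original octets
    };
    return $fixed;
}

# Build a broken sample from a right single quote (U+2019).
my $good   = encode('UTF-8', "\x{2019}");                    # E2 80 99
my $broken = encode('UTF-8', decode('iso-8859-1', $good));   # C3 A2 C2 80 C2 99

my $fixed = undo_one_round($broken);
print join(' ', map { sprintf '%02X', ord } split //, $fixed), "\n";   # E2 80 99
```

ISO-8859-1 maps every byte to a character, so rounds of this exact kind can be peeled off one at a time. Once any intermediate step has replaced or dropped bytes it could not map, that information is gone and no amount of decoding brings it back.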

IIRC, http://validator.w3.org/ can help (Bundle::W3C::Validator).

As can these:
HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
Encode::Detective - detect a data encoding
Encoding::FixLatin - takes mixed encoding input and produces UTF-8 output
Encode::DoubleEncodedUTF8 - Fix double encoded UTF-8 bytes to the correct one

But you ought to post some minimal HTML.

I figure it ought to be as simple as parsing the file, decoding the entities, treating the string as octets, and deciding what charset it is.
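The "deciding what charset it is" step can be roughed out with core Encode alone, as a cruder stand-in for a real detector. This sketch assumes the entities have already been decoded and the result is held as octets; `guess_charset` is a hypothetical name, and all it really tests is whether the octets form strictly valid UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: a very rough guess at what a string of octets holds.
# 'ASCII'   - no byte above 0x7F
# 'UTF-8'   - has high bytes and decodes strictly as UTF-8
# 'unknown' - has high bytes that are not valid UTF-8 (some legacy charset?)
sub guess_charset {
    my ($octets) = @_;
    return 'ASCII' unless $octets =~ /[^\x00-\x7F]/;
    my $copy = $octets;    # decode() may modify its argument when CHECK is set
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 'UTF-8' : 'unknown';
}

print guess_charset("\xE2\x80\x99"), "\n";   # UTF-8
print guess_charset("\x92"),         "\n";   # unknown
```

Valid UTF-8 is unlikely to turn up by accident in text written in a single-byte charset, which is why a strict decode works as a first test. An 'unknown' result still leaves you to choose between the legacy candidates by other means.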


Re^2: Fixing broken character encoding
by Anonymous Monk on Jul 26, 2012 at 04:04 UTC

    This could work

    #!/usr/bin/perl --
    use strict; use warnings;
    use Data::Dump qw' dd ';
    use HTML::Entities qw' encode_entities decode_entities ';
    use Encode qw' encode decode ';
    use Encode::Detective qw' detect ';

    my $odata = my $str = '’';
    decode_entities($str);
    dd $str;
    dd encode_entities($str);
    my $encoding = detect($str);
    dd $encoding;
    $str = decode( 'UTF-8', $str );
    dd $str;
    dd encode_entities($str);
    __END__
    "\xE2\x80\x99"
    "’"
    "UTF-8"
    "\x{2019}"
    "’"

      That's showing some promise. Thank you.

      90% of every Perl application is already written.
      dragonchild
