PerlMonks  

Re: Fixing broken character encoding

by Anonymous Monk
on Jul 26, 2012 at 03:02 UTC ( [id://983760] )


in reply to Fixing broken character encoding

It is possible if it's been malformed only once (a single step); after multiple iterations it can be impossible.
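A rough sketch of that difference, using only the core Encode module. It assumes the damage is the common "UTF-8 octets misread as ISO-8859-1" kind; the helper names mangle and unmangle are made up for illustration. Each bad step can be reversed as long as nothing was dropped along the way, but a step that substitutes characters loses the original for good.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $good = "\x{2019}";    # RIGHT SINGLE QUOTATION MARK

# One bad step (assumed failure mode): UTF-8 octets misread as ISO-8859-1.
sub mangle   { decode('ISO-8859-1', encode('UTF-8', $_[0])) }

# The inverse: turn the characters back into octets, then read them as UTF-8.
sub unmangle { decode('UTF-8', encode('ISO-8859-1', $_[0])) }

my $once  = mangle($good);    # 3 characters: E2 80 99
my $twice = mangle($once);    # 6 characters

# Undo exactly as many steps as were applied.
my $fixed_once  = unmangle($once);
my $fixed_twice = unmangle( unmangle($twice) );

# A lossy step cannot be undone: by default encode() substitutes "?"
# for characters the target charset does not have.
my $lost = encode('ascii', $good);
```

Here $fixed_once and $fixed_twice both come back as the original character, while $lost is just a question mark with nothing left to recover.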

IIRC, http://validator.w3.org/ can help (Bundle::W3C::Validator).

So can these modules:
HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
Encode::Detective - detect a data encoding
Encoding::FixLatin - takes mixed encoding input and produces UTF-8 output
Encode::DoubleEncodedUTF8 - Fix double encoded UTF-8 bytes to the correct one

But you ought to post some minimal HTML that shows the problem.

I figure it ought to be as simple as parsing the file, decoding the entities, treating the resulting string as octets, and deciding what charset it is.
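The "treat the string as octets and decide" step might look roughly like this. It is a sketch only, using the core Encode module; fix_once is a hypothetical helper, and it assumes the only candidate charset is UTF-8 (a detector such as Encode::Detective would be needed to consider others). A strict decode that croaks on malformed input serves as the test.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: if every character fits in one byte, treat the
# string as octets and attempt a strict UTF-8 decode. Returns the
# repaired string, or the input unchanged when that does not apply.
sub fix_once {
    my ($str) = @_;
    return $str if $str =~ /[^\x00-\xFF]/;      # wide characters: not octets
    return $str unless $str =~ /[\x80-\xFF]/;   # pure ASCII: nothing to fix

    my $octets = encode('ISO-8859-1', $str);    # characters -> same-valued bytes
    my $fixed  = eval { decode('UTF-8', $octets, Encode::FB_CROAK) };
    return defined $fixed ? $fixed : $str;      # malformed UTF-8: leave as-is
}
```

Genuine Latin-1 text such as "caf\xE9" is not valid UTF-8 as octets, so it passes through untouched, while "\xE2\x80\x99" comes back as "\x{2019}".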

Re^2: Fixing broken character encoding
by Anonymous Monk on Jul 26, 2012 at 04:04 UTC

    This could work

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump qw' dd ';
    use HTML::Entities qw' encode_entities decode_entities ';
    use Encode qw' encode decode ';
    use Encode::Detective qw' detect ';

    my $odata = my $str = '’';
    decode_entities($str);
    dd $str;
    dd encode_entities($str);
    my $encoding = detect($str);
    dd $encoding;
    $str = decode( 'UTF-8', $str );
    dd $str;
    dd encode_entities($str);
    __END__
    "\xE2\x80\x99"
    "’"
    "UTF-8"
    "\x{2019}"
    "’"

      That's showing some promise. Thank you.

      90% of every Perl application is already written.
      dragonchild
