PerlMonks  

Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?

by grantm (Parson)
on Jun 18, 2011 at 00:44 UTC (#910276)


in reply to What's the best way to detect character encodings, Windows-1252 v. UTF-8?

You might want to look at Encoding-FixLatin. I created it for a very similar situation: I had a Postgres database from an application that had treated text as 8-bit binary strings. Each record was in exactly one of ASCII, UTF-8, ISO-8859-1 or CP1252, but the DB dump as a whole was a mixture of all four. The documentation for Encoding::FixLatin describes the heuristics it uses.
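For anyone who wants to see the general idea without installing anything, here is a rough sketch using only the core Encode module. It is not how Encoding::FixLatin is implemented (see the module's documentation for its actual heuristics); it only shows the common "try strict UTF-8 first, otherwise assume CP1252" approach applied to a whole string. The function name decode_guess is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode a byte string of unknown encoding.
# Assumption: the input is either valid UTF-8 (which includes plain
# ASCII) or a single-byte Windows-1252 / Latin-1 style encoding.
sub decode_guess {
    my ($bytes) = @_;

    # Strict UTF-8 decode; FB_CROAK makes malformed input die,
    # LEAVE_SRC stops decode() from modifying $bytes.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    # Not valid UTF-8, so fall back to treating it as CP1252.
    return decode('cp1252', $bytes);
}
```

With this, both "\xC3\x89ric" (UTF-8 bytes) and "\xC9ric" (Latin-1/CP1252 bytes) decode to the same character string. Note the limitation: a string that mixes UTF-8 and CP1252 bytes fails the strict check and is decoded entirely as CP1252, so this only suits data where each record uses one encoding throughout.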


Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Khen1950fx (Canon) on Jun 18, 2011 at 11:37 UTC
    I tried your module using ikegami's cp1252. It works for me:
    #!/usr/bin/perl
    use Modern::Perl;
    use Search::Tools::UTF8;
    use Encoding::FixLatin qw(fix_latin);
    use Encode::Locale;
    use Encode;

    # Match the console encoding when run interactively
    if ( -t ) {
        binmode(STDIN,  ":encoding(console_in)");
        binmode(STDOUT, ":encoding(console_out)");
        binmode(STDERR, ":encoding(console_out)");
    }

    my $text = "\xC9ric";

    if (is_latin1($text) eq 1) {
        say "$text is latin1";
    }
    else {
        exit;    # was "return", which is an error outside a subroutine
    }

    my $fix = fix_latin($text, ascii_hex => 0);

    if (looks_like_cp1252($fix) eq 0) {
        say "$fix cannot be mapped to utf8:-)";
    }
    else {
        exit;    # was "return"
    }

    say is_flagged_utf8($fix);
    say is_sane_utf8($fix);
    say is_valid_utf8($fix);
