
Converting UTF8 to ANSI

by palkia (Monk)
on Aug 25, 2017 at 18:09 UTC

palkia has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

I'm currently working with many *.txt files, but I can only use ANSI-encoded ones, and some of the files are UTF-8.
I tried to convert them, but the best I was able to figure out is how to detect the UTF-8 files ($txtLines[0] =~ /^\x{EF}\x{BB}\x{BF}/), so that I can convert them manually, one by one, with the "Notepad" program.
Eh..................... yeah, you get the picture.

Ideas ?
Thank you very much for any assistance ☺

Replies are listed 'Best First'.
Re: Converting UTF8 to ANSI
by haukex (Archbishop) on Aug 25, 2017 at 19:38 UTC

    \x{EF}\x{BB}\x{BF} is the UTF-8 BOM. That you're matching the encoded bytes EF BB BF rather than the Unicode character U+FEFF tells me you haven't opened the file with the right encoding, and personally I think decoding afterwards is more of a pain than opening the file with the right encoding in the first place. Also, by "ANSI" I assume you mean Windows-1252. Anyway, if you are certain that all of your UTF-8 encoded files begin with a BOM, you can use File::BOM. The following will open files that have a BOM with the proper encoding, but fall back to CP-1252 if they don't:

    use File::BOM qw/open_bom/;
    open_bom(my $fh, $filename, ':encoding(cp1252)');

    Otherwise, if you have no sure way of telling the files apart, you may have to use Encode::Guess, with the caveat that it's just a guess. Something like this maybe:

    use Encode::Guess;
    open my $fh, '<:raw', $filename or die $!;
    read $fh, my $buf, 1024; # may need bigger buffer for better guess?
    close $fh;
    my $enc = guess_encoding($buf, qw/cp1252 utf8 UTF-16/);
    ref($enc) or die "Can't guess $filename: $enc";
    print "$filename: guessed ",$enc->name,"\n"; #Debug
    open $fh, '<:encoding('.$enc->name.')', $filename or die $!;

    In both cases, you may want to strip the BOM off the beginning of the data read from the file via $data =~ s/\A\x{FEFF}//;
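
    For completeness, here is a minimal sketch of the whole conversion using only the core Encode module, in case installing File::BOM is a problem. It assumes that every UTF-8 file starts with a BOM, that files without one are already CP-1252, and that "ANSI" means CP-1252. The helper names utf8_bytes_to_cp1252 and convert_file are made up for illustration. Treat it as a starting point rather than a tested tool, and keep backups, since it overwrites files in place:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): takes the raw bytes of a file
# and returns them re-encoded as cp1252. Bytes that do not start with the
# UTF-8 BOM are assumed to be cp1252 already and are returned unchanged.
sub utf8_bytes_to_cp1252 {
    my ($bytes) = @_;
    return $bytes unless $bytes =~ /\A\xEF\xBB\xBF/;    # no BOM: leave as-is

    # Die on malformed UTF-8 instead of silently producing garbage.
    my $text = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    $text =~ s/\A\x{FEFF}//;                            # strip the decoded BOM

    # Characters that cp1252 cannot represent are replaced with a
    # substitution character ("?") by Encode's default behaviour.
    return encode('cp1252', $text);
}

# Hypothetical wrapper: converts one file in place.
# Returns 1 if the file was rewritten, 0 if it was left alone.
sub convert_file {
    my ($filename) = @_;

    open my $in, '<:raw', $filename or die "Can't read $filename: $!";
    my $bytes = do { local $/; <$in> };                 # slurp the whole file
    close $in;

    my $converted = utf8_bytes_to_cp1252($bytes);
    return 0 if $converted eq $bytes;

    open my $out, '>:raw', $filename or die "Can't write $filename: $!";
    print {$out} $converted;
    close $out or die "Can't close $filename: $!";
    return 1;
}
```

    Something like convert_file($_) for glob('*.txt'); would then run it over a directory. Note that characters with no CP-1252 equivalent are lost in the conversion, so check the output if your files contain anything unusual.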

      by "ANSI" I assume you mean Windows-1252.

      To find out a machine's actual "ANSI" encoding, you can use the following:

      use Win32 qw( );
      my $ansi_enc = "cp".Win32::GetACP();
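
      A hedged sketch of how that might be combined with a fallback, so that the same script also runs where Win32::GetACP is unavailable (a non-Windows machine, or an older Win32 module). The helper name ansi_encoding_name and the cp1252 default are assumptions for illustration:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: returns the name of the encoding to treat as "ANSI".
sub ansi_encoding_name {
    my $name = 'cp1252';    # assumed fallback, as guessed earlier in the thread

    # Only ask Windows when we are on Windows and GetACP actually exists.
    if ($^O eq 'MSWin32' && eval { require Win32; defined &Win32::GetACP }) {
        my $candidate = 'cp' . Win32::GetACP();
        # Use the reported code page only if Encode knows about it.
        $name = $candidate if find_encoding($candidate);
    }
    return $name;
}

print ansi_encoding_name(), "\n";
```

      The returned name can then be passed to Encode or used in an ':encoding(...)' layer.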

        Excellent tip. FWIW, I had to update Win32 to use it.

        PS C:\Users\moo> perl -MWin32 -E 'say Win32::GetACP()'
        Undefined subroutine &Win32::GetACP called at -e line 1.
        PS C:\Users\moo> cpanm Win32
        --> Working on Win32
        Fetching http://www.cpan.org/authors/id/J/JD/JDB/Win32-0.52.tar.gz ... + O
        Configuring Win32-0.52 ... OK
        Building and testing Win32-0.52 ... OK
        Successfully installed Win32-0.52 (upgraded from 0.44)
        1 distribution installed
        PS C:\Users\moo> perl -MWin32 -E 'say Win32::GetACP()'
        1252
        Thank you.
        I didn't even know a machine could have an "actual" ANSI encoding, as opposed to some other one.
        I'll look into it as soon as I can.
      Thank you for your reply.

      You are correct, my understanding of the differences between specific encoding forms is strictly theoretical and limited at the moment, especially in the Perl context (this is the first time I have ever encountered the term BOM).
      I hope to learn more about it as soon as I can.

      As for what I mean by ANSI, I really don't know.
      All I know is what the encoding line says when I "save as" a file with "Notepad" (Windows XP).

      Unfortunately I'm currently preoccupied with the fallout of attempting to install File::BOM as you can see here.
      Any assistance with this bigger issue will be most appreciated.
        As for what I mean by ANSI, I really don't know.

        Welcome to the wonderful world of character encodings!

        What you may mean is the ASCII character encoding. This is an old, 7-bit encoding; when each character is stored in an 8-bit byte, the most significant bit (bit 7) is always 0. One neat thing about the newer UTF-8 encoding (some people say it's the only neat thing) is that every valid ASCII string is automatically valid UTF-8, byte for byte. Unfortunately, things quickly go to pieces after that; most characters that UTF-8 can encode have no ASCII equivalent, and any mapping of them to ASCII is totally arbitrary. Oh, well...
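
        To make that concrete, here is a small sketch using the core Encode module (the variable names are only illustrative). It compares the bytes of ASCII-only text in both encodings, then shows what happens to a character outside ASCII:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $plain  = "Hello";       # ASCII-only text
my $smiley = "\x{263A}";    # U+263A WHITE SMILING FACE, not in ASCII

# ASCII-only text produces identical bytes in ASCII and in UTF-8.
my $same_bytes = encode('ascii', $plain) eq encode('UTF-8', $plain);

# Outside ASCII, UTF-8 uses multi-byte sequences with the high bit set.
my $utf8_smiley = encode('UTF-8', $smiley);
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $utf8_smiley;

# ASCII has no slot for this character, so Encode's default behaviour
# is to replace it with a substitution character.
my $ascii_smiley = encode('ascii', $smiley);

print "ASCII text identical in UTF-8: ", ($same_bytes ? "yes" : "no"), "\n";
print "Smiley as UTF-8 bytes: $hex\n";
print "Smiley encoded to ASCII: $ascii_smiley\n";
```

        The same kind of loss applies to Windows-1252: it covers more characters than ASCII, but still only a small fraction of what UTF-8 can encode.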


        Give a man a fish:  <%-{-{-{-<
