PerlMonks  

how to determine the text file encoding

by uva (Sexton)
on Mar 16, 2006 at 17:05 UTC ( [id://537216]=perlquestion )

uva has asked for the wisdom of the Perl Monks concerning the following question:

I found byte order marks (BOMs) only for these encodings:
"FF FE" UCS-2LE or UTF-16LE
"FE FF" UCS-2BE or UTF-16BE
"EF BB BF" UTF-8
But I got a strange byte order mark when writing some UTF-8 text into a UTF-8 file.
The BOM I got was "e4 b8 ad" after I wrote to that UTF-8 file.
You can check the BOMs at this site: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_42jv.asp
The code I used is:
    use Encode;
    use Encode::HanConvert;
    $str8=gb_to_simp("中国的网页"); #the string inside this is chinese characters
    use utf8;
    open OUT,">:utf8","D:\\output1.doc" or print "could not open";
    print "\n",utf8::is_utf8($str8),"\n\n\n";
    print OUT $str8;
    close OUT;
    no utf8;
Can anyone tell me why this is happening?
I am writing in UTF-8 only, but the byte order mark is different. Why? Please help me with this.
The actual byte order mark for UTF-8 is "EF BB BF".

Replies are listed 'Best First'.
Re: how to determine the text file encoding
by ikegami (Patriarch) on Mar 16, 2006 at 17:43 UTC

    (I'm guessing) Perl won't output a BOM character unless you tell it to do so using print(chr(0xFEFF));. While I can't run your program with my version of Perl, the print I mentioned gives me EF BB BF as expected. It makes sense for Perl to not output the BOM automatically because it is not always necessary, and sometimes it isn't even permitted. According to unicode.org,

    Where the data is typed, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any FEFF would be interpreted as a ZWNBSP.
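A minimal sketch of the point above, assuming a reasonably modern Perl: nothing writes a BOM for you, so U+FEFF has to be printed explicitly. The sketch writes to an in-memory buffer instead of a real file so the bytes can be inspected directly, and it uses the `:encoding(UTF-8)` layer where the original post used `:utf8`. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Write through a UTF-8 encoding layer into an in-memory buffer,
# so $buf ends up holding the raw bytes that a file would contain.
my $buf = '';
open my $out, '>:encoding(UTF-8)', \$buf or die "cannot open: $!";
print {$out} chr(0xFEFF);    # U+FEFF, written explicitly as the BOM
print {$out} "text";
close $out or die "cannot close: $!";

# Hex dump of the first three bytes.
my $first3 = join ' ',
    map { sprintf('%02X', ord($_)) } split(//, substr($buf, 0, 3));
print "$first3\n";           # EF BB BF
```

Dropping the `print {$out} chr(0xFEFF);` line leaves the buffer starting with the first character of the text itself, which is what the original poster saw.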
Re: how to determine the text file encoding
by graff (Chancellor) on Mar 17, 2006 at 03:05 UTC
    To follow up on the Anonymous Monk's reply (which is basically correct), you need to come up with the correct way to have a GB-encoded string assigned to a scalar variable, so that you can pass this to the "gb_to_simp()" function.

    For that matter, since you have your string in utf8 already, you don't need to use "gb_to_simp()" (or Encode) at all.

    And the point about the BOM, as explained in the first reply, is that perl won't print one automatically -- if you want it in the output, include it in the print statement.
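For the question in the title, here is a rough sketch of checking the leading bytes of a file against the three BOMs listed in the original post. `sniff_bom` is a hypothetical helper, not part of any module, and it only knows those three signatures (a UTF-32LE BOM, FF FE 00 00, would be misreported as UTF-16LE). It also shows that the bytes "e4 b8 ad" are not a BOM at all: they decode as UTF-8 to U+4E2D, the first character of the string being written.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess an encoding from the leading bytes of a file.
# Returns undef when none of the three known BOMs is present.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return;
}

# The three bytes reported in the question.
my $head  = "\xE4\xB8\xAD";
my $guess = sniff_bom($head);
print defined $guess ? "BOM: $guess\n" : "no BOM\n";    # no BOM

# Decoded as UTF-8, they are a single ordinary character.
my $char = decode('UTF-8', $head);
printf "U+%04X\n", ord($char);                          # U+4E2D
```

A missing BOM says nothing about the encoding, so a real detector would need a fallback, such as trying a strict UTF-8 decode.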

Re: how to determine the text file encoding
by Anonymous Monk on Mar 16, 2006 at 17:44 UTC

    You have a logic error in your premise here. This has nothing to do with BOMs. I'll explain.

    According to the documentation of Encode::HanConvert, gb_to_simp takes a string encoded in GBK and returns it in UTF-8. However, your string 中国的网页 is already in UTF-8. Your testing printout confirms it. Why is it in UTF-8? It's in the source code as a literal, and you declared your source code to be treated as UTF-8 by the use utf8; pragma.

    If you had additionally used strict and diagnostics, you would have noticed that you're doing something wrong.
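A small sketch of the distinction being made here, using only the core Encode module (Encode::HanConvert is left out on purpose). It assumes the script itself is saved as UTF-8 and that `use utf8;` appears before the literal, since the pragma only affects code that follows it. `cp936` is the name Encode uses for GBK; the GBK bytes are produced with `encode` here only to simulate legacy input.

```perl
use strict;
use warnings;
use utf8;                        # this source file is UTF-8; must precede the literal
use Encode qw(encode decode);

my $text = "中国的网页";          # five characters, already decoded by the pragma

# Encoding to UTF-8 is what a '>:utf8' output layer does on print.
my $utf8_bytes = encode('UTF-8', $text);

# Only genuine GBK bytes need a GBK-to-Unicode conversion step.
my $gbk_bytes = encode('cp936', $text);      # simulated legacy input
my $from_gbk  = decode('cp936', $gbk_bytes);

printf "%d chars, %d UTF-8 bytes, %d GBK bytes\n",
    length($text), length($utf8_bytes), length($gbk_bytes);
# 5 chars, 15 UTF-8 bytes, 10 GBK bytes

print "round trip ok\n" if $from_gbk eq $text;
```

The UTF-8 output begins with the bytes E4 B8 AD, the encoding of the first character, which matches what showed up at the start of the file in the question.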
