PerlMonks |
can't get rid of BOM from UTF-8 webpage
by BeneSphinx (Sexton) on May 20, 2012 at 07:23 UTC ([id://971467])
BeneSphinx has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that reads a UTF-8 encoded webpage (actually just a text file on a website). It is in UTF-8 with the byte-order-mark (BOM) sequence, although the Content-Type header is just text/plain. I get it with code like:
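A minimal sketch of one way to fetch and decode such a page (this is an assumption, not the poster's actual code; LWP::UserAgent, the URL, and the variable names are all hypothetical):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Encode qw(decode);

# Hypothetical URL; the poster's actual address is not shown in the question
my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://example.com/file.txt');
die $res->status_line unless $res->is_success;

# The server says text/plain with no charset, so decode the raw
# response bytes as UTF-8 explicitly rather than trusting a guess
my $text = decode('UTF-8', $res->content);

# After decoding, the three BOM bytes become the single code point
# U+FEFF at the start of the string; strip it if present
$text =~ s/\A\x{FEFF}//;
```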
Whenever I run this in a Windows CMD prompt, I always get the BOM printed to the screen as a few stray characters. I've seen lots of suggestions, from switching the Windows code page to UTF-8 ("chcp 65001") to decoding or encoding at various stages, but nothing works.

When I print to a file, however, I get a file that both Notepad and Notepad++ can read without the BOM showing. I think they both detect it as UTF-8 and hide the BOM. Yet when I run "type result.txt" from the CMD prompt, it spits out the file contents with the BOM showing again.

So it seems that throughout the process, Perl, Notepad, and Notepad++ correctly and consistently treat the text as UTF-8. What's odd is that the CMD prompt doesn't, and always shows those marks, even after I change the code page to 65001.

My first question is why the CMD prompt isn't handling the BOM correctly, even after being told to use the UTF-8 code page. My second question is why Perl insists on keeping the BOM and printing it later. I would have expected it to be stripped during the initial read of the text file, since it's just packaging, and omitted from Perl's internal character representation.

Overall, though, I'd like to learn where to fix the problem. Do I configure Windows differently? Do I read the text file differently in Perl? Or do I just print things differently in Perl? Any insights or suggestions would be greatly appreciated.
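One commonly suggested approach, sketched here as an assumption rather than a verified answer to the question: decode the bytes, strip U+FEFF explicitly, and put an explicit :encoding layer on both output handles, so neither the console nor the file ever sees the BOM (the byte string below is illustrative, not the poster's data):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative input: the UTF-8 BOM (EF BB BF) followed by ASCII text
my $bytes = "\xEF\xBB\xBF" . "hello, monks";

my $text = decode('UTF-8', $bytes);   # BOM becomes the single char U+FEFF
$text =~ s/\A\x{FEFF}//;              # strip it so it is never printed

# Encode on the way out; pairs with "chcp 65001" in the console
binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";

# Writing through an :encoding layer emits no BOM of its own
open my $fh, '>:encoding(UTF-8)', 'result.txt' or die $!;
print {$fh} $text;
close $fh;
```

The key design point is that the BOM survives only if it is still in the string when you print; Perl's :encoding(UTF-8) layer neither adds nor removes it, so the substitution has to happen in your own code.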
Back to Seekers of Perl Wisdom