PerlMonks  

can't get rid of BOM from UTF-8 webpage

by BeneSphinx (Sexton)
on May 20, 2012 at 07:23 UTC ( [id://971467]=perlquestion )

BeneSphinx has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that reads a UTF-8 encoded webpage (actually just a text file on a website). It is in UTF-8 with the byte-order-mark (BOM) sequence, although the Content Type header is just text/plain. I get it with code like:

    use LWP::UserAgent;
    use XML::Simple;
    use Encode qw(encode decode);
    use warnings;
    use strict;
    ...
    my $content;
    my $ua = LWP::UserAgent->new;
    my $request = HTTP::Request->new("GET",$url);
    unless ($skipAuth){
        $request->authorization_basic($user, $pass);
    }
    $ua->prepare_request($request);
    my $response = $ua->send_request($request);
    if ($pageFormat eq "xml"){
        $content = XMLin($response->decoded_content((charset => "utf8")));
    } else { #txt
        $content = $response->decoded_content((charset => "utf8"));
        #Prints UTF-8 as expected.
        print "CONTENT CHARSET " . $response->content_charset() . "\n\n";
        #All of the below statements print the BOM as literal characters to the Windows CMD screen
        print "CONTENT: " . $response->content() . "\n";
        print "DECODED CONTENT: " . $response->decoded_content() . "\n";
        print "DECODED CONTENT WITH UTF-8 SPECIFIED: " . $response->decoded_content((charset => "utf8")) . "\n";
        print "MANUALLY DECODED: " . decode("UTF-8", $response->content());
    }

Whenever I run this in the Windows CMD prompt, the BOM is always printed to the screen as literal characters. I've seen lots of suggestions, from switching the Windows code page to UTF-8 ("chcp 65001") to decoding or encoding at various stages, but nothing works.

When I print to file, however, I get a file that both Notepad and Notepad++ can read without the BOM. I think they both detect it as UTF-8 and hide the BOM:

    open(RESULT, ">result.txt");
    print RESULT $content;
    close(RESULT);

When I run "type result.txt" from the cmd prompt, it spits out the file contents with the BOM showing again.

So, it seems that throughout the process, Perl, Notepad, and Notepad++ correctly and consistently treat the text as UTF-8. What's odd is that the CMD prompt doesn't, and always shows those marks, even after I change the code page to 65001.

My first question is why the CMD prompt isn't handling the BOM correctly, even after being told to use the UTF-8 code page. My second question is why Perl insists on keeping the BOM and printing it later. I would have expected it to be stripped during the initial read of the text file, since it's just packaging, and omitted in Perl's internal character representation.

Overall, though, I'd like to learn where to fix the problem. Do I configure Windows differently? Do I read the text file differently in Perl? Or do I just print things differently in Perl? Any insights or suggestions will be greatly appreciated.

Replies are listed 'Best First'.
Re: can't get rid of BOM from UTF-8 webpage
by Anonymous Monk on May 20, 2012 at 08:15 UTC

    Hi :)

    My second question is why Perl insists on keeping the BOM and printing it later

    Because it would be insane to throw it away without being told to throw it away.

    I would have expected it to be stripped during the initial read of the text file, since it's just packaging, and omitted in Perl's internal character representation.

    Besides not being mere packaging, it isn't "omitted"; your expectation is wrong.

    Overall, though, I'd like to learn where to fix the problem. Do I configure Windows differently? Do I read the text file differently in Perl? Or do I just print things differently in Perl? Any insights or suggestions will be greatly appreciated.

    For cmd.exe, change fonts; I have read that fonts are responsible for not showing the BOM.

    Or try PowerShell; I hear that thing is Unicode by default, so it ought to come with fonts that know to hide the BOM.

    Or, from Perl, strip the BOM, say by using :encoding(UTF-8):via(File::BOM), and/or skip printing the BOM when -t FILEHANDLE says the filehandle is opened to a tty (tty means console, i.e. cmd.exe).
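
As a minimal sketch of the "strip it from Perl" option, assuming the content has already been decoded to characters: the helper name strip_bom and the sample bytes are made up for illustration and are not from the original script. It uses only the core Encode module rather than File::BOM.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: remove a single leading U+FEFF from an
# already-decoded character string; everything else is left untouched.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

# Stand-in for $response->content(): raw UTF-8 bytes starting with a BOM.
my $bytes   = "\xEF\xBB\xBFhello\n";

# decode() turns the three BOM bytes into the one character U+FEFF;
# it does not drop it, so we strip it explicitly afterwards.
my $content = strip_bom( decode('UTF-8', $bytes) );

# Encode characters back to UTF-8 bytes on output.
binmode(STDOUT, ':encoding(UTF-8)');
print $content;
```

The same substitution could be applied to the value returned by decoded_content before printing or writing it to a file.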


    I've seen lots of suggestions ...

    Next time, include those links in your post :)

    FWIW, Content-type is not charset

    FWIW, utf8 is not UTF-8, the difference could be important
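
A small sketch of that difference, using only Encode's find_encoding: the two spellings resolve to different encoding objects. According to the Encode documentation, "UTF-8" maps to the strict encoding and "utf8" to Perl's lax internal variant.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up the canonical name Encode uses for each spelling.
my $strict = find_encoding('UTF-8')->name;   # strict, standard-conforming UTF-8
my $lax    = find_encoding('utf8')->name;    # Perl's more permissive variant

print "UTF-8 resolves to: $strict\n";
print "utf8  resolves to: $lax\n";
```

When decoding data from an untrusted source such as a web server, the strict "UTF-8" spelling is usually the safer choice.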

    BUT, FWIW, you shouldn't specify a charset (utf8 or UTF-8) to decoded_content; declaring it is the web server's job, and it should just work already

    My first question is why the CMD prompt isn't handling the BOM correctly,

    seems to me something on MSDN would answer that :p

Re: can't get rid of BOM from UTF-8 webpage
by mbethke (Hermit) on May 20, 2012 at 15:34 UTC

    The codepage doesn't have anything to do with it; either the shell strips the BOM internally or it doesn't. However, if it implemented Unicode correctly, it should render the BOM as an invisible character. Of course a BOM is completely superfluous¹ in UTF-8 (Notepad, BTW, is notorious for writing one anyway) and I agree it could well be discarded upon reading. As it isn't, just strip it out as suggested in the post above.

    ¹ OK, it could serve to identify UTF-8 with just the first couple of bytes if it were consistently applied, but as it's not recommended by the standard, hardly anyone does it, so identification of unknown files always has to rely on larger data chunks anyway.
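
To make "the first couple of bytes" concrete, here is a minimal sketch using only the core Encode module. The helper name has_utf8_bom is hypothetical. It shows the bytes that U+FEFF becomes in UTF-8 and a check for them at the start of raw, undecoded data.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+FEFF encoded as UTF-8 is the three-byte sequence EF BB BF.
my $bom_char  = "\x{FEFF}";
my $bom_bytes = encode('UTF-8', $bom_char);
print join(' ', map { sprintf '%02X', ord } split //, $bom_bytes), "\n";

# Hypothetical helper: true if raw (undecoded) data starts with a UTF-8 BOM.
sub has_utf8_bom {
    my ($bytes) = @_;
    return substr($bytes, 0, 3) eq "\xEF\xBB\xBF" ? 1 : 0;
}

print has_utf8_bom("\xEF\xBB\xBFsome text") ? "BOM present\n" : "no BOM\n";
print has_utf8_bom("some text")             ? "BOM present\n" : "no BOM\n";
```

As the footnote says, the absence of these bytes proves nothing, so a check like this can only confirm a BOM, not rule out UTF-8.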
