PerlMonks  

"Wide character in print"

by axl163 (Scribe)
on May 05, 2007 at 22:30 UTC ( [id://613765]=perlquestion )

axl163 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I have run into this issue and, according to past threads, I am supposed to tell Perl to output in UTF-8. I added the following to my code:

open(OUT,">out.html") || die("Cannot Open File");
binmode(OUT, ":utf8");

The error has gone away but there are weird characters that are still showing up like this when I print it out:
 Upromise – The way to save for college
and
 Client Server Security for Your Small Business –
I am using SOAP::Lite to obtain html links and the symbols do not seem to get encoded/decoded correctly when I obtain them. The code I use to obtain them is something like this:

Update: I updated the code I used to obtain the information.
open(OUT,">out.html") || die("Cannot Open File");
binmode(OUT, ":utf8");
my $service = SOAP::Lite->service('http://www.url.com/file.wsdl');
my $arguments = "my_id";
my $result = $service->search($my_arguments);
my $url = $link->{linkCodeHTML};
print OUT $url . "<br>";

I'm not exactly sure what it is supposed to look like, since it looks like that when I print it out. I assume they are trademarks of some sort.

Any suggestions would be greatly appreciated...

Thanks,

Perl newb

Replies are listed 'Best First'.
Re: "Wide character in print"
by graff (Chancellor) on May 06, 2007 at 00:53 UTC
    What is supposed to be showing up? Where is the data coming from? What is your script doing to it before printing it?

    The warning message about "wide character in print" was telling you that you were using "print" (or printf) with the output file handle, the data being printed contained strings flagged as containing non-ASCII utf8 characters, and the handle had not been declared to accommodate such data.

    Sure, changing the file handle to accommodate utf8 data gets rid of the warning, but it doesn't really change the data that caused the warning in the first place.
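
To illustrate the point above, here is a minimal sketch (not the poster's code) that prints the same string to an in-memory file handle twice, once with no layer and once with an explicit encoding layer. The helper name print_with_layer and the sample string are assumptions made up for this example. It suggests that the layer only silences the warning; the bytes that end up in the output are the same either way.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle, optionally
# through an I/O layer, and return the bytes written plus any warnings.
sub print_with_layer {
    my ($text, $layer) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buffer = '';
    open(my $out, '>', \$buffer) or die "Cannot open in-memory file: $!";
    binmode($out, $layer) if defined $layer;
    print {$out} $text;
    close($out);
    return ($buffer, @warnings);
}

# Sample text containing U+2013 EN DASH (a "wide" character).
my $text = "Upromise \x{2013} The way";

my ($raw,  @w1) = print_with_layer($text);                      # no layer
my ($utf8, @w2) = print_with_layer($text, ':encoding(UTF-8)');  # explicit layer

printf "warnings without layer: %d\n", scalar @w1;
printf "warnings with layer:    %d\n", scalar @w2;
printf "same bytes written:     %s\n", ($raw eq $utf8 ? 'yes' : 'no');
```

So if the output still looks wrong after adding the layer, the problem is more likely in how the data was obtained or in how the file is being viewed.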

    You need to supply more information. There are a variety of possible "solutions" -- changing how you view the data, changing the data in any of various ways before printing it, and so on -- but we don't know enough about your problem yet to make a recommendation.

    Update: Now that you have supplied more information, I can make a few observations:

    1. Whenever you change the content of your post, please make it clear to others that you have changed the content -- use "update:" (like I've done here) to indicate what has been added, and put  <strike> ... </strike> around things that you want to delete (rather than just deleting them), so that replies that were based on your original post will still make sense.

    2. Based on the context you've added around the "weird characters" (originally you just showed those characters in isolation), it looks like you are downloading a page that might be using some character set other than utf8, and it's being interpreted incorrectly as (or into) utf8.

    You should try looking at the original content in a browser window, and use different character encodings in that window until you see a display that makes sense. That's one way to figure out which encoding is being used in the source data.

    I would expect that the true character encoding being used would be mentioned somewhere in the data, as part of the header, or a tag attribute, or something -- that's another way to find that out.

    If you just want to get rid of the wide characters, you can do this, which will work no matter what is going wrong with the encoding:

    s/[^[:ascii:]]+//g; # get rid of non-ASCII characters

    If you need to keep those characters, the first thing is to look at your output using a browser, so that utf8 data are displayed correctly using utf8 characters. In that view, if you see two or more characters where you expected to see only one, you'll need to figure out how to use the Encode module on your data.

    But if you get to that point and can't figure it out, you'll need to show us a small amount of usable code that demonstrates the problem. Just saying "I'm using SOAP::Lite" (as you did in your first update) isn't enough.

    Another update: That sequence of three non-ASCII bytes that shows up twice in your updated sample text happens to be "\xE2","\x80","\x93". This is the utf8 byte sequence to express the unicode character "\x{2013}", which turns out to be "EN DASH" (in other words, a dash that looks much like a hyphen). To see it as a hyphen, you could just do s/\x{2013}/-/g; on your text data, provided the data has been decoded into characters rather than left as raw bytes.

    (But if you're getting other wide characters beside that one, you might not find suitable ASCII correlates for all of them, so this may not work out as a general solution.)
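
As a rough sketch of the step described above, the following uses the core Encode module to turn the three raw bytes into a single character before substituting it. The sample string is invented for illustration and is assumed to arrive as undecoded UTF-8 bytes; whether SOAP::Lite hands back bytes or characters in the original poster's setup is not known from this thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: raw UTF-8 bytes, as they might arrive undecoded.
my $bytes = "Upromise \xE2\x80\x93 The way to save for college";

# On raw bytes the pattern does not match: there are three separate
# bytes here, not one U+2013 character.
my $matches_raw = ($bytes =~ /\x{2013}/) ? 1 : 0;

# Decode the bytes into Perl characters; the three bytes become one character.
my $chars = decode('UTF-8', $bytes);

# Now the substitution from the post finds the EN DASH.
(my $ascii = $chars) =~ s/\x{2013}/-/g;

printf "length as bytes: %d, length as characters: %d\n",
    length($bytes), length($chars);
print "matched on raw bytes: ", ($matches_raw ? 'yes' : 'no'), "\n";
print "$ascii\n";
```

If the substitution appears to do nothing on real data, checking whether the string was ever decoded is a reasonable first step.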

      use pragma { no warnings; print ... } or binmode STDOUT, ":utf8";
        Graff, The line below worked great! I found articles about encoding and including modules and nothing worked. This worked great! I also learned a little about using strike. Thanks!!!!  s/[^[:ascii:]]+//g;  # get rid of non-ASCII characters
      Adding to graff's suggestion, using Text::Unidecode will translate your wide characters to ascii:
      use utf8;
      use Text::Unidecode;
      foreach ( @strings ){
          $_ =~ s/([^[:ascii:]]+)/unidecode($1)/ge;
      }
Re: "Wide character in print"
by graff (Chancellor) on May 07, 2007 at 07:19 UTC
    Now that you've updated your node again (thanks for indicating the update), I'd point out that the code as posted looks bad, and I'd be surprised if it works at all. Since you are using "my" to declare variables, you should be doing "use strict;" as well, to get the real benefit.

    As it is, you assign a value to "my $arguments", but then you don't use it (you use a variable called $my_arguments instead). Then, I can't figure out where the $link thing is coming from... should that have been spelled "$result" instead?

    Maybe I'm just naive, but when I tried the url you mention in your code (www.url.com/file.wsdl), I somehow got to "coolchaser.com", and that page did not appear to have any wide-character data (nor a "search" service).

    Based on your updates, it doesn't seem as though you've made any progress, despite the information you've been given. Maybe you could post a reply to one of my nodes in this thread, to provide a simple, runnable code snippet that demonstrates the problem. (Please don't make any more major changes to your root node -- this thread is already too confused.)

    Also, let us know whether you've tried the suggestions, and what happened when you did. What are you using to view the output? Are you still in doubt about the fact that your three bytes of "weird characters" are simply the utf8 sequence for the "en dash" character (unicode codepoint U+2013)?

Re: "Wide character in print"
by ikegami (Patriarch) on Sep 14, 2009 at 16:38 UTC

    The error has gone away but there are weird characters that are still showing up like this when I print it out:

    There are many different ways of representing characters using bytes. These are called "character encodings", or just "encodings" for short.

    By using :utf8, you told Perl to encode the characters using UTF-8. (Actually, using a superset of UTF-8 specific to Perl, but that's ok.) However, your viewer appears to be assuming the content of the file is encoded using iso-latin-1 (or something).

    Tell your viewer the file is UTF-8, or use the encoding your viewer expects instead of UTF-8. The latter is done using:

    binmode OUT, ':encoding(name_of_encoding_here)';

    For files encoded using UTF-8, some viewers will react positively to having chr(0xFEFF) as the first character.
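
The mismatch described above can be reproduced in a small sketch. It assumes the viewer is treating the file as Windows-1252 (cp1252), which is a guess on my part: the post only says "iso-latin-1 (or something)". The sample text is also made up. The first half shows how one EN DASH turns into three unrelated characters when UTF-8 bytes are misread; the second half writes the same text through an :encoding(cp1252) layer instead, so a cp1252 viewer would see a single dash.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample text ending in U+2013 EN DASH.
my $text = "Small Business \x{2013}";

# The bytes a UTF-8 output layer would write to the file.
my $utf8_bytes = encode('UTF-8', $text);

# What a viewer "sees" if it wrongly assumes the bytes are cp1252.
my $misread = decode('cp1252', $utf8_bytes);
my @seen = map { sprintf 'U+%04X', ord } split //, substr($misread, -3);
print "misread as: @seen\n";

# Alternative: write in the encoding the viewer expects (assumed cp1252).
my $buffer = '';
open(my $out, '>:encoding(cp1252)', \$buffer)
    or die "Cannot open in-memory file: $!";
print {$out} $text;
close($out);
printf "last byte written with cp1252 layer: 0x%02X\n", ord(substr($buffer, -1));
```

Which of the two fixes is appropriate depends on who controls the viewer; for an HTML file, declaring the charset in the page is usually the less lossy option, since cp1252 cannot represent most of Unicode.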

Node Type: perlquestion [id://613765]
Approved by Corion