PerlMonks  

Simplest Possible Way To Disable Unicode

by JapanIsShinto (Initiate)
on May 23, 2011 at 22:01 UTC (#906373=perlquestion)
JapanIsShinto has asked for the wisdom of the Perl Monks concerning the following question:

I generate binary data from my Perl scripts. I do this by building up strings of yummy data that range from characters 0x00 to 0xFF. And I've been doing this for years without any problem.

Then, I get the bright idea to upgrade my Perl, and with it, I magically enter the world of Unicode. Maybe Unicode is great. Maybe one day it will make sense to me, and I'll appreciate it in all its glory.

But not right now, please. Right now, I just want my scripts to work just like they worked in the past. Yeah, I'm sometimes on Windows and have to binmode() things. I'm used to that. But now when I do that, I get this:

Wide character in syswrite at ...

I start to read and... OH MY GOD, IT'S FULL OF STARS. Or at least endless documentation that I'm sure is terribly fascinating, but which doesn't get me any closer to doing what I want.

So please, take pity on me. There must be some incantation I can put in my scripts that tells Perl that I want to pretend-- even if it's just for a little while-- that Unicode doesn't exist and I'm back in 2005 when life felt simpler.

Alternatively, knowing that I have code that merrily does things like this...

my $image = chr(0xff) . pack('cN', $someThing, $otherThing);

... is there some minimally-invasive way to wrap the generation of such binary data so that when I syswrite it, I get good old-fashioned binary data out?

Re: Simplest Possible Way To Disable Unicode
by graff (Chancellor) on May 23, 2011 at 23:37 UTC
    If you show a little more code, esp. how your perl script opens the output file, that may help. Your perl upgrade might be interacting with your current shell environment regarding "locale". What version were you using the last time it worked? What version did you upgrade to? In what OS?

    Then, maybe a sample of the output you expect (e.g. as a snippet hex dump from a good file), along with a sample of what your upgraded perl is producing (also as a hex dump snippet).

    Till then, there's nothing much we can do...

    Update: BTW, the way to avoid unicode "interpretation" on output is to open the output file like this:

    open( my $outfh, '>:raw', $filename ) or die "$filename: $!\n";
    (but if the data you print to the output file handle has somehow been flagged as utf8 text, you might get a warning about that)
      Making sure the handle won't distort bytes sent to it won't help since the problem is that it's not bytes being sent to it.
Re: Simplest Possible Way To Disable Unicode
by Anonymous Monk on May 23, 2011 at 23:55 UTC
    $ perl -e " print chr(999) "
    Wide character in print at -e line 1.
    ╧
    $ perl -e " binmode STDOUT; print chr(999) "
    Wide character in print at -e line 1.
    ╧
    $ perl -Mdiagnostics -e " print chr(999) "
    Wide character in print at -e line 1 (#1)
        (S utf8) Perl met a wide character (>255) when it wasn't expecting
        one. This warning is by default on for I/O (like print). The easiest
        way to quiet this warning is simply to add the :utf8 layer to the
        output, e.g. binmode STDOUT, ':utf8'. Another way to turn off the
        warning is to add no warnings 'utf8'; but that is often closer to
        cheating. In general, you are supposed to explicitly mark the
        filehandle with an encoding, see open and perlfunc/binmode.
    ╧
    So to silence the warning, use
    $ perl -e " no warnings q[utf8]; print chr(999) "
    ╧
    using bytes would also silence it, but would change the semantics
    $ perl -e " use bytes; print chr(999) "
    τ
    perluniintro, perlunicode

      Simply silencing the warning is not the solution.

      print chr(200); print chr(1000);

      will continue to be different than

      print chr(200) . chr(1000);

      as shown here:

      >perl -we"no warnings qw( utf8 ); print chr(200); print chr(1000);" | perl -nE"say length;"
      3
      >perl -we"no warnings qw( utf8 ); print chr(200) . chr(1000);" | perl -nE"say length;"
      4

      Update: Mistakenly used 100 instead of 200 originally.
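A minimal sketch of why the two commands report different lengths, assuming a handle with no encoding layer. The helper name `bytes_written` is hypothetical, and an in-memory handle stands in for the pipe used in the thread. When printed alone, chr(200) fits in one byte, while a string containing chr(1000) cannot be reduced to bytes, so Perl emits its internal UTF-8 form for the whole string.

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence "Wide character" so only the byte counts show

# Hypothetical helper: print each piece separately to an in-memory handle
# that has no encoding layer, then return how many bytes were written.
sub bytes_written {
    my @pieces = @_;
    open(my $fh, '>:raw', \my $buffer) or die "in-memory open failed: $!";
    print {$fh} $_ for @pieces;
    close($fh);
    return length $buffer;
}

print bytes_written(chr(200), chr(1000)), "\n";    # two separate prints
print bytes_written(chr(200) . chr(1000)), "\n";   # one concatenated string
```

The counts should match the 3 and 4 reported in the thread: silencing the warning does not change which bytes are written.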

        Simply silencing the warning is not the solution.

        So what is the solution?

      I don't see when use bytes; would possibly be useful to solve this issue. Instead of printing a warning and emitting the possibly useful UTF-8 encoding of the input, it simply destroys the high bits of the bad data.

      $ perl -we'use bytes; $_ = chr(1000); print;' | od -t x1
      0000000 e8
      0000001
      $ perl -we'use bytes; $_ = chr(232); print;' | od -t x1
      0000000 e8
      0000001

      Whereas the original situation had a chance of providing something useful, this isn't the case with use bytes;. It possibly makes things worse and hides the error.

        I don't see when use bytes; would possibly be useful to solve this issue.
        Agreed. It's not even clear to me what sort of problem it would solve.

        I've never been fond of od output myself. And you shouldn't need it, either. This tells the story clearly enough:

        % perl -wle 'print ord do { use bytes; chr(1000) }'
        232
        % perl -wle 'print ord do { use bytes; chr(232) }'
        232
        % perl -wle 'print ord do { no bytes; chr(1000) }'
        1000
Re: Simplest Possible Way To Disable Unicode
by ikegami (Pope) on May 24, 2011 at 00:17 UTC

    File handles can only be used to transmit bytes. The warnings indicate you are trying to transmit characters that aren't bytes, meaning something that doesn't match /^[\x00-\xFF]*\z/.

    You need to convert the text you are trying to send into bytes. The process is a special case of serialisation known as character encoding.

    You may use Encode's encode, or you may add an encoding layer to the file handle.

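    # Option 1: encode explicitly, then write the bytes
    open(my $fh, '>', ...) or die ...;
    my $buf = encode('UTF-8', $text);
    syswrite($fh, $buf);   # Or: print $fh $buf;

    # Option 2: open the handle with an encoding layer
    open(my $fh, '>:encoding(UTF-8)', ...) or die ...;
    syswrite($fh, $text);  # Or: print $fh $text;

    # Option 3: add the encoding layer to an already open handle
    binmode($fh, ':encoding(UTF-8)');
    syswrite($fh, $text);  # Or: print $fh $text;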
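A self-contained sketch of the explicit-encode route, assuming UTF-8 is the wanted output encoding. The sample text and the in-memory handle are illustrative choices, not something taken from the thread; a real script would open a file instead.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Illustrative text: four Latin-1 range characters, a space, one wide character.
my $text  = "caf" . chr(0xE9) . " " . chr(0x263A);
my $bytes = encode('UTF-8', $text);   # characters in, bytes out

# Write the bytes through a handle with no encoding layer.
open(my $fh, '>:raw', \my $buffer) or die "in-memory open failed: $!";
print {$fh} $bytes;
close($fh);

printf "%d characters became %d bytes\n", length($text), length($buffer);
```

Because the string handed to print already holds only bytes, no "Wide character" warning is raised.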

    is there some minimally-invasive way to wrap [my $image = chr(0xff) . pack('cN', $someThing, $otherThing);] so that when I syswrite it, I get good old-fashioned binary data out?

    That code always produces bytes, even for invalid inputs. The resulting string will never cause "Wide character" warnings.
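One way to check that claim before writing, sketched with the same pattern quoted earlier in this reply. The helper name `is_byte_string` and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical guard: true if every character of the string fits in one byte.
sub is_byte_string {
    my ($str) = @_;
    return $str =~ /^[\x00-\xFF]*\z/ ? 1 : 0;
}

# Same shape as the code in the question, with made-up values.
my $image = chr(0xff) . pack('cN', -1, 0xDEADBEEF);

print is_byte_string($image)              ? "bytes\n" : "wide\n";
print is_byte_string($image . chr(0x100)) ? "bytes\n" : "wide\n";
```

A check like this just before syswrite would point at the offending string rather than at the write.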

Re: Simplest Possible Way To Disable Unicode
by John M. Dlugosz (Monsignor) on May 24, 2011 at 00:27 UTC
    That is a fascinating topic. I've run into that error from time to time, and others get it in a different situation from you: when the internal data is already known to be the same as how it should be written (UTF-8).

    I think Perl has ratcheted up its features and gradually changed things, and you skipped too many generations in one go.

    What version are you using now? The latest 5.14?

    I think lots of us would like to discuss or at least see how "wide" data comes to be (you didn't store a funny character directly, right?) and what can control that.

    So, it would be especially good if you could post a trivial program that does some manipulation (that's not obviously referring to any wide characters directly), writes it out, and gets that error.

      when the internal data is already known to be the same as how it should be written (UTF-8).

      syswrite and print print the string, not its internal data. The internal storage format is not relevant here.

      >perl -we"$_ = chr(0xE9); utf8::downgrade($_); print;" | perl -nE"say length;"
      1
      >perl -we"$_ = chr(0xE9); utf8::upgrade($_); print;" | perl -nE"say length;"
      1
      >perl -we"$_ = chr(1000); utf8::upgrade($_); print;" | perl -nE"say length;"
      Wide character in print at -e line 1.
      2
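The same point can be sketched without a pipe: upgrading or downgrading changes only the internal storage, not the string that Perl code sees. The variable names are illustrative.

```perl
use strict;
use warnings;

my $down = chr(0xE9);
my $up   = chr(0xE9);
utf8::downgrade($down);   # stored internally as a single byte
utf8::upgrade($up);       # stored internally as UTF-8

printf "same string: %s\n", $down eq $up ? 'yes' : 'no';
printf "lengths: %d and %d\n", length($down), length($up);
printf "internal flags: %d and %d\n",
    (utf8::is_utf8($down) ? 1 : 0), (utf8::is_utf8($up) ? 1 : 0);
```

Only the last line differs between the two variables, which is why the storage format does not decide whether the warning appears.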
Re: Simplest Possible Way To Disable Unicode
by BrowserUk (Pope) on May 24, 2011 at 03:40 UTC

    I agree. I also want use bytes; to disable those dumb warnings when I use chr and pack 'C*' on numbers greater than 255.

    If I do pack 'C*', 257;, I am explicitly stating that I am packing 8-bit byte data, not f****** "wide characters", and if the numeric value is greater than 8-bits, it should be silently truncated.

    If I want to pack wide characters, I can use the U template. I don't need or want those two data types conflated.

    More to the point, I think unicode should be explicitly enabled by those that need it, not have to be disabled by those that don't.

    IF it were possible for unicode to be used transparently, then it might make some sense to enable it by default, but since it cannot, it doesn't.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Also, the open pragma doesn't disable the warnings either
      use open qw' :std IO :bytes ';
      use open qw' :std IO :raw ';

      it should be silently truncated.

      no warnings qw( pack );

      More to the point, I think unicode should be explicitly enabled by those that need it

      You're getting an overflow warning. It has nothing to do with Unicode. In fact, pack and unpack don't use Unicode at all.*

      * — Not even "U" has any understanding of Unicode.

      >perl -wE"say sprintf '%X', unpack 'U', pack 'U', 0x200000"
      200000
        no warnings qw( pack );

        So, you'd have us throw away all the useful warnings that pack can emit when I do something wrong in order to disable the stupid warning emitted when it does something wrong. Cool-io. Not.

        You're getting an overflow warning.

        Oh sure. "Wide character" says 'overflow', like super-injunction says right to privacy for all.

        It has nothing to do with Unicode.

        Really? Can you guess where this direct quote " A Unicode character number." comes from?

        I don't give a flying fig whether you want to conflate the term 'unicode' with that multiplicitous cock-up of formats that hide behind the moniker 'The Unicode Standard'(*), and can't see that I used the former as a short-hand for 'multi-byte character sets'.

        Which should of course be 'The Multicode Standards:Everything including the (7 different) kitchen sinks'

        * Not even "U" has any understanding of Unicode.
        >perl -wE"say sprintf '%X', unpack 'U', pack 'U', 0x200000"
        200000

        Wadday'know. If you pack with U and unpack with U you get back what you packed. D'uh. A pointless example of nothing much.

        This is the problem.

        perl -wE"$s=pack 'U*', 257; say length $s; print for unpack 'C*', $s;"
        1
        257

        That totally devalues the purpose of having two different template characters.

        • one for C   An unsigned char (octet) value.
        • one for U   A Unicode character number.  Encodes to a character in character mode and UTF-8 ... in byte mode.

        That should not happen. And I shouldn't have to state that I don't want it to happen:

        >perl -Mbytes -wE"$s=pack 'U*', 257; say length $s; say for unpack 'C*', $s;"
        2
        196
        129

        It breaks backward compatibility in the very worst way.

        • Screaming when you are doing nothing wrong.

          Breaking both existing, working code and existing expectations. And causing people to disable important and useful warnings to silence it.

        • And saying nothing at all when it does the wrong thing.

          Just silently breaking previously working, 'best practice' code violating every expectation and rule of change and enhancement.

        The Unicode Standard is a cock-up. And the Perl implementation worse.


      I entirely agree that letting cavalier coding errors slip silently by is a Very Bad Thing.

      The pack function is one of those many built-in functions that is much improved by being wrapped with a fatalizing envelope. Something as simple as this should suffice:

      *CORE::GLOBAL::pack = sub ($@) {
          use warnings FATAL => "pack";
          return CORE::pack(shift(), @_);
      };
      That will catch a lot of bugs that risk being carelessly ignored.
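A narrower variant of the same idea, sketched without a global override: make pack warnings fatal only inside one block and trap the failure with eval. The value 257 is an arbitrary out-of-range example.

```perl
use strict;
use warnings;

my $value = 257;   # does not fit in an octet

my $ok = eval {
    use warnings FATAL => 'pack';   # lexically scoped to this block
    my $byte = pack 'C', $value;
    1;
};
my $error = $@;

print $ok ? "packed silently\n" : "caught: $error";
```

Outside the block, pack keeps its ordinary warning behaviour, so other code is unaffected.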

      Hope this helps.

        Hope this helps.

        Have you heard of chocolate teapots?


Re: Simplest Possible Way To Disable Unicode
by SimonClinch (Chaplain) on May 24, 2011 at 15:09 UTC
    Assuming you are not yourself causing the packing of too much into a byte anywhere, in which case the warning is IMO fair, (like the chr(999) example cited), you could overload syswrite to unpack the data into bytes before invoking the real syswrite or failing that have a warning-free module that does nothing more than that.

    One world, one people

      ... you could overload syswrite to unpack the data into bytes before invoking the real syswrite...
      Which is just what encoding layers exist for:
      % perl -C0 -we 'syswrite(STDOUT, chr(0x500), 1)' | wc -c
      Wide character in syswrite at -e line 1.
      0
      Exit 255
      % perl -CS -we 'syswrite(STDOUT, chr(0xE9), 1)' | wc -c
      2
      % perl -CS -we 'syswrite(STDOUT, chr(0x1000), 1)' | wc -c
      3
      % perl -CS -we 'syswrite(STDOUT, chr(0x01_0000), 1)' | wc -c
      4
      % perl -CS -we 'syswrite(STDOUT, chr(0x0100_0000), 1)' | wc -c
      5
      % perl -CS -we 'syswrite(STDOUT, chr(0x1000_0000), 1)' | wc -c
      6
      % perl -CS -we 'syswrite(STDOUT, chr(0xF000_0000), 1)' | wc -c
      7
      % perl -CS -M-warnings=portable -we 'syswrite(STDOUT, chr(0x10_0000_0000), 1)' | wc -c
      13
Re: Simplest Possible Way To Disable Unicode
by JapanIsShinto (Initiate) on May 24, 2011 at 19:13 UTC

    Thanks everyone for your replies. The problem was in my code, specifically that I was at one point in the code generating a character value greater than 255. This was unintentional, and I throw myself at the mercy of the Perl Monks for my transgressions against software engineering.

    What led to the solution was the comment someone made about 'chr(999)'. My Unicode-challenged brain saw that and realized that I might accidentally be generating invalid character codes. So I wrote a loop over the string and had it dump out any characters that weren't 0 to 255. And there it was-- I had, essentially, inserted chr(0x4094) in a string I created.
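The loop described above might look something like the following. This is a guess at its shape rather than the code actually used; the helper name `find_wide_chars` and the sample string are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: list every character whose ordinal exceeds 255,
# with its offset in the string.
sub find_wide_chars {
    my ($str) = @_;
    my @hits;
    while ($str =~ /([^\x00-\xFF])/g) {
        push @hits, sprintf 'offset %d: U+%04X', pos($str) - 1, ord $1;
    }
    return @hits;
}

# Sample data with one stray wide character appended.
my $image = chr(0xff) . pack('cN', 1, 2) . chr(0x4094);
print "$_\n" for find_wide_chars($image);
```

Running such a check on each string just before it is written narrows down where the stray value entered.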

    I'm still curious whether there is some way I can force Perl into a pre-Unicode state. My guess is that in the past, attempting to give chr() an argument greater than 255 would have resulted in a fatal error (or at least a warning) that would have tipped me off to the problem. The advice I've been given here (and the documentation I've read) talks about Unicode support in Perl at different phases-- when reading Perl source code, when reading strings from files opened with a Unicode encoding, when comparing strings and characters, when writing to files, and probably other times as well. So to be clear, what I'd like is the One Ring That Will Rule Them All-- the incantation that tells Perl, "please turn it ALL off."

      I'm still curious whether there is some way I can force Perl into a pre-Unicode state.

      There's no way to disable support for wide strings. The closest is use bytes;, but it would have made your error harder to find.

      There's no way to prevent chr(0x4096) from working short of overriding chr.

      $ perl -E'
         use subs qw( chr );
         sub chr(_) {
            my $ch = CORE::chr($_[0]);
            $ch le "\xFF" or die;
            return $ch;
         }
         say chr for 65, 4096;
      '
      A
      Died at -e line 5.

      My guess is that in the past, attempting to give chr() an argument greater than 255 would have resulted in a fatal error (or at least a warning) that would have tipped me off to the problem.

      It did result in a fatal error, and despite being further down the line, the error message was spot on.

Re: Simplest Possible Way To Disable Unicode
by Discipulus (Curate) on May 25, 2011 at 09:07 UTC
    hello there!

    Please don't let this topic die! This is a BIG problem. The situation from my perspective is:

    • Some people in the past made a big mistake (but "graves are full of 'day-after knowledge'"; does anyone understand this?), presuming computers would speak a 24-letter idiom forever..
    • That hubris provoked the susceptibility of chaos|god|nature|the law of large numbers, the old trick of the Tower of Babel came into play again, and someone stood up screaming: C r m
    • Meanwhile a demiurge was crafting a powerful tool to Practically Export and Report LotOfThings..
    Eras have passed and now we have:

    • a lot of poor-minded coders (with me in pole position..) bewildered by the complexity of dealing with subtle differences between two different ways of viewing the same shaggy tail of bits
    • a lot of guru coders not so happy because some of their pack, syswrite or whatever else spells have lost the shine of primeval eras..
    • a crew of heroes trying heroically|naively to resolve this babel by releasing new versions of The Tool that take the babel into account, but not completely, leaving the choice to disable this, or that, or use that other .. (it is good of them to leave us some freedom: no sarcasm at all)..
    and no one is happy!

    Please, someone write down the "Travel in Babel's lands with Perl in a pocket" tutorial.
    If I can add something, I think the meaning of the English term Encode as used here is a little misleading for non-English speakers..

    I discovered Babel some time ago and asked for wisdom about length in Size and anatomy of an HTTP response.
    As I did there, I invite everyone interested to read (after the canonical texts: perluniintro, perlunitut, perlunifaq and perlunicode) also The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and http://perlgeek.de/en/article/encodings-and-unicode

    Lor*

    there are no rules, there are no thumbs..
      a lot of guru coders not so happy because some of their pack, syswrite or whatever else spells have lost the shine of primeval eras..

      That simply isn't what is going on here.

      The docs for pack say:

      • C   An unsigned char (octet) value.
      • W   An unsigned char value (can be greater than 255).
      • U   A Unicode character number.  Encodes to a character in character mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms) in byte mode.

      Now let's see what happens when we assign oversized values to other unsigned types:

      print unpack 'S', pack 'S', 65537;;
      1
      print unpack 'L', pack 'L', 2**32+1;;
      1
      print unpack 'Q', pack 'Q', 2**64+1;;
      18446744073709551615

      It silently wraps (or truncates) as is expected and normal.

      Contrast that with what now (since the advent of unicode support) happens with unsigned char values:

      print unpack 'C', pack 'C', 2**8+1;;
      Character in 'C' format wrapped in pack at (eval 17) line 1, <STDIN> line 9.
      1

      A dumb warning that can only be disabled by disabling *all* pack warnings. Don't forget the 'W' and 'U' types above.
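For code that does want wrap-around, one workaround (not proposed in the thread, and offered only as a sketch) is to mask each value to 8 bits before packing. Nothing is then out of range, so the other pack warnings can stay enabled. The sample values are arbitrary.

```perl
use strict;
use warnings;

my @values = (65, 257, 1000);

# Mask to the low 8 bits so every value fits the 'C' template.
my $packed = pack 'C*', map { $_ & 0xFF } @values;

print join(' ', unpack 'C*', $packed), "\n";
```

This makes the truncation explicit at the call site instead of relying on pack's behaviour.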

      It is perfectly reasonable to expect silent truncation of oversized values with unsigned char types ('C'). Just as was the case with 'C' before the addition of unicode support; and just as is still the case with all other unsigned types. This is not an error, nor "sloppy coding"; it is the norm for these types.

      Now contrast this spurious warning with what happens when you use chr with oversized values:

      $s = chr( 257 );;
      print do{ use bytes; length $s, unpack 'C*', $s };;
      2 196 129

      Perl silently accepts this error, and erroneously constructs a multi-byte character.

      And you only discover this error when you try to print it:

      print $s, length $s;;
      Wide character in print at (eval 19) line 1, <STDIN> line 11.
      ─ 1

      Which may not happen until dozens or hundreds of lines further on into the code; perhaps in another of your source files; perhaps in a module you didn't write or even know that you were (indirectly) using.

      That is the very worst kind of error situation: action at a distance.

      So, the problem is not (only) that "spells have lost the shining of primeval eras", but rather that the current, here today and tomorrow, state of play is that Perl issues spurious warnings for code that has always been (and, by the evidence of other similar current operations, still should be) considered normal, whilst silently not just ignoring a possible programmer error, but then making asinine assumptions and implementing the wrong thing, in a way that means such errors are horribly difficult to track down.

      You cannot have it both ways. Fobbing this off with "documentation error" or "ancient sloppy coding practices" doesn't cut it.

      Either *all* oversized assignments to unsigned types should silently truncate; or *all* should warn.

      Either chr should be only for 8-bit bytes and attempts to set oversized values should warn in-situ or chr should accept multi-byte ordinals and print should know how to handle them.

      Except the latter is impossible because Unicode is such a crock.

      One solution would be to add a wchr function that accepted multi-byte ordinals. That would make it very clear that the programmer is expecting to program with MBCSs and allow chr to catch coding errors at source.

      Another, in my opinion preferable, solution would be to have it so that pre-unicode support semantics were followed everywhere, unless a use Unicode; statement was seen.

      Ie. Instead of having to try (and fail) to disable these changes when you don't want them with use bytes;, when you want Unicode semantics, you ask for them. Seem logical?

      Unfortunately, it is too late for that.


        ..ohhh

        I chose to speak ironically (spell, shine, ..) exactly because I did not have a clear idea about what was going on..
        thanks for the explanation.

        Lor*
        there are no rules, there are no thumbs..
Re: Simplest Possible Way To Disable Unicode
by Anonymous Monk on Sep 02, 2011 at 06:16 UTC

    @JapanIsShinto: I've been in your boat today wearing your shoes.

    I recently upgraded Perl and now some of the scripts that I've been running have been dying with the dreaded "Wide character in syswrite at blahblah.pl line nnnn." message.

    In one particular case I was using LWP::Simple's get() method to retrieve an image (note: an image is binary data) and write it to disk. Previously this was working with...

    my $content = get($url);
    # other stuff...
    die "Could not write: $filename\n$!" unless open(my $fdOut, '>', $filename);
    binmode($fdOut);
    syswrite($fdOut, $content, length($content));
    close($fdOut);

    Oh so simple and oh so effective. Enter Unicode... grumble.

    Contrary to all the comments I can see on this thread... sometimes a string of bytes is just a string of bytes. Neither Perl nor anything else should be trying to interpret it as a string of characters. In the end, the solution was using Encode::_utf8_off() like the following...

    use Encode;
    my $content = get($url);
    Encode::_utf8_off($content);
    # other stuff...
    die "Could not write: $filename\n$!" unless open(my $fdOut, '>', $filename);
    binmode($fdOut);
    syswrite($fdOut, $content, length($content));
    close($fdOut);

    Hope this helps...
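A small sketch of what Encode::_utf8_off does to a string, compared with utf8::downgrade, assuming a string whose characters all fit in a byte but which happens to be stored internally as UTF-8. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $original = chr(0xE9);      # one character, fits in a byte
utf8::upgrade($original);      # same string, stored internally as UTF-8

my $flag_off = $original;
Encode::_utf8_off($flag_off);  # the internal bytes become the string

my $downgraded = $original;
utf8::downgrade($downgraded);  # same string, now stored as a single byte

printf "_utf8_off: length %d\n", length $flag_off;
printf "downgrade: length %d\n", length $downgraded;
```

So _utf8_off can change the data that is written, while utf8::downgrade keeps the string intact and dies if a character does not fit in a byte. Whether either is appropriate depends on how the content came to hold wide characters in the first place.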
