PerlMonks  

Printing the first letter of the Hebrew alphabet (U05D0) kills script?

by ELISHEVA (Prior)
on Mar 08, 2011 at 15:42 UTC (#892034=perlquestion)
ELISHEVA has asked for the wisdom of the Perl Monks concerning the following question:

This is clearly not a Perl problem (or at least I don't think so), but I can't think of a better place to get an understanding of what is happening.

If I run a bash shell using xterm -u8 (-u8 turns on utf8 mode for xterm), the following line in a Perl script appears to make a script die without even executing the end blocks:

#print first letter of Hebrew alphabet (aleph)
my $ch=chr(0x5D0);
print STDERR "$ch\n";

This is just an appearance. In reality the line doesn't kill the script at all. I was able to run the same script in an xemacs command shell, and it is clear that the script is running to completion. (see below for test code and output). I can also avoid sudden death by starting up xterm in wide character mode (xterm -u8 -wc).

I'd like to understand why U05D0 causes sudden death when wide character mode is off. Here in Israel, the first letter of the Hebrew alphabet (aleph) isn't exactly an exotic character. The more serious problem is that any test script I have also goes silent and appears to die if it prints out a diagnostic that contains that character, unless it is running in a specially configured terminal. Not good.

Other utf8 characters sometimes display two characters where I expect one, or display the wrong glyph (or a placeholder box). I could understand ugly output, but what is special about U05D0 that would make a terminal think it should stop displaying output sent to STDOUT and STDERR?

Also if there are any Israeli monks out there (or Hebrew speaking monks from other parts of the world) reading this who are familiar with this issue and have a work around they use, please speak up!

Platform details:

Debian (Lenny)
system perl (5.10.0)
xterm version: XTerm(235)
bash: GNU bash, version 3.2.39(1)-release (i486-pc-linux-gnu)

Test script:

use strict;
use warnings;
use PerlIO;
use Devel::Peek;

my $ch=chr(0x5D0);
Devel::Peek::Dump($ch);

binmode(STDERR);
print STDERR "layers for STDERR: @{[PerlIO::get_layers(STDERR)]}\n";
print STDERR "$ch\n"; #complains about wide character

binmode(STDERR, ":utf8");
print STDERR "layers for STDERR: @{[PerlIO::get_layers(STDERR)]}\n";
print STDERR "$ch\n"; # no complaints here

print STDERR "I survived :-) !!!\n";
print STDOUT "I really did. I really did.\n";

# End blocks to help verify that STDERR output is being
# truncated, and script is not merely aborting
END { warn "Ah...dead\n"; }
END { warn "I'm dying :-( \n" }

Output in Xemacs shell:

SV = PV(0x817c6d0) at 0x8197e90
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x819e970 "\327\220"\0 [UTF8 "\x{5d0}"]
  CUR = 2
  LEN = 4
layers for STDERR: unix perlio
Wide character in print at Monks/Foo.pm line 916.
\220א
layers for STDERR: unix perlio utf8
\220א
I survived :-) !!!
I really did. I really did.
I'm dying :-(
Ah...dead

Output on xterm -u8 -wc (widechar on) - output is the same as xemacs except that U05D0 prints as "" not "\220"

SV = PV(0x817c6d0) at 0x8197e90
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x819e970 "\327\220"\0 [UTF8 "\x{5d0}"]
  CUR = 2
  LEN = 4
layers for STDERR: unix perlio
Wide character in print at Monks/Foo.pm line 916.
layers for STDERR: unix perlio utf8
I survived :-) !!!
I really did. I really did.
I'm dying :-(
Ah...dead

Output on xterm -u8 (widechar off). Notice how all output to STDOUT and STDERR after the wide character warning disappears, as if U05D0 causes STDOUT and STDERR to close. Note that it does not hang. The script just terminates with no further visible output and a prompt for a new command appears.

$ perl myscript.pl
SV = PV(0x817c6d0) at 0x8197e90
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x81a2560 "\327\220"\0 [UTF8 "\x{5d0}"]
  CUR = 2
  LEN = 4
layers for STDERR: unix perlio
Wide character in print at Monks/Foo.pm line 916.
$

Note: switching the order of output so that output to the STDOUT w/ a utf8 layer comes first does not improve the situation. Instead of dying after the warning, it dies silently on the print statement.

$ perl myscript.pl
SV = PV(0x817c6d0) at 0x8197e90
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x819f610 "\327\220"\0 [UTF8 "\x{5d0}"]
  CUR = 2
  LEN = 4
layers for STDERR: unix perlio utf8
$

Update: clarified that the script terminates with no further output and does not hang.
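One possible stopgap for the test-diagnostics problem described above is to escape anything outside printable ASCII before it reaches the terminal. The following is only a sketch: `ascii_safe` is a hypothetical helper name, not part of any module mentioned in this thread, and it trades readable glyphs for output that does not depend on how the terminal is configured.

```perl
use strict;
use warnings;

# Replace every character outside printable ASCII (newline excepted)
# with a \x{....} escape, so no high-bit bytes reach the terminal.
# ascii_safe is a hypothetical helper name used for illustration only.
sub ascii_safe {
    my ($text) = @_;
    (my $safe = $text) =~ s/([^\x20-\x7E\n])/sprintf '\\x{%04X}', ord $1/ge;
    return $safe;
}

my $ch = chr(0x5D0);
print STDERR ascii_safe("diagnostic containing aleph: $ch\n");
```

The diagnostic then shows the code point as an escape rather than a glyph, which is enough to see what a test produced even in a terminal that mishandles the raw bytes.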

Re: Printing the first letter of the Hebrew alphabet (U05D0) kills script?
by kennethk (Monsignor) on Mar 08, 2011 at 17:02 UTC
    Character 0x05 corresponds to ENQ in ASCII (0xD0 is Ð in CP-1252). My guess is that the ENQ is hanging your terminal, waiting for an ACK. If so, you may be able to get things working by manually sending your ACK with ^F. This is untested, and highly speculative. Start your wiki surfing with Enquiry_character.

      At this point all ideas help - speculative or untested or not.

      However, I wasn't as clear as I could have been in my original post. The terminal does not hang. It just stops displaying any output and returns immediately to the prompt. So there is no opportunity to issue a manual ^F of any sort. I've updated the OP to make it clearer.

        Do you get the same problem with bet (0x05D1)? How about leftwards double arrow (0x21D0, where 0x21 is ! in ASCII)? If it's a single character issue, exploration of the parameter space might help out.
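A small probe script makes this kind of exploration less tedious. The sketch below is illustrative only: `probe_lines` is a hypothetical helper, the default code points are the ones mentioned in this reply, and other code points can be passed as hex arguments on the command line.

```perl
use strict;
use warnings;

# Build one labelled line per code point. The label is plain ASCII, so it
# is easy to tell which line was the last one to reach the screen.
# probe_lines is a hypothetical helper name used for illustration only.
sub probe_lines {
    my @codepoints = @_;
    return map { sprintf "U+%04X: %s", $_, chr $_ } @codepoints;
}

# Code points come from the command line as hex (e.g. 05D0 05D1),
# defaulting to the ones discussed in this reply.
my @codepoints = @ARGV ? map { hex } @ARGV : (0x05D0, 0x05D1, 0x21D0);

binmode(STDERR, ':encoding(UTF-8)');
for my $line (probe_lines(@codepoints)) {
    print STDERR "$line\n";
    print STDERR "  ...still visible after the line above\n";
}
```

Running it once per code point (one hex argument at a time) keeps an earlier character from hiding the result for a later one.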
Re: Printing the first letter of the Hebrew alphabet (U05D0) kills script?
by BrowserUk (Pope) on Mar 08, 2011 at 17:37 UTC

    More pure speculation, but what happens if you redirect the output to a file rather than letting it go to the screen?

    My thought is that this might isolate whether the problem occurs within perl, or within the screen driver.

    You might also try knocking up a C program to output the same bytes and see if that has the same effect on the screen driver.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      If the output is sent to a file, then it is all there. Same as when I run the script in an xemacs shell (see above). It definitely seems to be something related to xterm and not Perl. I'm beginning to think that xterm thinks certain byte sequences are meant to be terminal commands (see my latest reply to kennethk). Maybe it is a pre-Unicode days "feature" with unintended consequences in a multi-byte character world?

      I very much like the idea of confirming the xterm behavior with a short C program. Good lateral thinking. Thanks!
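The same check can be approximated without leaving Perl by writing the two bytes directly, with no encoding layer involved. This is a sketch of the idea rather than a substitute for the C test; `aleph_bytes` is a hypothetical helper name.

```perl
use strict;
use warnings;

# The UTF-8 encoding of U+05D0 as a plain two-byte string. No character
# semantics are involved, so Perl's encoding layers play no part.
# aleph_bytes is a hypothetical helper name used for illustration only.
sub aleph_bytes { return "\xD7\x90" }

binmode(STDOUT);    # raw bytes, no :utf8 or :encoding layer
print "before aleph\n";
print aleph_bytes(), "\n";
print "after aleph\n";
```

If "after aleph" disappears on screen but is present when the output is redirected to a file, the bytes themselves are fine and the terminal is what is swallowing them.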

        I'm on completely unknown (to me) ground here now, but do xterms retain the old-fashioned serial port configuration parameters?

        What I'm getting at is that it used to be possible to configure terminals for 7-bit or 8-bit; odd/even/no parity etc.

        If you sent unicode to a terminal that was configured to expect 7-bit input, it might strip the 8th bit. And byte value \220 (decimal 144) suddenly becomes ascii 16 which is right in amongst the device control characters often used for X-on/X-off and similar. (It's not one of those two, but who knows what others there were back in the day?)
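The arithmetic behind that guess is easy to check. The sketch below simply clears the high bit of each of the two bytes shown in the Devel::Peek dump (\327 and \220); `strip_high_bit` is a hypothetical helper name, and nothing here shows that xterm actually does this.

```perl
use strict;
use warnings;

# Clear the top bit of a byte, as a channel configured for 7 bits would.
# strip_high_bit is a hypothetical helper name used for illustration only.
sub strip_high_bit {
    my ($byte) = @_;
    return $byte & 0x7F;
}

# The two UTF-8 bytes of U+05D0, written in octal as Devel::Peek shows them.
for my $byte (0327, 0220) {
    printf "0%03o (0x%02X, decimal %3d) -> 0x%02X (decimal %2d)\n",
        $byte, $byte, $byte, strip_high_bit($byte), strip_high_bit($byte);
}
```

The second byte does come out as decimal 16, which sits among the ASCII control characters, while the first becomes an ordinary printable character.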


Re: Printing the first letter of the Hebrew alphabet (U05D0) kills script?
by ikegami (Pope) on Mar 08, 2011 at 18:30 UTC

    Can't reproduce with my own build (default config) of Perl 5.10.0 on a Debian machine.

    $ perl -v
    This is perl, v5.10.0 built for i686-linux
    ...
    $ xterm -v
    XTerm(261)

    SV = PV(0x8ea9040) at 0x8ebaf50
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x8eb6de8 "\327\220"\0 [UTF8 "\x{5d0}"]
      CUR = 2
      LEN = 4
    layers for STDERR: unix perlio
    Wide character in print at a.pl line 11.
    א
    layers for STDERR: unix perlio utf8
    א
    I survived :-) !!!
    I really did. I really did.
    I'm dying :-(
    Ah...dead

    "א" represents some squiggle.

    Some bug in xterm?

      What happens when you run the script with wide-characters explicitly turned off, i.e. in an xterm launched with xterm -u8 +wc? For me, wide character xterm windows show all output. It could be that you aren't seeing the same results because your system is configured by default to have wide characters turned on and you need to explicitly turn it off to get the results I'm getting. I'm only seeing STDOUT and STDERR disappear when wide characters are turned off (see sample output in OP).

      It could be a bug, but I'm beginning to think that it may in fact be a "feature" left over from the pre-unicode days of the computing world. xterm, at least in version 235, seems to think certain byte sequences are escape sequences meant to control the terminal. See my latest reply to kennethk.

        What happens when you run the script with wide-characters explicitly turned off, i.e. in an xterm launched with xterm -u8 +wc?

        No difference whatsoever.

        seems to think certain byte sequences are escape sequences meant to control the terminal.

        That may be.

        The thing is, those normally start with ESCape (^[). UTF-8 doesn't produce anything that contains ESC except for ESC itself. Other control characters respected by terminals are also found in the ASCII range and thus not produced by UTF-8.

        I don't know much about terminals, and less about xterm. I didn't even have xterm installed until this came up.

        You mentioned something about an "Xemacs shell". Is that a variable that can be eliminated?

Re: Printing the first letter of the Hebrew alphabet (U05D0) kills script?
by Eliya (Vicar) on Mar 08, 2011 at 22:14 UTC

    You could also try the -en (encoding) option:

    $ xterm -en UTF-8

    Also, what are your locale settings?

    xterm's command line options u8, wc, lc, en and its X resources utf8, locale, wideChars are interdependent in various ways, and some combinations depend on the locale settings (see the xterm man page for details), so it might well be that -u8 doesn't have the expected effect in your specific environment...

    P.S. I can reproduce your problem with xterm v236 (SUSE 11.1 system) when I run it without any options.  Virtually all other sensible combinations of the above mentioned options, however, either work fine (i.e. proper glyph is being displayed), or show the Latin-1 replacement 'x', but without aborting further output.

    Interestingly, I cannot replicate your problem when I use xterm v235 (the debian lenny build) on my SUSE system (I currently don't have a debian system within reach).  In some cases it says "Warning: couldn't find charset checkfont; using ISO 8859-1", in which case I get the 'x' replacement, but if I specify -en UTF-8 everything works fine.

    BTW, see also luit.

      ++

      xterm -en UTF-8 made everything work. No lost/hidden output and I even see the Hebrew glyphs. Yeah! Now I not only have an explanation for the weird behavior, but a way to get everything to work just as I want it. -wc was giving me the output, but not the glyphs.

      It has been a good day. Thank-you.

      Update: I thought I'd add a couple of notes on configuring xterm so that one need not type xterm -en UTF-8 every time one starts a shell.

      Each flavor of Linux seems to have its own locations for XTerm configuration files, and figuring out the ones that were right for my system took some searching. Web pages are also a bit confusing on this matter because xterm appears to have undergone some development. -u8 is part of an older way of managing utf8 and is not well integrated into the current way xterm handles encoding issues. Newer versions of xterm use -en on the command line and locale in a configuration file.

      For Debian (Lenny) the important facts are:

      • machine/site-wide configurations are in /etc/X11/app-defaults/XTerm
      • personal xterm configurations are in ~/.Xdefaults Note: some webpages say the personal configuration file is ~/.Xresources. Ignore them if you are using Debian. For non-Debian systems YMMV. You may be able to figure out what your own system requires by checking the end of the man page for xterm that ships with your system.
      • The following line needs to be added to either the site or personal configuration file: XTerm*locale: UTF-8. That one line is equivalent to -en UTF-8 on the command line.
      • By default, xterm assumes that any input to the terminal via keyboard or via program output will be UTF-8 characters. If that is the case, one need not set LANG, LANGUAGE, LC_ALL or LC_CTYPE to make xterm happy (other applications may need them, just not xterm).
      • If characters are represented as something other than UTF-8, then one must set LC_CTYPE to the encoding used by the keyboard/program output to the terminal. xterm uses the value of this variable to help it process non UTF-8 input.
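To see what the environment is actually telling xterm and other programs, something like the following sketch can help. It only reports settings and changes nothing; `locale_env` is a hypothetical helper name, and the reported codeset depends on which locales are installed on the system.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

my @names = qw(LANG LANGUAGE LC_ALL LC_CTYPE);

# Collect the locale-related environment variables named in the notes above.
# locale_env is a hypothetical helper name used for illustration only.
sub locale_env {
    my %env;
    for my $name (@names) {
        $env{$name} = defined $ENV{$name} ? $ENV{$name} : '(unset)';
    }
    return %env;
}

my %env = locale_env();
printf "%-9s %s\n", $_, $env{$_} for @names;

# Adopt the environment's locale, then ask which character set it implies.
setlocale(LC_CTYPE, '');
printf "%-9s %s\n", 'codeset', langinfo(CODESET);
```

A codeset other than UTF-8 here would be a hint that LC_CTYPE (or LANG) is steering programs toward a different encoding than the one the script emits.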
Re: Printing the first letter of the Hebrew alphabet (U05D0) kills script?
by ELISHEVA (Prior) on Mar 08, 2011 at 22:37 UTC

    Well, it looks like I'm beginning to piece together an explanation. Since all three replies (ikegami, BrowserUk and kennethk) are converging in the same direction, I'm going to summarize what I know so far in a new comment.

    1. The strange behavior seems to be a result of xterm treating certain byte sequences as terminal control sequences (see Re^4: Printing the first letter of the Hebrew alphabet (U05D0) kills script? for details), but none of us are sure what they are because, at first glance, the character sequence doesn't fit the normal 7-bit escape sequences that begin with ESC [.

    2. However, some terminals support 8-bit control characters as an alternative to ESC [ (see http://rtfm.etla.org/xterm/ctlseq.html).

    3. It so happens that one of those 8-bit control characters is 0x90 (Device Control String). It also so happens that 0x5D0 has a byte representation of 0xd7 0x90. Perhaps xterm is seeing the 0x90 and, instead of recognizing it as the second byte in a multibyte character, understands it as the first byte in a device-specific control string? As a result, all of the output from Perl gets interpreted as some sort of device command until the next 8-bit control character shows up. That would explain why 0x05D0 (d7 90) stops output and a subsequent 0x05D1 (d7 91) or 0x05D2 (d7 92) resumes it. The 8-bit control characters fall in the range of 0x84-0x9f.

    4. In theory this shouldn't be happening on a utf8 terminal (xterm -u8). Xterm should know not to pluck 8-bit control characters from the middle of multibyte unicode characters. That makes me think that maybe what I'm seeing on xterm, version 235 is either (a) a bug in unicode parsing or (b) a bug in xterm's validation of configuration that allows two incompatible properties to exist (utf8 and 8-bit control sequence indicators). Interestingly, ikegami ran my test script on a later version of xterm and could not reproduce the strange behavior. This is suggestive of a bug that was found and fixed. But it could also mean that we simply have different xterm configurations.
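The byte-level part of this explanation can be checked directly. The sketch below encodes a few Hebrew code points as UTF-8 and flags any byte that falls in the 8-bit (C1) control range 0x80-0x9F; `utf8_bytes` and `is_c1` are hypothetical helper names, and the script says nothing about what xterm actually does with those bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 bytes of a code point as a list of integers.
# utf8_bytes and is_c1 are hypothetical helper names for illustration.
sub utf8_bytes {
    my ($codepoint) = @_;
    return unpack 'C*', encode('UTF-8', chr $codepoint);
}

# True if a byte falls in the 8-bit (C1) control range 0x80-0x9F.
sub is_c1 {
    my ($byte) = @_;
    return $byte >= 0x80 && $byte <= 0x9F;
}

for my $codepoint (0x05D0 .. 0x05D2) {
    my @bytes = utf8_bytes($codepoint);
    my @c1    = grep { is_c1($_) } @bytes;
    printf "U+%04X -> %s  C1-range bytes: %s\n",
        $codepoint,
        join(' ', map { sprintf '%02x', $_ } @bytes),
        @c1 ? join(' ', map { sprintf '0x%02x', $_ } @c1) : 'none';
}
```

For U+05D0 this reports the bytes d7 90, with 0x90 falling in the control range. A terminal that decodes UTF-8 correctly never treats such a continuation byte as a control character on its own, which is why the behavior looks like a bug or a configuration conflict.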

    There are still details to iron out, in particular explaining the specific behavior I noted for each key combination, but I'm fairly satisfied that this is in the right ballpark and relieved that this is likely a temporary version-specific problem and not a fundamental fact of life about Hebrew unicode and xterm.

    I'd like to point out that every single piece of this was in some way suggested by one of the three people responding to this thread. To kennethk I owe thanks for making me look more closely at the behavior of other codepoints in the same vicinity as 0x5D0. BrowserUk put the final nail in xterm's coffin by giving me yet another way to prove that the symptoms were linked to the destination of the output and not its generation within Perl. His comment about terminal parity got me looking more closely at what happens when you look at the pieces of a multibyte character. ikegami's testing on a later version of xterm made it clear that at least one later version of xterm managed to be well behaved even when wide character mode was off. Therefore any bad behavior was fairly viewed as a bug rather than a necessary evil.

    What I really like about this thread is the way we've all been speculating and yet that speculation has led to a proposed explanation.

    Update: While I was writing my reply here, ikegami was coming to similar conclusions. See Re^5: Printing the first letter of the Hebrew alphabet (U05D0) kills script?.

Re: Printing the first letter of the Hebrew alphabet (U05D0) kills script?
by LanX (Canon) on Mar 08, 2011 at 23:30 UTC
    > Here in Israel, the first letter of the Hebrew alphabet (aleph) isn't exactly an exotic character.

    Neither for mathematicians. :)

    see aleph-null

    Cheers Rolf

      <sarcasm>Except א (codepoint 0x05D0) and ℵ (codepoint 0x2135) are totally different. When did you last visit your neighborhood ophthalmologist?</sarcasm>

      Update: Added omitted sarcasm tags. Though I discovered some interesting browser rendering behavior after typing the characters &#1488; (0x05D0)

        What?

        Update:

        I see ... you're seriously thinking that mathematicians are restricted to unicode ranges deriving from old LaTeX fonts like "symbol"...

        Cheers Rolf

Node Type: perlquestion [id://892034]
Approved by Corion