Re^7: Standard handles inherited from a utf-8 enabled shell

by repellent (Priest)
on Mar 22, 2012 at 04:41 UTC ( #960923=note )


in reply to Re^6: Standard handles inherited from a utf-8 enabled shell
in thread Standard handles inherited from a utf-8 enabled shell

That's not how I see it. I see the system-ed perl as an autonomous process (unknowing of its parent process) with its STDOUT filehandle set with different encodings.

In both cases, we're printing out a string with one character at codepoint U+00FF.

The second system-ed perl has its output encoding set to UTF-8 (via -CO). What octets do we send out into the cruel world for the U+00FF character encoded in UTF-8? Ans: c3 bf.

The first system-ed perl has its output "set" to byte/Latin-1 encoding (the default). What octets do we send out into the cruel world for the U+00FF character encoded in Latin-1? Ans: ff.

The first case did not print c3 bf merely because the parent perl was run with -CO: the print in the system-ed perl did not go through the parent's PerlIO layers.
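
The two answers above can be checked without spawning anything. This is only a minimal sketch using the core Encode module; the variable names are illustrative and not taken from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character whose ordinal is 0xFF.
my $char = chr(0xFF);

# Encode the same character two ways and show the resulting octets in hex.
my $latin1_hex = unpack 'H*', encode('ISO-8859-1', $char);
my $utf8_hex   = unpack 'H*', encode('UTF-8', $char);

print "Latin-1: $latin1_hex\n";   # ff
print "UTF-8:   $utf8_hex\n";     # c3bf
```

The character is the same in both calls; only the encoding chosen at output time changes the octets.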


Re^8: Standard handles inherited from a utf-8 enabled shell
by BrowserUk (Pope) on Mar 22, 2012 at 05:59 UTC
    I see the system-ed perl as an autonomous process (unknowing of its parent process)

    As you are probably aware, system is equivalent to fork followed by exec.

    You are also probably aware that fork preserves open file descriptors. This is why it is necessary to fork twice to create a daemon: you fork once and close the standard handles in the child, and then fork a second time. Only then is the second child dissociated from the terminal and a true daemon.

    What you may not be familiar with is that (the various forms of) exec are front ends for execve, and that execve() also preserves open file descriptors (except those marked close-on-exec).

    To quote the above man page: By default, file descriptors remain open across an execve().
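
The inheritance described above can be sketched in Perl itself. This assumes a Unix-like system with a real fork; capture_child_stdout is a hypothetical helper name, not something from the thread. The child points its file descriptor 1 at a pipe before exec, and the exec-ed perl then writes to that same descriptor.

```perl
use strict;
use warnings;

# Run a command via fork + exec with fd 1 pointed at a pipe, and return
# whatever the exec-ed program wrote to its inherited STDOUT.
sub capture_child_stdout {
    my @cmd = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: make fd 1 refer to the pipe, then replace the process image.
        close $reader;
        open(STDOUT, '>&', $writer) or die "dup failed: $!";
        exec { $cmd[0] } @cmd or die "exec failed: $!";
    }

    # Parent: read what the exec-ed program printed on fd 1.
    close $writer;
    my $output = do { local $/; <$reader> };
    close $reader;
    waitpid($pid, 0);
    return $output;
}

my $got = capture_child_stdout($^X, '-e', 'print 123');
print "child wrote: $got\n";
```

The exec-ed perl never opens the pipe itself; it simply finds fd 1 already open, which is the point being made about execve().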

    You can prove this to yourself. Run this one-liner (suitably adjusted):

    perl -e"system qq[ $^X -e\"\$n=123; print \$n\" ];"
    123

    And you'll see the output 123

    Now try this modified version:

    C:\test>perl -e"close STDOUT; system qq[ $^X -e\"\$n=123; print \$n\" ];"

    Where did the output disappear to?

    So bang goes the autonomous process theory.

    In both cases, we're printing out a string with one character at codepoint U+00FF.

    No. The return value from pack 'B8', ... is not a character, nor a codepoint, and has absolutely nothing to do with Unicode.

    It is a byte! An 8-bit bit pattern stored in an 8-bit unit of memory and nothing else.

    No interpretation of the meaning (nor even signedness) is placed (nor could be) upon that value until you do something with it!
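
As an illustration of how the same packed value can be read in several ways, here is a small sketch using only core pack, unpack and ord. The variable names are illustrative.

```perl
use strict;
use warnings;

# The value under discussion: eight set bits packed into one element.
my $packed = pack 'B8', '11111111';

my $unsigned = unpack 'C',  $packed;   # read as an unsigned 8-bit integer
my $signed   = unpack 'c',  $packed;   # read as a signed 8-bit integer
my $bits     = unpack 'B8', $packed;   # read as a string of bit flags
my $ordinal  = ord $packed;            # read as a character ordinal

print "unsigned: $unsigned\n";   # 255
print "signed:   $signed\n";     # -1
print "bits:     $bits\n";       # 11111111
print "ordinal:  $ordinal\n";    # 255
```

Each line applies a different reading to the same stored bits; none of them changes what pack returned.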

    The second system-ed perl has its output set ...

    You're right that the interpretation applied to the 8-bit value is not preserved across the fork/exec pair, but not because of your reasoning.

    The important part is that the OS cannot preserve what it has no knowledge of. There is no concept of encoding attached to the file descriptors.
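
A rough way to see this is to give the parent's STDOUT a UTF-8 layer and let a child write to the same descriptor. The sketch below assumes a Unix-like system, and that no -C switch or PERL_UNICODE setting affects the child. The temp file stands in for the terminal.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Save the real STDOUT, then point fd 1 at the temp file with a UTF-8 layer.
open(my $saved, '>&', \*STDOUT) or die "dup failed: $!";
open(STDOUT, '>:encoding(UTF-8)', $path) or die "reopen failed: $!";

# The child inherits fd 1 (the file) but not the parent's PerlIO layer.
system($^X, '-e', 'print chr(0xFF)') == 0 or die "system failed: $?";

# Restore the original STDOUT.
close STDOUT;
open(STDOUT, '>&', $saved) or die "restore failed: $!";

# Read back the raw octets the child wrote.
open(my $in, '<:raw', $path) or die "read failed: $!";
my $octets = do { local $/; <$in> };
close $in;

my $hex = unpack 'H*', $octets;
print "child wrote octets: $hex\n";
```

Under those assumptions the file holds the single octet ff, not c3 bf: the layer lived in the parent's perl, not in the descriptor the child inherited.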

    It is also likely, though I haven't confirmed this, that Perl reopens the standard handles when it starts.

    The bottom line -- for this thread, rather than this subthread -- is that the OP must have omitted some details from his scenario.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      You mentioned (emphasis mine):

        No interpretation of the meaning (nor even signedness) is placed (nor could be) upon that number until you do something with it!

      I agree with this, but I believe we have different assumptions on what is meant by interpretation. Look, I need a way to refer to that number, because that is fundamental. I call that number a "character". The value of that number is what I call the "codepoint value". Bear with me: forget "Unicode" for now, and grant me the use of those words. At any time, you may s/character|codepoint/_that_number_/gi.

      Before that sentence, you mentioned:

        It is a byte! An 8-bit bit pattern stored in an 8-bit unit of memory and nothing else.

      Well, that number is 255 == ord(pack 'B8', '11111111'). Saying it's a (single) byte means you've established the number of bits for it is 8. That, to me, is giving the number an interpretation(*). This observation is very important when it comes to the subject of encoding, especially when we're to print that character (i.e. that number).

      If you want to print a string, you should avoid any preconceived notion of how many bits the string "has" prior to deciding which encoding to use. I find thinking in terms of characters (i.e. those numbers) and what their codepoint values (i.e. the number values) are, helps tremendously in my handling of strings up to the point where they are encoded using print. That is my thought process, and the message I was trying to deliver.

      (*) I am aware of the details of how perl stores that number in memory, though I am not as well versed in them as you are. I would like to reiterate that this discussion is about print and encoding, and that the ordinal of the character is what matters here.

        The important part is that the OS cannot preserve what it has no knowledge of.

      Agreed.

        There is no concept of encoding attached to the file descriptors.

      And that's the thing: the concept of encoding alone does not make sense without the concept of characters (what we're encoding). And those characters can only exist within the process (e.g. numbers in Perl's "string"). Our computer "systems" (e.g. web browser, text editor, terminal, program, etc.) do this decode-incoming-octets-then-output-octets-already-encoded dance between each other to hand off characters.

      When Perl warns you about "Wide character in print", what it's really saying is: Please be explicit about the encoding so that I can tell the next "system" about my characters accurately, using only octets.
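
A minimal sketch of that warning, assuming no -C switch, PERL_UNICODE setting or "use open" pragma is in effect. It prints the same character to two in-memory handles, one with no encoding layer and one with an explicit layer, and collects any warnings. The handle and variable names are illustrative.

```perl
use strict;
use warnings;

my $wide = "\x{263A}";   # a character whose ordinal does not fit in one octet
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Handle with no encoding layer: perl has to guess, and warns.
my $raw = '';
open(my $plain, '>', \$raw) or die "open failed: $!";
print {$plain} $wide;
close $plain;

# Same print with an explicit layer: no guessing, no warning.
my $encoded = '';
open(my $utf8, '>:encoding(UTF-8)', \$encoded) or die "open failed: $!";
print {$utf8} $wide;
close $utf8;

print "warnings: ", scalar(@warnings), "\n";
print "octets:   ", unpack('H*', $encoded), "\n";
```

Only the first print should trigger "Wide character in print"; the second handle was told explicitly how to turn the character into octets.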

        The bottom line -- for this thread, rather than this subthread -- is that the OP must have omitted some details from his scenario.

      Agreed.

        Well, that number is 255. Saying it's a (single) byte means you've established the number of bits for it is 8.

        No. You've got that backward. At the point the value is returned from pack, it isn't even a number. It is just 8 bits.

        They could represent anything, including 8 physically grouped but otherwise unrelated discrete boolean values -- the current on/offness of the headlights, sidelight and tail-lights on a car; yes/no answers to a survey.

        Referring to (not interpreting as) that bit pattern using 255/0377/xff is just easier than 0b11111111.

        That, to me, is giving the number an interpretation. This observation is very important when it comes to the subject of encoding, especially when we're to print that character (i.e. that number).

        Sorry, but you are assuming that the 8 bits represent something to do with "strings & characters and codepoints". It could just as well be 1 byte of a 4 or 8 byte memory address; or part of an IP address; or a sound level ...


        The point of my asking the question was to make sense of the description given by the OP (of the other thread). I knew that I couldn't replicate his apparent scenario on my system, but I am not familiar with the workings of Unicode on *nix.

        It was conceivable to me that, when running on a "unicode enabled terminal", there might be some default interpretation of the byte values printed to that terminal that might be inherited by processes spawned from that terminal.

        I am informed that there isn't!

        But it was vaguely conceivable that there might be. And that might have been an explanation for the OP's apparent problem.


