http://www.perlmonks.org?node_id=330500

Gorby has asked for the wisdom of the Perl Monks concerning the following question:

Hello Wise Monks.

What does it mean when someone says send me the actual "bytes" through the socket. I only know how to send ascii characters through sockets. How do I send bytes?

Thanks in advance.

Gorby

Edited by BazB: changed title from "bytes".

Replies are listed 'Best First'.
FMTYEWTK about Characters vs Bytes
by halley (Prior) on Feb 20, 2004 at 15:49 UTC

    Characters are NOT the same as bytes.

    The term character is a logical term (meaning it defines something in terms of the way people think of things). The term byte is a device term (meaning it defines something in terms of the way the hardware was designed). The difference is in the encoding.

    Encoding

    The character 'A' must be encoded into a certain bit pattern that the machine can use. This is a mostly arbitrary decision on the part of hardware implementors. You could say that A=1, B=2, and so on. For ASCII (the American Standard Code for Information Interchange), designers chose a 7-bit encoding (A=1000001), giving room to encode 128 different characters in those seven bits.

    Since a byte is eight bits on most hardware today, the hardware just pads the extra bit with a zero (A=01000001). So even in the case of simple ASCII, characters are not the same thing as bytes.
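
    In Perl you can see that encoding directly; a quick sketch, assuming an ASCII machine:

        my $char = 'A';
        printf "%d\n",   ord($char);    # 65, the character's number
        printf "%07b\n", ord($char);    # 1000001, the seven significant bits
        printf "%08b\n", ord($char);    # 01000001, padded out to a full byte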

    ASCII reserves the first 32 encodings (plus DEL at 127) for control characters like TAB, NEWLINE, and FORMFEED; the remaining encodings are for printing characters like SPACE, !, @, #, $, A-Z, a-z, 0-9, and so on.

    Beyond ASCII

    What can the eighth bit be used for? It gives 128 more character encoding numbers. Each manufacturer has used the other 128 encodings above ASCII (values 128-255) for special printing characters, such as GRAY-BLOCK, N-WITH-TILDE, and YEN. They've been pretty horrible about consistency, though: IBM's GRAY-BLOCK character is not the same number as Commodore PET's GRAY-BLOCK character. Some computers have a YEN symbol while others don't.

    ASCII is not the only way to encode characters into bit patterns. EBCDIC was a popular eight-bit encoding scheme used on some mainframes. If you transferred a file from one machine to another, you would have to convert each encoded character according to a look-up table, so A=193 (EBCDIC) would end up in the new file as A=65 (ASCII), or vice versa. Characters that the two encodings did not have in common would have to be dropped or assigned a replacement value, essentially destroying information.
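
    Perl's Encode module ships tables for several EBCDIC code pages, so you can watch that look-up happen; a small sketch, assuming the 'cp1047' table is available:

        use Encode qw(encode);

        my $ebcdic_A = encode('cp1047', 'A');        # look up 'A' in an EBCDIC table
        printf "EBCDIC 'A' = %d\n", ord($ebcdic_A);  # 193
        printf "ASCII  'A' = %d\n", ord('A');        # 65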

    There are special-purpose encodings for certain applications. On the DEC PDP-11, which usually interacted with the user in ASCII, another encoding called RADIX-50 (often written RAD50) was a common way to pack three letters and digits into two bytes. It had room for almost no punctuation or special characters, so it was suited only to specific tasks. FORTRAN compilers often used it to pack six-letter variable names into less space. It also allowed "six-dot-three" filenames to fit in six-byte fields on the storage devices. "Eight-dot-three" filenames came later.

    Windows and MS-DOS offered code pages to help international users fit their most important characters into one-byte encodings. If you were using a Russian code page, a byte value of 136 meant one thing, but it meant something entirely different under the Netherlands' code page. Web pages use a similar scheme (the declared charset) to support different character sets like ISO-8859-1.
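
    The same byte really does decode to different characters under different code pages. A sketch using Encode (the cp866 and cp850 tables are assumed to be installed; the exact glyphs are only illustrative):

        use Encode qw(decode);
        binmode STDOUT, ':encoding(UTF-8)';             # so the decoded characters print cleanly

        my $byte = chr(136);                            # the raw byte 0x88
        printf "cp866: %s\n", decode('cp866', $byte);   # a Cyrillic letter
        printf "cp850: %s\n", decode('cp850', $byte);   # an accented Latin letter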

    The problem of defining content on a certain character set gets worse in a global community. Unicode was designed to replace all that swapping around of character sets and make one canonical encoding. Every known character in every language would get its own permanent number.

    Unicode

    Unicode has several tens of thousands of characters, from Hebrew to Arabic to Kanji to Cyrillic. It has more control characters to deal with the minimum text positioning requirements of various languages, such as the doubled layering of Japanese Kana over Kanji which helps readers with uncommon words.

    The numbers run well past 65,000 (the code space goes up to 0x10FFFF, more than a million possible values). You clearly can't encode those into single bytes anymore. Even if you no longer have the which-language-is-supported character set problem, you still haven't gotten rid of the encoding problem. You never will: characters are human concepts and bytes are device concepts.

    The brute force encoding would just take, say, four bytes for every character. You still would have to agree on whether the least-significant byte is written first or last, but there's room for every conceivable current and future Unicode character number. Oh, but it wastes a LOT of space, since the vast majority of commerce can still just fit in the ASCII range. Just saying hello takes twenty bytes.
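
    That brute-force scheme is essentially what UTF-32 is. A sketch with Encode (UTF-32LE and UTF-32BE are Encode's names for the two byte orders, which still have to be agreed on):

        use Encode qw(encode);

        print length(encode('UTF-32LE', 'hello')), "\n";  # 20 bytes for five characters
        printf "%v02X\n", encode('UTF-32LE', 'h');        # 68.00.00.00 (least-significant byte first)
        printf "%v02X\n", encode('UTF-32BE', 'h');        # 00.00.00.68 (most-significant byte first)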

    To optimize the storage, a few specialized encodings have come up. The Unicode character numbers don't change, but the way the numbers are packed into bytes and bits changes. Instead of four bytes per character, which was really excessive, let's do two bytes, and if it happens to be one of those rare numbers that won't fit in two bytes, use some reserved bit patterns (the surrogates) to indicate that the rest of the number follows in a second two-byte unit. It's getting pretty complicated to encode or decode strings, but it saves a lot of space.
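
    That two-bytes-with-an-escape-hatch scheme is essentially UTF-16. A sketch with Encode (the character choices are only examples):

        use Encode qw(encode);

        print length(encode('UTF-16BE', 'hello')),     "\n";  # 10 bytes, two per character
        print length(encode('UTF-16BE', "\x{1D11E}")), "\n";  # 4 bytes: one surrogate pair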

    UTF-8

    Currently, the most popular Unicode encoding is even tighter. It's called UTF-8. It allows any all-ASCII content to remain in the old one-byte-per-character encoding, completely unchanged. That's popular and efficient for all that ASCII content. The eighth (high) bit doesn't offer you 128 more characters; it tells the decoder that this is a non-ASCII character and that more bits follow in the next byte. Or the third byte. Or the fourth byte (the original UTF-8 design allowed sequences of up to six bytes, though the standard now caps them at four), just to grab a single Unicode character with a high number.
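
    You can watch the width grow as the character numbers climb; a small sketch, where the characters chosen are arbitrary examples:

        use Encode qw(encode);

        for my $char ("A", "\x{00F1}", "\x{65E5}") {   # 'A', n-with-tilde, a Kanji character
            my $bytes = encode('UTF-8', $char);
            printf "U+%04X -> %d byte(s): %s\n", ord($char), length($bytes),
                   join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
        }
        # U+0041 -> 1 byte(s): 41
        # U+00F1 -> 2 byte(s): C3 B1
        # U+65E5 -> 3 byte(s): E6 97 A5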

    UTF-8 takes advantage of the fact that Unicode associates the more common world characters with lower numbers, so UTF-8 requires less space to encode those character numbers. Rare or specialized characters from scripts like Klingon, Feanorian (Tolkien Elvish) Tengwar, Hieroglyphics, or Cuneiform will take more space to store.

    Beyond UTF-8

    There will still be application-specific encodings, even using Unicode's numbering scheme. Just as RADIX-50 was designed to store FORTRAN variable names and filenames in less space, an encoding called Punycode can encode just about any Unicode string using only the few byte values allowed in a domain name label. The domain registrar may see "xn--egbpdaj6bu4bxfgehfvwxn.com" (with IDNA's "xn--" prefix in front of the Punycode label), but a Punycode-aware decoder in a modern web browser would decode that label and render a domain name spelled with flowing Egyptian Arabic characters.
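
    If you want to poke at the raw transform from Perl, the CPAN module Net::IDN::Punycode (an assumption here; it is not in the core distribution) exposes it directly:

        use Net::IDN::Punycode qw(encode_punycode decode_punycode);
        binmode STDOUT, ':encoding(UTF-8)';

        my $label = decode_punycode('egbpdaj6bu4bxfgehfvwxn');  # the label from the domain above
        print "$label\n";                                       # the Arabic original
        print encode_punycode($label), "\n";                    # round-trips back to egbpdaj6bu4bxfgehfvwxn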

    --
    [ e d @ h a l l e y . c c ]

      Nice piece halley++. Clear, informative, and easy to read. Not quite FMTYEWTK status though, I finished and immediately wished for a next page link :). Might you consider a full blown tutorial that continues in the same style, moving into usage and implementation details in Perl?
      Indeed... but back to the original question: the socket does not care whether your program is treating the data as a bit vector or operating on a higher-level representation of it, i.e. ASCII or any other symbol/character set...

      warning: munging binary data for extended periods may cause severe headaches and aggressive, anti-social behaviour when agitated

      you are all a bunch of idiots...

Re: What is the difference between sending bytes and characters over sockets?
by PodMaster (Abbot) on Feb 20, 2004 at 11:04 UTC
    Characters are made up of bytes (one or more, depending on the encoding; ASCII uses one byte per character), but they're both made up of bits :)
        warn unpack 'b*', 'a';
        warn pack 'b*', '10000110';
        __END__
        10000110 at - line 1.
        a at - line 2.
    The bits/bytes stuff is basic CS knowledge (you should google for a tutorial).

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: What is the difference between sending bytes and characters over sockets?
by pboin (Deacon) on Feb 20, 2004 at 12:40 UTC

    All characters are bytes. This sentence is full of characters that are bytes.

    The only thing that might be throwing you is that some bytes don't have character representations. Go search on 'ASCII table', and you'll see that out of the 256 possible values a byte can hold, some are what you'd think of as 'printable' (for lack of a better word) characters and some are not.
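
    A quick way to eyeball that table from Perl (ASCII assumed; the printable range is bytes 32 through 126):

        for my $n (0 .. 255) {
            printf "%3d => %s\n", $n,
                   ($n >= 32 && $n < 127) ? chr($n) : '(no printable ASCII glyph)';
        }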

    If that doesn't answer your question, you'll need to post more context so we can understand the situation. But essentially: bytes eq characters.

      But essentially: bytes eq characters.

      Except when they don't (like in many non-ASCII character sets). You can't even be sure that UTF-8 will actually be one byte per character (because a set high bit in the lead byte marks the start of a multibyte sequence).
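
      For instance (a tiny sketch; the string is just an example):

          use Encode qw(encode_utf8);

          my $str = "caf\x{00E9}";                 # four characters, the last one non-ASCII
          print length($str), "\n";                # 4 characters
          print length(encode_utf8($str)), "\n";   # 5 bytes once encoded as UTF-8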

      ----
      : () { :|:& };:

      Note: All code is untested, unless otherwise stated

Re: What is the difference between sending bytes and characters over sockets?
by tilly (Archbishop) on Feb 20, 2004 at 22:18 UTC
    You really need to find out what the difference is between what you sent, and what that person expected to see.

    If you are on Windows, you may very well need to use binmode to be sure that your end-of-line doesn't get translated into two characters. You might be sending numbers as text when the other side needs the output of a pack statement (see also unpack) rather than a text representation of the number. There are stranger possibilities. For instance, the program at the other end of the connection might be written in C and might be looking for your string to end with a null byte (you write that as "\0"). Or you could unexpectedly be sending data in UTF-8 format and need to turn that off.
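
    To make those differences concrete, a rough sketch (the socket handle $sock and the wire format are assumptions, not anything stated in the original question):

        # $sock is assumed to be an already-connected socket handle
        binmode($sock);                  # stop "\n" from becoming two characters on Windows
        print $sock pack('N', 1234);     # four raw bytes: a big-endian 32-bit integer
        print $sock '1234';              # four ASCII characters: '1', '2', '3', '4'
        print $sock "hello\0";           # a C-style NUL-terminated string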

    The description makes it unclear how to solve the problem. You need to first get more detail about how what you sent doesn't match what is needed, and then figure out how to solve that.