Jargon relating to Perl strings

by ikegami (Pope)
on Jan 17, 2012 at 03:03 UTC (#948243, perlmeditation)

These are terms I use to refer to certain elements or properties of Perl strings. I'm posting this as a reference, not to advocate them.

Basics

Character, or string element

An element of a string, as in what substr($s,$i,1) returns. It's a 72-bit value in theory, but it's limited to the size of a UV in practice.
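As a rough illustration of this definition (a sketch added for reference, with arbitrary variable names), the elements of a string can be listed with substr and ord, and they are not limited to the range 0..255:

```perl
use strict;
use warnings;

# Three string elements: 0x61, 0xE9 and 0x2660 (the last does not fit in 8 bits).
my $s = "a\x{E9}\x{2660}";

# Collect the numeric value of each element, one substr() call per element.
my @elements = map { ord(substr($s, $_, 1)) } 0 .. length($s) - 1;

printf "%d elements: %s\n", scalar(@elements),
    join(' ', map { sprintf '0x%X', $_ } @elements);
```

Only the numeric values are printed, so no output encoding is involved.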

"Character" is the term used to document Perl functions that work on arbitrary strings (substr, index, reverse, chr, ord, etc) and it's the term used in Wikipedia's definition of "string".

String value

The sequence of string elements in a string (irrespective of the choice of storage format used for that string).
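A minimal sketch of what "irrespective of the storage format" means in practice, assuming nothing beyond the core utf8:: functions (the variable names are made up for the example):

```perl
use strict;
use warnings;

my $downgraded = "\xC9ric";
utf8::downgrade($downgraded);   # stored with UTF8=0

my $upgraded = "\xC9ric";
utf8::upgrade($upgraded);       # stored with UTF8=1

# The storage formats differ...
my $formats_differ = (utf8::is_utf8($downgraded) xor utf8::is_utf8($upgraded));

# ...but the string value does not: eq and length see the same elements.
my $same_value = $downgraded eq $upgraded
    && length($downgraded) == length($upgraded);

print "formats differ: ", ($formats_differ ? "yes" : "no"), "\n";
print "values equal:   ", ($same_value     ? "yes" : "no"), "\n";
```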

String element semantics

Byte

A string element whose value is understood/expected to be in [0, 255] (irrespective of the choice of storage format used for that string).

Code point, or Unicode code point

A string element whose value is understood/expected to be a Unicode code point (irrespective of the choice of storage format used for that string).

String semantics

Bytes, or string of bytes

A string whose value is understood/expected to be a sequence of values in [0, 255] (irrespective of the choice of storage format used for that string).
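One way to test whether a string could be a string of bytes in this sense is to check whether a copy of it can be downgraded. The helper name below is hypothetical; utf8::downgrade with a true second argument returns false instead of dying when an element is above 255.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every element of the string is in [0, 255].
sub is_string_of_bytes {
    my ($s) = @_;                            # a copy, so the caller's string is untouched
    return utf8::downgrade($s, 1) ? 1 : 0;   # second argument: fail quietly
}

my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # UTF8=1 storage, but the element is still 0xE9

for my $s ("\xE9", $upgraded, "\x{2660}") {
    printf "%s\n", is_string_of_bytes($s) ? "bytes" : "not bytes";
}
```

The second string passes the check even though it is stored in the UTF8=1 format, which is the point of the definition.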

Text, or decoded text

A string whose value is understood/expected to be a sequence of Unicode code points (irrespective of the choice of storage format used for that string).
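The usual way to get from a string of bytes to decoded text is the Encode module. The following sketch assumes the bytes are UTF-8; the sample data is made up for the example.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# A string of bytes: C3 89 is the UTF-8 encoding of U+00C9.
my $bytes = "\xC3\x89ric";

# Decoding produces text: one string element per code point.
my $text = decode('UTF-8', $bytes);

printf "bytes: %d elements, text: %d elements\n", length($bytes), length($text);

# Encoding goes the other way and yields the original bytes again.
my $round_trip = encode('UTF-8', $text);
print "round trip ok\n" if $round_trip eq $bytes;
```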

String storage formats

UTF8=0 storage format

The format of the PV in a string whose UTF8 flag is clear (0).

It's unambiguous, but it's quite a mouthful. Some use "byte string", but those who do tend to also use it for what I call "string of bytes".

UTF8=1 storage format

The format of the PV in a string whose UTF8 flag is set (1).

It's unambiguous, but it's quite a mouthful. Some use "character string", but that's incorrect because all strings are made of characters by definition.
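To see the two storage formats side by side without dropping into C, one can compare length (string elements) with bytes::length (bytes in the PV). This is only a sketch for inspection; the bytes module is generally discouraged outside of debugging.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling the "use bytes" pragma

my $s0 = "\xE9";  utf8::downgrade($s0);   # UTF8=0 storage format
my $s1 = "\xE9";  utf8::upgrade($s1);     # UTF8=1 storage format

for my $s ($s0, $s1) {
    printf "UTF8=%d  elements=%d  bytes in PV=%d\n",
        utf8::is_utf8($s) ? 1 : 0,
        length($s),
        bytes::length($s);
}
```

Both strings have one element; only the PV length differs (1 versus 2).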

Other

The Unicode Bug

If changing the internal storage format of a string changes how a piece of code behaves, that code suffers from The Unicode Bug.

For example, the following code suffers from The Unicode Bug.

use feature qw( say );

use Inline C => <<'__EOI__';

   STRLEN mylength(SV* sv) {
      STRLEN len;
      (void)SvPV(sv, len);
      return len;
   }

__EOI__

$x="\xE9"; utf8::downgrade($x);
$y="\xE9"; utf8::upgrade($y);

say $x eq $y ? "equal" : "not equal";  # equal
say mylength($x);  # 1
say mylength($y);  # 2
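The bug is not limited to XS code. As a sketch in pure Perl, and assuming the unicode_strings feature is not enabled (it is off unless requested, for example via "use v5.12" or later), uc and \w can treat the same string value differently depending on its storage format:

```perl
use strict;
use warnings;

my $x = "\xE9";  utf8::downgrade($x);   # UTF8=0 storage format
my $y = "\xE9";  utf8::upgrade($y);     # UTF8=1 storage format

# Same string value...
my $same_value = $x eq $y;

# ...yet these built-ins behave differently for the two storage formats.
my $uc_agrees = uc($x) eq uc($y);
my $w_agrees  = ($x =~ /\w/ ? 1 : 0) == ($y =~ /\w/ ? 1 : 0);

printf "same value: %s, uc agrees: %s, \\w agrees: %s\n",
    map { $_ ? "yes" : "no" } $same_value, $uc_agrees, $w_agrees;
```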

Other related terms I've seen used

Byte string

This usually refers to the UTF8=0 storage format, but it could also refer to a string of bytes.

Character string

This usually refers to the UTF8=1 storage format. The term is incorrect since all strings are made of characters by definition.

Byte semantics

This usually refers to how code behaves when given a string in the UTF8=0 storage format, in distinction to how it behaves when given a string in the UTF8=1 storage format. Code that makes such a distinction suffers from The Unicode Bug.

Character semantics

This usually refers to how code behaves when given a string in the UTF8=1 storage format, in distinction to how it behaves when given a string in the UTF8=0 storage format. Code that makes such a distinction suffers from The Unicode Bug.
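For Perl's own built-ins, the distinction can be removed with the unicode_strings feature. A minimal sketch, assuming Perl 5.14 or later (the feature covers uc and lc since 5.12 and regex matching since 5.14):

```perl
use strict;
use warnings;
use feature qw( unicode_strings );

my $flag_0 = "\xE9";  utf8::downgrade($flag_0);   # UTF8=0 storage format
my $flag_1 = "\xE9";  utf8::upgrade($flag_1);     # UTF8=1 storage format

# With the feature enabled, both storage formats get the same treatment.
my $uc_0      = uc($flag_0);
my $uc_agrees = $uc_0 eq uc($flag_1);
my $w_agrees  = !!($flag_0 =~ /\w/) eq !!($flag_1 =~ /\w/);

printf "uc agrees: %s, \\w agrees: %s\n",
    $uc_agrees ? "yes" : "no",
    $w_agrees  ? "yes" : "no";
```

The feature is lexically scoped, so code outside its scope (including XS code) can still make the distinction.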

Update: Changed "regardless of the value of its UTF8 flag" to something clearer in response to JavaFan's and wrog's comments.
Update: By request, added end tags for DT, DD and P elements even though they are optional.

Re: Jargon relating to Perl strings
by JavaFan (Canon) on Jan 17, 2012 at 10:53 UTC
    I find your definition of byte confusing, and I think most people use it differently. According to your definition,
    $x = "\xEC"; utf8::upgrade($x);
    now $x consists of a single byte. Even though it requires 16 bits of encoding.

    Perhaps the confusion comes from saying that for your definition of a byte, the UTF8 flag doesn't matter, yet it refers to a string element, which is defined in terms of substr, for which the UTF8 flag *does* matter.

    I'd say that in my example, $x ends up having 2 bytes, but one character. This is also the difference wc makes.

    Of course, you are free to use whatever definition you want -- just do mind that not all people share your definition. Some people prefer not to use the term byte at all, just character and octet.

      now $x consists of a single byte.

      Yes, that's what I call a byte. So maybe it's my definition, not my term that's unclear.

      which is defined in terms of substr, for which the UTF8 flag *does* matter

      Ah, there's the problem. "The UTF8 flag doesn't matter" means different things to us. For a given string, substr will always return the same value regardless of the UTF8 flag, so I say the UTF8 flag doesn't matter to substr.

      my $flag_is_0 = "\xC9ric"; utf8::downgrade($flag_is_0);
      my $flag_is_1 = "\xC9ric"; utf8::upgrade($flag_is_1);
      say substr($flag_is_0, 0, 1) eq substr($flag_is_1, 0, 1) ?1:0;  # 1

      I shall endeavor to find something clearer.

      just do mind that not all people share your definition

      Thus this post. If I refer them to this post, they can understand what I say even if their definitions are different.

        Right, we do mean something different by "the UTF-8 flag doesn't matter". I interpret it as meaning that the only difference between the internal representations of the two strings is whether the UTF-8 flag is set, whereas you use it to mean "it doesn't matter whether the internal encoding is UTF-8 or not".
        use feature qw( say );
        use Devel::Peek;
        my $x = my $y = "\xC9ric";
        utf8::upgrade($x);
        utf8::upgrade($y);
        utf8::encode($y);
        Dump($x);
        Dump($y);
        # Now $x and $y differ only in the setting of the UTF-8 flag
        say substr($x, 0, 1) eq substr($y, 0, 1) ? "equal" : "different";
        __END__
        SV = PV(0x8cd80cc) at 0x8cea9ec
          REFCNT = 1
          FLAGS = (PADMY,POK,pPOK,UTF8)
          PV = 0x8cef6e8 "\303\211ric"\0 [UTF8 "\x{c9}ric"]
          CUR = 5
          LEN = 9
        SV = PV(0x8cd803c) at 0x8ceaa28
          REFCNT = 1
          FLAGS = (PADMY,POK,pPOK)
          PV = 0x8cf38b8 "\303\211ric"\0
          CUR = 5
          LEN = 9
        different
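The two readings can be separated by comparing utf8::upgrade with utf8::encode directly. In the sketch below (variable names arbitrary), upgrading changes only the storage format, while encoding changes the string value itself:

```perl
use strict;
use warnings;

my $x = "\xC9ric";
my $y = "\xC9ric";

utf8::upgrade($x);   # changes the storage format only
utf8::encode($y);    # replaces the string value with its UTF-8 encoding

printf "upgrade: %d elements, first is 0x%X\n", length($x), ord($x);
printf "encode:  %d elements, first is 0x%X\n", length($y), ord($y);
```

After the calls, $x still holds four elements starting with 0xC9, whereas $y holds five elements starting with 0xC3.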
Re: Jargon relating to Perl strings
by tchrist (Pilgrim) on Jan 17, 2012 at 14:00 UTC
    I've certainly struggled with all this myself, trying to come up with a clean and consistent way to talk about these things. I applaud the effort, because of how confused people get over all these things. I have a sneaking suspicion that it's our own fault that folks get confused, although I can't pin my nebulous feeling down any better than I have just now stated.

    One difficulty you're having is that you are sometimes talking about abstract strings but at others about the properties of how physical memory is laid out. That is probably too much to ask for all in one go. Even if that is your real goal here, I would exercise some caution in the order of presentation.

    If you (initially, and perhaps always) limit the discussion to abstract strings alone, then I do believe that a consistent set of terms can be derived, mostly along the lines that you initially pursue.


    A Perl string is an ordered sequence (like a list or an array) of zero or more individual scalar values. These scalar values are sometimes called code points when the number is what is emphasized, but more often called characters, albeit somewhat misleadingly. The word 'character' has glyphic connotations, or even typewriter-keystroke connotations. It's certainly a massively common shorthand, and perhaps even a reasonably serviceable one, but it is not without its pitfalls.

    A code point that fits within 8 bits is sometimes called a byte. A code point that fits within 21 bits is sometimes called a Unicode code point. Perl's code points are not limited to 21 bits, but to the size of your system's largest unsigned integer, probably either 32 or 64 bits.

    Unicode recognizes only two abstractions: code points and grapheme sequences. Both are determinable programmatically. A code point corresponds to what the programmer is apt to think a character to mean, being an individual scalar element in a string.

    However, a grapheme sequence is more apt to correspond to what the end-user thinks of as a character, because it looks like a single glyph. For example, the letter b with an acute accent is a grapheme that the user will think of as just one solitary character, whereas the programmer is apt to think of it as a sequence of two distinct code points. A very common grapheme that requires two code points is the sequence of a carriage return immediately followed by a line feed.
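Both examples can be checked programmatically with the \X regex escape, which matches one extended grapheme cluster. This is a sketch with a hypothetical helper name, assuming a Perl recent enough (5.12 or later) to treat CR LF as a single cluster:

```perl
use strict;
use warnings;

# Hypothetical helper: count grapheme clusters by matching \X repeatedly.
sub count_graphemes {
    my ($s) = @_;
    my $count = () = $s =~ /\X/g;
    return $count;
}

my $b_acute = "b\x{301}";   # LATIN SMALL LETTER B + COMBINING ACUTE ACCENT
my $crlf    = "\r\n";       # carriage return + line feed

printf "b + acute: %d code points, %d grapheme(s)\n",
    length($b_acute), count_graphemes($b_acute);
printf "CR LF:     %d code points, %d grapheme(s)\n",
    length($crlf), count_graphemes($crlf);
```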


    As you see, I've completely dodged the whole UTF8-flag thing. By giving definitions of string components for just code points alone (and graphemes built up of code points) as the fundamental constituent string components, not about bytes and characters, I (try to) avoid the thornier issues.

    I do not believe the UTF8-flag should be part of the initial presentation, which should have at its heart the simple abstract scalar elements (here, code points, meaning 'character' numbers) of which all Perl strings are made. Abstract code points are the indivisible atoms from which our molecular strings are composed.

    It is my feeling that you have to present a clear picture of how Perl strings work in the abstract before you can get to messy and complicated matters of serialization schemes in physical memory or on disk. For those who need to talk about serializations, which I stress is comparatively few, then and only then can you further elaborate the dirty parts for this much smaller audience.

    But I fear you are going to run into serious trouble if you do so atop an existing notional framework that has pre-existing and conflicting senses for 'byte' and 'character'. Those two terms have too many meanings in other programming languages, so if one stays clear of them, one avoids people thinking they are things they are not.

    Best perhaps to leave it at code point, and perhaps hem a bit about grapheme clusters. That's all that matters in an abstract string; physical memory is a different matter, of course.

      If you (initially, and perhaps always) limit the discussion to abstract strings alone,

      This reference will primarily be used when reporting instances of The Unicode Bug. The need for terms for the internals is inevitable as the respondent is sure to bring them into the discussion.

      Your passage would be great in documentation, but it does not serve my needs.

      Unicode recognizes only two abstractions: code points and grapheme sequences.

      I specifically avoided touching Unicode. Unicode has well defined terms, and the conversations in which they would be used differ from the conversations in which those I listed would be used.

      As you see, I've completely dodged the whole UTF8-flag thing. By giving definitions of string components for just code points alone (and graphemes built up of code points) as the fundamental constituent string components, not about bytes and characters, I (try to) avoid the thornier issues.
      And by avoiding the UTF-8 flag/encoding, you're creating confusion. $x = "\xBB"; utf8::upgrade($x);. Now it's not clear to me whether you consider $x to be a byte or not. One can encode 0xBB in 8 bits (and it is encoded in 8 bits in LATIN-1), but its UTF-8 encoding uses 16. So, if you say "A code point that fits within 8 bits is sometimes called a byte", that's ambiguous. Whether or not the code point 0xBB fits in 8 bits depends on its encoding.
Re: Jargon relating to Perl strings
by wrog (Monk) on Jan 17, 2012 at 21:34 UTC
    Under Basics
    The sequence of string elements in a string. This is not affected by the string's UTF8 flag.
    This is not what you want to say (it immediately confused me because my first thought was, "This is wrong because if you have a string with non-ASCII characters in it and change its UTF8 flag, that will change the sequence of elements.")

    What I think you meant to say

    The sequence of string elements in a string, irrespective of any particular choice of memory representation being used for that string
    and only bring up the UTF8 flag later.

    Also, IMHO there needs to be distinct terminology for

    1. a grouping of (usually but not always) 8 consecutive bits of physical storage
    2. the abstract array element in the case where all elements are expected to be in the range 0-255, regardless of the actual storage format
    the problem being (as noted by others) that most people associate "byte" with (1). Using it for (2) is unlikely to reduce the confusion out there and not having distinct terminology makes it difficult to talk about storage formats.

    One could possibly commandeer "octet" for (2) but realize that "octet" originated in the RFC world where a word was needed to refer to physical storage in the specific case where bytes explicitly are known to be 8 bits. On the other hand, most of the stuff in the RFC world is indeed trying to abstract away from specific hardware, so one could justify its usage in a more abstract sense that way. And "octet" does, at least, immediately imply 0-255, unlike "byte"

    There's also the small matter that it really doesn't make a whole lot of sense to use UTF-8-flag-on format to store something that is composed of octets, even if it is indeed possible to do. Which is why people conflate octet strings with the UTF-8-flag-off format and its 1-1 correspondence between octets and bytes
    (...and thus why it is indeed important to point out that UTF-8-flag-on octet strings are possible, albeit silly...)

    Update: yeah I edited this... sorry.

      This is not what you want to say

      Indeed. I discovered this reading JavaFan's post.

      What I think you meant to say

      Perfect. I shall update.

      a grouping of (usually but not always) 8 consecutive bits of physical storage

      UTF8=0 storage format.

      the problem being (as noted by others) that most people associate "byte" with (1)

      If so, then reading 5 bytes produces a variable number of bytes*. That doesn't jibe.

      If I read 5 bytes from a file, what I get is 5 bytes as far as I'm concerned. I'm open to a better word than "byte" for this, but I haven't come across one.
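A sketch of the situation being described, using an in-memory file with made-up content: five bytes are read, the buffer is then upgraded, and it still holds five string elements even though its PV has grown.

```perl
use strict;
use warnings;
use bytes ();   # for bytes::length only; the pragma itself stays off

# In-memory "file" holding bytes; the content is invented for this sketch.
my $data = "caf\xE9s and more";
open(my $fh, '<:raw', \$data) or die "open: $!";

my $n = read($fh, my $buf, 5);
close($fh);

# Changing the storage format grows the PV, but not the string value.
utf8::upgrade($buf);

printf "read %d, length %d, PV length %d\n",
    $n, length($buf), bytes::length($buf);
```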

      One could possibly commandeer "octet"

      An octet is simply an 8-bit byte. But since that's what byte means in all relevant circumstances anyway*, "octet" is no better than "byte".

      * — Perl doesn't currently support systems with byte sizes other than 8.

      There's also the small matter that it really doesn't make a whole lot of sense to use UTF-8-flag-on format to store something that is composed of octets, even if it is indeed possible to do.

      No, but it can happen. Say you have:

      # … as C9 in source.
      print $bin_fh "AX100…X";

      And say one day you convert your source files to UTF-8.

      # … as C3 89 in source.
      use utf8;
      print $bin_fh "AX100…X";

      The code is still fine, yet the string in the latter has UTF8=1.
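That the code is "still fine" can be checked by printing the same string of bytes in both storage formats to a binary handle and comparing what gets written. The sketch below uses an in-memory handle and a hypothetical helper; the byte 0xC9 is taken from the comment in the example above, and utf8::upgrade stands in for the effect of "use utf8" on the literal.

```perl
use strict;
use warnings;

# Hypothetical helper: print $s to an in-memory binary handle
# and return exactly what was written.
sub written_bytes {
    my ($s) = @_;
    my $buffer = '';
    open(my $bin_fh, '>:raw', \$buffer) or die "open: $!";
    print {$bin_fh} $s;
    close($bin_fh) or die "close: $!";
    return $buffer;
}

my $flag_0 = "AX100\xC9X";  utf8::downgrade($flag_0);   # UTF8=0 storage format
my $flag_1 = "AX100\xC9X";  utf8::upgrade($flag_1);     # UTF8=1 storage format

my $out_0 = written_bytes($flag_0);
my $out_1 = written_bytes($flag_1);

print "same output\n" if $out_0 eq $out_1;
```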

        Perfect. I shall update.
        except you updated the wrong thing. It's the sentence "This is not affected by the string's UTF8 flag," under Basics->"String Value" that's tripping me (and apparently also JavaFan) up and that needs to either go away or be changed.
        a grouping of (usually but not always) 8 consecutive bits of physical storage
        UTF8=0 storage format.
        No, any storage format. In order to talk about storage formats at all you need a word for the raw underlying bytes whatever they are and however they're to be interpreted, and redefining "byte" to mean something else makes this really difficult.

        You need a different word, and you're probably right that "octet" isn't a great choice either, so I had another thought: How about one of the following to refer to string elements that are constrained to lie in the 0-255 range?

        • "octetchar"
        • "octet-character"
        • "bytecharacter"
        • "byte-character"
        • "bytechar"
        as opposed to "general character" or "Unicode character" when the full Unicode (or UV) range is possible. This way you're emphasizing that they're still characters in the sense that everybody agrees on (i.e., they're elements of a string and we're abstracting away from how they're represented). If you then say that a single octetchar can actually be multiple bytes in the UTF8=1 storage format, the meaning is clear.
Re: Jargon relating to Perl strings
by BrowserUk (Pope) on Jan 17, 2012 at 23:54 UTC

    Any glossary that starts out by trying to redefine the (industry standard) term "byte" to mean something other than the "minimally addressable unit of memory by the processor", invalidates itself from that point on. Just pissing in the wind.

    As for "the Unicode bug", Unicode is the bug.

    There's no point in rehashing the arguments, but one thing worth saying is that the longer people like you continue to try to rewrite history in this way, in an attempt to excuse the broken Perl implementation of Unicode, the longer it will be before we can get back to a world of sensible, sane, predictable and intuitively usable semantics.


