These are terms I use to refer to certain elements or properties of Perl strings. I'm posting this as a reference, not to advocate them.

Basics

Character, or string element

An element of a string, as in what substr($s,$i,1) returns. It's a 72-bit value in theory, but it's limited to the size of a UV in practice.
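As a rough illustration of the definition above (a sketch only; the variable names are made up for this example), the following walks a string element by element with substr and ord, and shows that an element need not fit in a byte or even be a valid Unicode code point.

```perl
use strict;
use warnings;

# A string whose elements are not all in [0, 255].
my $s = "a\x{E9}\x{2660}";

# length() counts string elements, not bytes of storage.
my $len = length($s);

# substr() returns one element; ord() gives its numeric value.
my @values = map { ord(substr($s, $_, 1)) } 0 .. $len - 1;

printf "%d elements: %s\n", $len, join(" ", map { sprintf "0x%X", $_ } @values);
# prints "3 elements: 0x61 0xE9 0x2660"

# chr() accepts values beyond the Unicode range; the result is still one element.
my $big = chr(0x10FFFF + 1);
printf "length=%d ord=0x%X\n", length($big), ord($big);
# prints "length=1 ord=0x110000"
```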

"Character" is the term used to document Perl functions that work on arbitrary strings (substr, index, reverse, chr, ord, etc) and it's the term used in Wikipedia's definition of "string".

String value

The sequence of string elements in a string (irrespective of the choice of storage format used for that string).

String element semantics

Byte

A string element whose value is understood/expected to be in [0, 255] (irrespective of the choice of storage format used for that string).

Code point, or Unicode code point

A string element whose value is understood/expected to be a Unicode code point (irrespective of the choice of storage format used for that string).

String semantics

Bytes, or string of bytes

A string whose value is understood/expected to be a sequence of values in [0, 255] (irrespective of the choice of storage format used for that string).

Text, or decoded text

A string whose value is understood/expected to be a sequence of Unicode code points (irrespective of the choice of storage format used for that string).
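To make the distinction between text and bytes concrete, here is a minimal sketch using the core Encode module. The variable names are hypothetical; the point is only that the same information can be held either as a string of code points or as a string of bytes, and that the two have different string values.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Decoded text: two code points, U+00E9 and U+2660.
my $text = "\x{E9}\x{2660}";

# String of bytes: the UTF-8 encoding of that text (C3 A9 E2 99 A0).
my $bytes = encode('UTF-8', $text);

printf "text has %d code points, bytes has %d bytes\n",
    length($text), length($bytes);
# prints "text has 2 code points, bytes has 5 bytes"

# Decoding the bytes gives back a string with the original string value.
my $round_trip = decode('UTF-8', $bytes);
print $round_trip eq $text ? "round trip ok\n" : "round trip failed\n";
# prints "round trip ok"
```

Nothing in either string records whether it is meant as text or as bytes. That is why these are described as semantics: the distinction lives in what the code expects.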

String storage formats

UTF8=0 storage format

The format of the PV in a string whose UTF8 flag is clear (0).

It's unambiguous, but it's quite a mouthful. Some use "byte string", but those who do tend to also use it for what I call "string of bytes".

UTF8=1 storage format

The format of the PV in a string whose UTF8 flag is set (1).

It's unambiguous, but it's quite a mouthful. Some use "character string", but that's incorrect because all strings are made of characters by definition.
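The two storage formats can be observed from Perl with utf8::is_utf8, which reports the UTF8 flag. The sketch below (variable names are illustrative) shows that switching formats with utf8::upgrade and utf8::downgrade leaves the string value untouched.

```perl
use strict;
use warnings;

my $x = "\xE9";  utf8::downgrade($x);   # UTF8=0 storage format
my $y = "\xE9";  utf8::upgrade($y);     # UTF8=1 storage format

# utf8::is_utf8() reports the UTF8 flag, i.e. the storage format only.
my $x_flag = utf8::is_utf8($x) ? 1 : 0;
my $y_flag = utf8::is_utf8($y) ? 1 : 0;

# The string value (the sequence of elements) is the same in both.
my $same = ($x eq $y) && length($x) == length($y) && ord($x) == ord($y);

print "UTF8 flags: $x_flag $y_flag; same string value: ",
    ($same ? "yes" : "no"), "\n";
# prints "UTF8 flags: 0 1; same string value: yes"
```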

Other

The Unicode Bug

If changing the internal storage format of a string changes how a piece of code behaves, that code suffers from The Unicode Bug.

For example, the following code suffers from The Unicode Bug.

use feature qw( say );

use Inline C => <<'__EOI__';

STRLEN mylength(SV* sv) {
    STRLEN len;
    (void)SvPV(sv, len);
    return len;
}

__EOI__

$x = "\xE9";  utf8::downgrade($x);
$y = "\xE9";  utf8::upgrade($y);

say $x eq $y ? "equal" : "not equal";   # equal
say mylength($x);                       # 1
say mylength($y);                       # 2
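The same kind of bug can be reproduced without any C. This sketch assumes a script that has not enabled the unicode_strings feature (so no "use v5.12" or later) and a Perl recent enough to support the /u regex flag (5.14+). Under those assumptions, \w gives different answers for two strings with the same string value.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or later.

my $x = "\xE9";  utf8::downgrade($x);   # UTF8=0 storage format
my $y = "\xE9";  utf8::upgrade($y);     # UTF8=1 storage format

# Default regex semantics: the result depends on the storage format.
my $x_word = $x =~ /^\w$/ ? 1 : 0;
my $y_word = $y =~ /^\w$/ ? 1 : 0;
print "default: $x_word $y_word\n";     # prints "default: 0 1"

# With /u, both strings are matched using Unicode rules.
my $x_u = $x =~ /^\w$/u ? 1 : 0;
my $y_u = $y =~ /^\w$/u ? 1 : 0;
print "with /u: $x_u $y_u\n";           # prints "with /u: 1 1"
```

Enabling the unicode_strings feature, or using /u on the pattern, is the usual way to remove the dependence on storage format in cases like this one.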

Other related terms I've seen used

Byte string

This usually refers to the UTF8=0 storage format, but it could also refer to a string of bytes.

Character string

This usually refers to the UTF8=1 storage format. The term is incorrect since all strings are made of characters by definition.

Byte semantics

This usually refers to how code behaves when given a string in the UTF8=0 storage format, in distinction to how it behaves when given a string in the UTF8=1 storage format. Code that makes such a distinction suffers from The Unicode Bug.

Character semantics

This usually refers to how code behaves when given a string in the UTF8=1 storage format, in distinction to how it behaves when given a string in the UTF8=0 storage format. Code that makes such a distinction suffers from The Unicode Bug.

Update: Changed "regardless of the value of its UTF8 flag" to something clearer in response to JavaFan's and wrog's comments.
Update: By request, added end tags for DT, DD and P elements even though they are optional.


In reply to Jargon relating to Perl strings by ikegami
