Re: Jargon relating to Perl strings

by JavaFan (Canon)
on Jan 17, 2012 at 10:53 UTC (#948292=note)

in reply to Jargon relating to Perl strings

I find your definition of byte confusing, and I think most people use it differently. According to your definition,
$x = "\xEC"; utf8::upgrade($x);
now $x consists of a single byte, even though it requires 16 bits of encoding.

Perhaps the confusion comes from saying that for your definition of a byte, the UTF8 flag doesn't matter, yet it refers to a string element, which is defined in terms of substr, for which the UTF8 flag *does* matter.

I'd say that in my example, $x ends up having 2 bytes, but one character. This is also the distinction wc draws: wc -c counts bytes, while wc -m counts characters.
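The character/octet distinction above can be sketched in Perl roughly as follows. This is only an illustration, assuming that "bytes" here means the octets of the UTF-8 encoding that an upgraded string holds internally; the helper names count_chars and count_octets are hypothetical and do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: number of string elements (what length and substr see).
sub count_chars {
    my ($s) = @_;
    return length $s;
}

# Hypothetical helper: number of octets in the UTF-8 encoding of the string.
# It works on a copy, so the caller's string is left untouched.
sub count_octets {
    my ($s) = @_;
    utf8::encode($s);
    return length $s;
}

my $x = "\xEC";
utf8::upgrade($x);    # same string value, now stored internally as UTF-8

printf "chars=%d octets=%d\n", count_chars($x), count_octets($x);
# prints "chars=1 octets=2"
```

Under the first count $x has one element; under the second it has two, which matches the "2 bytes, but one character" reading.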

Of course, you are free to use whatever definition you want -- just do mind that not all people share your definition. Some people prefer not to use the term byte at all, just character and octet.

Re^2: Jargon relating to Perl strings
by ikegami (Pope) on Jan 17, 2012 at 22:23 UTC

    now $x consists of a single byte.

    Yes, that's what I call a byte. So maybe it's my definition, not my term that's unclear.

    which is defined in terms of substr, for which the UTF8 flag *does* matter

    Ah, there's the problem. "The UTF8 flag doesn't matter" means different things to us. For a given string, substr will always return the same value regardless of the UTF8 flag, so I say the UTF8 flag doesn't matter to substr.

    my $flag_is_0 = "\xC9ric"; utf8::downgrade($flag_is_0);
    my $flag_is_1 = "\xC9ric"; utf8::upgrade($flag_is_1);
    say substr($flag_is_0, 0, 1) eq substr($flag_is_1, 0, 1) ?1:0; # 1
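The same point can be checked for a few more operations than substr. The following is a minimal sketch, not part of the original post, assuming only the core utf8:: functions; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $down = "\xC9ric";
utf8::downgrade($down);   # internal storage: one octet per element
my $up = "\xC9ric";
utf8::upgrade($up);       # internal storage: UTF-8

# The flag differs between the two scalars ...
my $flags_differ = (utf8::is_utf8($down) ? 1 : 0) != (utf8::is_utf8($up) ? 1 : 0);

# ... but element-wise operations see the same string.
my $same_length = length($down) == length($up);
my $same_ord    = ord(substr($down, 0, 1)) == ord(substr($up, 0, 1));
my $same_string = $down eq $up;

print join(" ", map { $_ ? 1 : 0 }
    $flags_differ, $same_length, $same_ord, $same_string), "\n";
# prints "1 1 1 1"
```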

    I shall endeavor to find something clearer.

    just do mind that not all people share your definition

    Thus this post. If I refer them to this post, they can understand what I say even if their definitions are different.

      Right, we do mean something different by "the UTF-8 flag doesn't matter". I interpret it as: the only difference between the internal representations of the strings is whether the UTF-8 flag is set or not -- but you use it to mean "it doesn't matter whether the internal encoding is UTF-8 or not".
      use Devel::Peek;
      my $x = my $y = "\xC9ric";
      utf8::upgrade($x);
      utf8::upgrade($y);
      utf8::encode($y);
      Dump($x);
      Dump($y);
      # Now $x and $y differ only in the setting of the UTF-8 flag
      say substr($x, 0, 1) eq substr($y, 0, 1) ? "equal" : "different";
      __END__
      SV = PV(0x8cd80cc) at 0x8cea9ec
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK,UTF8)
        PV = 0x8cef6e8 "\303\211ric"\0 [UTF8 "\x{c9}ric"]
        CUR = 5
        LEN = 9
      SV = PV(0x8cd803c) at 0x8ceaa28
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK)
        PV = 0x8cf38b8 "\303\211ric"\0
        CUR = 5
        LEN = 9
      different
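For readers without Devel::Peek at hand, the two views in this exchange can be compared with core functions alone. This is a hedged sketch rather than either poster's code: it assumes that "internal representation" means the octets shown in the PV field of the dump, and $x_octets is a name introduced here for illustration.

```perl
use strict;
use warnings;

my $x = my $y = "\xC9ric";
utf8::upgrade($x);
utf8::upgrade($y);
utf8::encode($y);         # $y now holds the UTF-8 octets as its elements

# Seen as strings, the two values differ in length and first element.
my $report = sprintf "%d %d %02X %02X",
    length($x), length($y), ord($x), ord($y);
print "$report\n";
# prints "4 5 C9 C3"

# Seen as stored octets, they agree: encoding a copy of $x yields $y.
my $x_octets = $x;
utf8::encode($x_octets);
print $x_octets eq $y ? "same octets\n" : "different octets\n";
# prints "same octets"
```

The buffers match octet for octet, while length and ord see a 4-element string in $x and a 5-element string in $y.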
        This has already been addressed in the OP, in case you didn't notice.
