PerlMonks  

Re: Jargon relating to Perl strings

by JavaFan (Canon)
on Jan 17, 2012 at 10:53 UTC (#948292)


in reply to Jargon relating to Perl strings

I find your definition of byte confusing, and I think most people use it differently. According to your definition,
$x = "\xEC"; utf8::upgrade($x);
now $x consists of a single byte, even though it takes 16 bits to encode.

Perhaps the confusion comes from saying that for your definition of a byte, the UTF8 flag doesn't matter, yet it refers to a string element, which is defined in terms of substr, for which the UTF8 flag *does* matter.

I'd say that in my example, $x ends up having 2 bytes, but one character. This is also the distinction wc makes (-c counts bytes, -m counts characters).
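The "2 bytes, but one character" reading can be checked with a minimal sketch. The variable names $chars, $octets and $bytes are illustrative only, and "bytes" here is taken to mean octets of the UTF-8 encoding, which is one reading of the term among several in this thread.

```perl
use strict;
use warnings;

my $x = "\xEC";
utf8::upgrade($x);            # same string value, now stored internally as UTF-8

my $chars = length($x);       # length() counts string elements (characters)

my $octets = $x;              # work on a copy so $x itself is left alone
utf8::encode($octets);        # the copy now holds the UTF-8 octets of $x
my $bytes = length($octets);  # number of octets in the UTF-8 encoding

print "characters: $chars, encoded octets: $bytes\n";
```

Under these assumptions the script reports one character and two encoded octets.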

Of course, you are free to use whatever definition you want -- just do mind that not all people share your definition. Some people prefer not to use the term byte at all, just character and octet.

Re^2: Jargon relating to Perl strings
by ikegami (Pope) on Jan 17, 2012 at 22:23 UTC

    now $x consists of a single byte.

    Yes, that's what I call a byte. So maybe it's my definition, not my term that's unclear.

    which is defined in terms of substr, for which the UTF8 flag *does* matter

    Ah, there's the problem. "The UTF8 flag doesn't matter" means different things to us. For a given string, substr will always return the same value regardless of the UTF8 flag, so I say the UTF8 flag doesn't matter to substr.

    use feature 'say';

    my $flag_is_0 = "\xC9ric"; utf8::downgrade($flag_is_0);
    my $flag_is_1 = "\xC9ric"; utf8::upgrade($flag_is_1);
    say substr($flag_is_0, 0, 1) eq substr($flag_is_1, 0, 1) ? 1 : 0;   # 1
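As a rough sketch of the same point, the flag can be inspected directly with utf8::is_utf8 alongside the string-level comparisons. The names $down, $up and the result variables are illustrative and not from the post.

```perl
use strict;
use warnings;

my $down = "\xC9ric"; utf8::downgrade($down);   # UTF8 flag off
my $up   = "\xC9ric"; utf8::upgrade($up);       # UTF8 flag on

# The internal flag differs between the two scalars...
my $flags_differ = (utf8::is_utf8($down) ? 1 : 0) != (utf8::is_utf8($up) ? 1 : 0);

# ...but string-level operations see the same string.
my $same_value  = $down eq $up;
my $same_length = length($down) == length($up);
my $same_first  = ord(substr($down, 0, 1)) == ord(substr($up, 0, 1));

print "flags differ: ", ($flags_differ ? "yes" : "no"), "\n";
print "same value, length and first element: ",
    ($same_value && $same_length && $same_first ? "yes" : "no"), "\n";
```

Both lines should report "yes": the two scalars are stored differently, yet eq, length and substr treat them as the same string.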

    I shall endeavor to find something clearer.

    just do mind that not all people share your definition

    Thus this post. If I refer them to this post, they can understand what I say even if their definitions are different.

      Right, we do mean something different by "the UTF-8 flag doesn't matter". I interpret that as: the only difference between the internal representations of the strings is whether the UTF-8 flag is set or not -- but you use it to mean "it doesn't matter whether the internal encoding is UTF-8 or not".
      use feature 'say';
      use Devel::Peek;

      my $x = my $y = "\xC9ric";
      utf8::upgrade($x);
      utf8::upgrade($y); utf8::encode($y);
      Dump($x); Dump($y);
      # Now $x and $y differ only in the setting of the UTF-8 flag
      say substr($x, 0, 1) eq substr($y, 0, 1) ? "equal" : "different";
      __END__
      SV = PV(0x8cd80cc) at 0x8cea9ec
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK,UTF8)
        PV = 0x8cef6e8 "\303\211ric"\0 [UTF8 "\x{c9}ric"]
        CUR = 5
        LEN = 9
      SV = PV(0x8cd803c) at 0x8ceaa28
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK)
        PV = 0x8cf38b8 "\303\211ric"\0
        CUR = 5
        LEN = 9
      different
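The same situation can be sketched without Devel::Peek by looking at what the two scalars contain at the string level. This is only an illustration; the variable names other than $x and $y are made up for the example.

```perl
use strict;
use warnings;

my $x = my $y = "\xC9ric";
utf8::upgrade($x);                     # 4 characters, stored as 5 octets, flag on
utf8::upgrade($y); utf8::encode($y);   # same 5 octets, flag off: now 5 characters

my $len_x   = length($x);              # 4
my $len_y   = length($y);              # 5
my $first_x = ord(substr($x, 0, 1));   # 0xC9
my $first_y = ord(substr($y, 0, 1));   # 0xC3, first octet of the UTF-8 encoding

printf "x: %d elements, first is 0x%02X\n", $len_x, $first_x;
printf "y: %d elements, first is 0x%02X\n", $len_y, $first_y;

# Decoding $y turns the octets back into the original characters.
my $decoded_ok  = utf8::decode($y);
my $equal_again = $x eq $y;
print "equal after decode: ", ($equal_again ? "yes" : "no"), "\n";
```

Seen this way, $y is a different string from $x (five elements instead of four), which is why substr returns different values for them even though the stored octets are identical.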
        This has already been addressed in the OP, in case you didn't notice.
