PerlMonks  

Re: How does the built-in function length work?

by moritz (Cardinal)
on Dec 02, 2011 at 14:42 UTC ( [id://941335]=note )


in reply to How does the built-in function length work?

There are two ways that perl 5 stores strings. If the UTF8 flag is set, length returns the number of characters, not bytes.

If the flag is not set, perl assumes that the encoding is ISO-8859-1, and there the number of bytes is equal to the number of characters.
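As a minimal sketch of the point about length (the variable names are illustrative, not from the post): the same one-character string reports the same length whether or not the UTF8 flag is set, while undecoded UTF-8 octets are counted one per byte until they are decoded.

```perl
use strict;
use warnings;

# One character, code point U+00E4; a plain literal is stored with the UTF8 flag off.
my $native = "\xE4";

# Same characters, but the internal storage is switched to UTF-8 (flag on).
my $upgraded = $native;
utf8::upgrade($upgraded);

# The two UTF-8 octets that encode U+00E4, before and after decoding.
my $octets  = "\xC3\xA4";
my $decoded = $octets;
utf8::decode($decoded);

printf "native:   flag=%d length=%d\n", utf8::is_utf8($native)   ? 1 : 0, length($native);
printf "upgraded: flag=%d length=%d\n", utf8::is_utf8($upgraded) ? 1 : 0, length($upgraded);
printf "octets:   length=%d\n", length($octets);
printf "decoded:  length=%d\n", length($decoded);
printf "native eq upgraded: %s\n", $native eq $upgraded ? "yes" : "no";
```

The two flag states compare equal and both have length 1; only the undecoded octet string has length 2.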

Re^2: How does the built-in function length work?
by ikegami (Patriarch) on Dec 02, 2011 at 19:30 UTC

    I usually agree 100% with your posts, but not today.

    Perl operators that deal with text (regex and uc and the like) expect Unicode code points. (Formerly: Expected Unicode code points or ASCII depending on UTF8 flag.)

    Perl operators that deal with file names (open, stat, etc) expect the file names to be bytes.

    Perl never assumes or expects iso-8859-1.

      (Formerly: Expected Unicode code points or ASCII depending on UTF8 flag.)

      I guess that what you call "Unicode code point" is what I call "ISO-8859-1". ISO-8859-1 is simply the encoding that maps the byte values from 0 to 255 to the Unicode codepoints from 0 to 255, in that order.

      Perl never assumes or expects iso-8859-1.
      $ echo -e "\xE4"|perl -wE 'say <> ~~ /\w/'
      1
      $ # this is perl 5.14.1

      Since no decoding step happened here, and <> is a binary operation, and the regex match is a text operation, perl has to assume a character encoding. And that happens to be ISO-8859-1. Or what do you think it is, if not ISO-8859-1?

        Or what do you think it is, if not ISO-8859-1?
        EBCDIC? Binary? ISO-8859-15?

        perl has to assume a character encoding.

        Not at all. If it must assume an encoding, and that encoding is iso-8859-1 for

        "\x{E4}" =~ /\w/

        then what encoding is assumed for the following?

        "\x{2660}" =~ /\w/

        It never deals with any encoding. It always deals with string elements (characters). And those string elements (characters) are required to be Unicode code points.

        • Character E4 is taken as Unicode code point E4, not some byte produced by iso-8859-1.
        • Character 2660 is taken as Unicode code point 2660, not some byte produced by iso-8859-1.

        It's entirely up to you to create a string with the right elements, which may or may not involve character encodings.
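        A small sketch of what "creating a string with the right elements" can look like, assuming the core Encode module and an input byte chosen for illustration (0xA4, which differs between ISO-8859-1 and ISO-8859-15). The byte itself carries no encoding; the code point you end up with depends on the decoding step you choose, or on choosing none.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A single input byte, as it might arrive from a file or a pipe.
my $byte = "\xA4";

# Explicit decoding: the caller decides which encoding the byte was in.
my $as_latin1 = decode('ISO-8859-1',  $byte);   # U+00A4 CURRENCY SIGN
my $as_latin9 = decode('ISO-8859-15', $byte);   # U+20AC EURO SIGN

# No decoding: the string element 0xA4 is simply used as code point U+00A4.
printf "undecoded:   U+%04X\n", ord($byte);
printf "ISO-8859-1:  U+%04X\n", ord($as_latin1);
printf "ISO-8859-15: U+%04X\n", ord($as_latin9);
```

        Skipping the decode step gives the same code point as decoding from ISO-8859-1, which is why the two descriptions in this thread are hard to tell apart for values 0 to 255.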

        Or what do you think it is, if not ISO-8859-1?

        A Unicode code point, regardless of the state of the UTF8 flag.

        • Character E4 (UTF8=0) is taken as Unicode code point E4, not some byte produced by iso-8859-1.
        • Character E4 (UTF8=1) is taken as Unicode code point E4, not some byte produced by iso-8859-1.
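        The two bullets above can be checked with a short sketch. It assumes perl 5.14 or later with the unicode_strings feature in effect (here via "use v5.14", and also what -E turns on in the one-liner earlier in the thread); without that feature, older perls applied the "formerly" behaviour mentioned above to the UTF8=0 case.

```perl
use v5.14;       # enables strict and the unicode_strings feature
use warnings;

# Character E4 stored as a single byte, UTF8 flag off.
my $downgraded = "\xE4";
utf8::downgrade($downgraded);

# Character E4 stored internally as UTF-8, UTF8 flag on.
my $upgraded = "\xE4";
utf8::upgrade($upgraded);

my $down_matches = $downgraded =~ /\w/ ? 1 : 0;
my $up_matches   = $upgraded   =~ /\w/ ? 1 : 0;

printf "UTF8=%d matches \\w: %d\n", utf8::is_utf8($downgraded) ? 1 : 0, $down_matches;
printf "UTF8=%d matches \\w: %d\n", utf8::is_utf8($upgraded)   ? 1 : 0, $up_matches;
```

        Both lines report a match, since U+00E4 is a letter in either storage format.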

        In short, you're overcomplicating things. It's NOT:

        Each character is expected to be an iso-8859-1 byte if UTF8=0 or a Unicode code point if UTF8=1.

        It's simply:

        Each character is expected to be a Unicode code point.
