http://www.perlmonks.org?node_id=863082

flexvault has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I was helped and learned from the answers to my post Substitution, '+' in first position of Pattern. But, after thinking about the requirement for quotemeta or \Q, I realize I must have a 'grave' misunderstanding about how perl keeps variables internally. Most of my scripts work on files, and I have tested that all 256 (0-255) possible byte values are available (Why I love perl).

So my question is:

Are all string variables internally represented by backslashing non-alphanumeric characters or are there internally different types of string variables?

Thank you

Note: I searched for this first, but I got too much information about utf8.

Re: Perl5 Internal Representation of string variable
by ig (Vicar) on Oct 02, 2010 at 17:42 UTC

    perlunitut Definitions gives definitions for text, character, unicode (three terms for the same thing) and binary and byte (two terms for the same thing) strings. It also says a little about internal format. The other documents referenced in the SEE ALSO section provide more detail.

    While the basic data types in Perl are three: scalar, array and hash, this is only one level of abstraction. If you delve a little deeper, you find that there are five types of values that can be loaded into a scalar, one of which is a string. There are various data structures used to store these types of values. You can read about this and more in perlguts and B. Other good sources on the internal data structures are Perl 5 Internals - Internal Variables and PerlGuts Illustrated. Ultimately, you might spend some time studying the source code, which is freely available but not a trivial endeavor.

      Thank you

      ig's answer pointed to Perl 5 Internals - Internal Variables, which, had I known about it earlier, would probably have kept me from wasting anyone's time. Using Devel::Peek, I could see exactly how the internal representation of the string variable was defined. Great answer.

      Marshall's answer was correct initially, since all string-type scalars end with a '\0' and also allow '\0' bytes within their content.

      I compile all my own versions of perl, so I have often looked at the source, but I agree it is "not a trivial endeavor."

Re: Perl5 Internal Representation of string variable
by halfcountplus (Hermit) on Oct 02, 2010 at 16:11 UTC

    It would not make sense to escape or backslash any characters at all in the internal representation -- they are, byte for byte, exactly what they are.

    A perl "string" is just a contextual perspective on the scalar datatype. Eg, if you want to compare two scalars that contain string values, you would use "eq" to indicate that is the context. If the scalars contain numerical values and you want to compare them as numbers, you would use "==". You can use "eq" on scalars that are just numbers which treats them, contextually, as strings. Ie, "string" is not a datatype in perl. There are only three datatypes: scalars, arrays, hashes.
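A minimal sketch of the context point above. The variable names are only illustrative; the same two scalars compare differently depending on whether the operator asks for numeric or string context:

```perl
use strict;
use warnings;

# Two scalars that are equal as numbers but differ as strings.
my $x = "10";
my $y = "10.0";

my $numeric_equal = ($x == $y) ? 1 : 0;  # == compares in numeric context
my $string_equal  = ($x eq $y) ? 1 : 0;  # eq compares in string context

print "numeric: $numeric_equal, string: $string_equal\n";
```

Nothing about the stored values changes between the two comparisons; only the operator decides how they are viewed.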

      Okay, now I'm confused at a higher level!

      My understanding of the perl scalar is the same as you describe, so is it the s/// operator that requires backslash characters?

      Thank you

        You need the backslash \ when the following character has a special meaning. Whether that is required or not is context dependent. In a regex you have to backslash the [ character because that character has a special meaning in a regex. But in a print statement this is not necessary.
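A small sketch of that difference, assuming a sample string of my own choosing. The unescaped pattern is built from a variable inside eval so the compile failure can be observed rather than stopping the script:

```perl
use strict;
use warnings;

my $text = "list[0]";

# In a double-quoted string passed to print, "[" is an ordinary character.
print "an opening bracket: [\n";

# In a regex, "[" starts a character class, so a literal one needs a backslash.
my $has_bracket = ($text =~ /\[/) ? 1 : 0;

# An unescaped "[" is not a valid pattern. qr// dies when it compiles it,
# and eval traps that error.
my $bad = '[';
my $compiled = eval { qr/$bad/ };
my $compile_failed = defined $compiled ? 0 : 1;

print "has bracket: $has_bracket, unescaped pattern failed: $compile_failed\n";
```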
Re: Perl5 Internal Representation of string variable
by Marshall (Canon) on Oct 02, 2010 at 17:03 UTC
    Simple answer is no.

    A Perl ASCII string is pretty much like a string in 'C'. This is a sequence of bytes in memory terminated by a x00 byte. Each character is encoded as one byte. In Perl you will never come across this last "null" byte. For fancy multi-byte character sets, I defer to wiser Monks than me.

    In a 'C' or a Perl string, you will come across things like \n and \r. That backslash means, "hey, this is NOT an 'n' or an 'r'; this is something special and means 0x0A (new line) or 0x0D (carriage return) respectively."

    In a Perl string, if you want something that otherwise would have a meaning, like the double quote character " to be taken literally (not part of Perl's translations), you put a \ in front of it.

    print " this is a double quote, a \" \n";
    In other words, this backslash thing means that the character which follows should be interpreted with a special meaning, if there is any such meaning. In the above, the backslash before the double quote (") means, hey this is not the end of the print statement quote, but rather please print literally a double quote. The \n means: this is not an "n", but rather a 0x0A character.

    I guess this is as clear as mud, but I tried.

    Updated with some strike-thru's.

      It will be helpful to distinguish between the strings that scalars contain and quoted strings, including string literals. Escape characters and special codes, such as "\r", "\n", "\t", etc. have special meaning in some quoted strings, depending on the type of quotes used (see Quote and Quote like Operators), but not in the values of scalars. The values of scalars are sequences of characters or bytes, depending on whether they are character or byte strings. A string may contain the characters '\' and 'n', but neither has any special meaning in that context - they are just characters (or bytes) in the string.
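To illustrate the distinction between a quoted string and the value it produces, here is a minimal sketch (variable names are mine). The same four characters of source text yield different scalar values depending on the quotes used:

```perl
use strict;
use warnings;

my $double = "a\nb";  # double quotes: \n is turned into one newline character
my $single = 'a\nb';  # single quotes: backslash and n are kept as two characters

my $double_len = length $double;
my $single_len = length $single;

# Position of the literal backslash in each value (-1 means not present).
my $backslash_in_double = index($double, "\\");
my $backslash_in_single = index($single, "\\");

print "double: $double_len chars, single: $single_len chars\n";
```

Once the value is in the scalar, the backslash in `$single` is just another character with no special meaning.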

      re: ". . .bytes in memory terminated by a x00 byte. "

      In 'C' => 'YES', in Perl => 'want to know???'

      Either perl keeps a count of characters for the variable or it has some other mechanism to determine the size of a variable that contains a group of bytes. It can't use x00 as a terminator, since perl wonderfully allows x00 to be a valid byte.

        Yes, indeed,
        my $string = "asdf \x00 some more";
        print $string;
        does print past the asdf. So, a Perl string does know "how big it is".
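A short sketch extending the snippet above, on the assumption that length, index and substr are a fair way to show that the NUL is an ordinary member of the string:

```perl
use strict;
use warnings;

my $string = "asdf \x00 some more";

# length counts every character, including the embedded NUL.
my $len = length $string;

# The NUL can be located and stepped over like any other character.
my $nul_at = index($string, "\x00");
my $tail   = substr($string, $nul_at + 1);

print "length: $len, NUL at index: $nul_at, tail: '$tail'\n";
```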
Re: Perl5 Internal Representation of string variable
by muba (Priest) on Oct 03, 2010 at 02:51 UTC

    What you have to realize here is that you're dealing with two entirely different concepts, one of which is seamlessly integrated into the other.

    On one hand, we have Perl. In Perl, the character + has a special meaning, which in most cases would be the additive operator (IE, it adds two numbers). " is another example of a character with a special meaning. In most cases " comes in pairs and a pair of these double quotes is one of many ways to mark the part between the pair of quotes as a string. + inside a string has no special meaning.

    So, basically, $foo = 3 + 4 assigns the value of 7 to $foo. $foo = "3 + 4" assigns the string of 3 + 4 to $foo, but doesn't add 3 and 4.

    The second concept we're dealing with is regular expressions. In regexes, + has a special meaning too. It's not an additive operator (so m/3 + 4/ doesn't result in something somehow matching the value 7). In regexes, + means that the symbol that precedes it is to appear one or more times in the string you're matching against. So m/3 + 4/ would match a literal 3, followed by one or more spaces, then another space, and then a literal 4.
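The behaviour of m/3 + 4/ described above can be checked with a quick sketch. The test strings are my own picks:

```perl
use strict;
use warnings;

# Without /x, spaces in a pattern are literal and + quantifies the space before it.
my $pattern = qr/3 + 4/;

my $two_spaces  = ("3  4"  =~ $pattern) ? 1 : 0;  # 3, two spaces, 4
my $one_space   = ("3 4"   =~ $pattern) ? 1 : 0;  # only one space between 3 and 4
my $literal_sum = ("3 + 4" =~ $pattern) ? 1 : 0;  # contains a real "+" character

# Backslashing the + makes it a literal character again.
my $escaped = ("3 + 4" =~ /3 \+ 4/) ? 1 : 0;

print "two spaces: $two_spaces, one space: $one_space, ",
      "literal text: $literal_sum, escaped pattern: $escaped\n";
```

The pattern needs at least two spaces between the digits, so it matches neither "3 4" nor the text "3 + 4" itself.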

    So if you have a string like, say, $foo = "+bar", then that's just fine. It's a string that consists of a literal "+", followed by the word "bar". Fine. But if you go m/+bar/, then what you want is a bit unclear to me, but it sort of looks like (nothing, one or more times), followed by a literal occurrence of the word "bar".

    Like I said earlier, Perl seamlessly integrates regexes into its language and allows you to do useful things such as m/$foo/, incorporating a string you defined earlier into your regex. However, if $foo = "+bar", then the + inside that string does take its special meaning inside the regex, even though it doesn't mean anything but just a "+" character inside the string.

    So it's not about how Perl internally represents strings, it's all about how regular expressions are a language within a language, in which certain characters or character sequences have special meanings which differ from the meaning they have in Perl. That's the whole point.
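As a sketch of the point, here is the "+bar" string used both raw and quoted inside a pattern. The target string is an assumption for illustration, and eval is used only so the compile error can be inspected:

```perl
use strict;
use warnings;

my $foo    = "+bar";     # an ordinary four-character string
my $target = "foo+bar";

# Interpolated as-is, the + becomes a quantifier with nothing to quantify,
# so the pattern fails to compile. eval traps the error.
my $raw_ok = eval { my $matched = ($target =~ /$foo/); 1 } ? 1 : 0;
my $error  = $raw_ok ? '' : $@;

# \Q...\E backslashes the +, so it is matched as a literal character.
my $quoted_ok = ($target =~ /\Q$foo\E/) ? 1 : 0;

print "raw pattern compiled: $raw_ok\n";
print "quoted pattern matched: $quoted_ok\n";
```

The string in `$foo` is the same in both cases; only the way it is handed to the regex engine differs.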

      Thank you for an excellent explanation. It was/is the regex language within perl that I failed to grasp.

        It was/is the regex language within perl that I failed to grasp.

        You may find My Favourite Regex Tools useful.

        HTH,

        planetscape
Re: Perl5 Internal Representation of string variable
by repellent (Priest) on Oct 03, 2010 at 19:58 UTC
    quotemeta has little to do with how Perl keeps variables internally. You need to separate those concerns to avoid confusion.

    quotemeta is used to insert backslashes preceding non-word characters in a string for the purpose of avoiding regex meta-ness if that string were to be used in a regular expression.
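A brief sketch of what quotemeta returns, using a made-up input. It builds a new string and leaves the original untouched:

```perl
use strict;
use warnings;

my $plain  = "a.b";
my $quoted = quotemeta($plain);  # new string with a backslash before the dot

# As a pattern, the plain string treats "." as "any character";
# the quoted string only matches a literal dot.
my $plain_matches_axb  = ("axb" =~ /$plain/)  ? 1 : 0;
my $quoted_matches_axb = ("axb" =~ /$quoted/) ? 1 : 0;
my $quoted_matches_dot = ("a.b" =~ /$quoted/) ? 1 : 0;

print "plain: $plain, quoted: $quoted\n";
```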

    In addition, you need to separate the concerns of how Perl keeps variables internally from how you perceive Perl strings.

    Treat a Perl string as a string of characters (in the abstract sense, not in the char C sense). How each character is stored internally is a separate issue. How each character is represented (as bytes) when you print them out can be decided based on how they are encoded.

    This separation of concerns allows the wonderful use of Unicode. We can rest easy knowing that each character is not limited to 256 or 65536 (or whatever) different types. We treat characters as characters - today, Perl operations like regexp matching work on characters, so do length, substr, etc.

    Miss the old-think where strings were composed of just bytes? Then map the new-think of character strings back to where each character can have 256 different types (1 byte per character) and you'll have things back to the old way. Caveat: if you're taking this approach, you won't be able to represent Unicode characters from Latin Extended onwards.
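To make the character/byte separation concrete, here is a minimal sketch using the core Encode module. The choice of U+263A and of UTF-8 as the encoding is mine, purely for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One abstract character, far outside the 0-255 range.
my $chars = "\x{263A}";

# Encoding turns the character string into a byte string.
my $bytes = encode('UTF-8', $chars);

my $char_len = length $chars;  # counted in characters
my $byte_len = length $bytes;  # counted in bytes

# Decoding the bytes gives back the same character string.
my $round_trip_ok = (decode('UTF-8', $bytes) eq $chars) ? 1 : 0;

print "characters: $char_len, bytes: $byte_len, round trip ok: $round_trip_ok\n";
```

length, substr and regex matching operate on whichever kind of string they are given, so the same functions count characters in one case and bytes in the other.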

    Perl has no concept of NUL-terminated strings. In the example below, when we store a NUL byte as a Perl string, the string is interpreted as having a single NUL character:
    use Devel::Peek;
    my $v = "\x{00}"; # $v is a string with a NUL character
    Dump $v;
    __END__
    SV = PV(0x100801c78) at 0x1008143e8
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x100218910 "\0"\0
      CUR = 1
      LEN = 8

    This may be of interest: Why Not Translate Perl to C?
Re: Perl5 Internal Representation of string variable
by ikegami (Patriarch) on Oct 04, 2010 at 04:23 UTC

    A string literal is a piece of code that instructs Perl how to build the string in memory. The escapes and interpolations are processed (at compile-time and run-time respectively) to produce a string that contains neither.

    For example, the string literal "abc\n" produces a four character string when evaluated.
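A small sketch of that example, with an interpolation added. The `$name` variable is hypothetical, used only to show the run-time half:

```perl
use strict;
use warnings;

# The escape in the literal is processed; the resulting string holds a newline.
my $s         = "abc\n";
my $len       = length $s;
my $last_code = ord(substr($s, -1));  # code of the final character

# Interpolation is resolved when the statement runs; the result
# contains the variable's value, not the text '$name'.
my $name      = "world";
my $greeting  = "hello, $name";
my $has_sigil = (index($greeting, '$') >= 0) ? 1 : 0;

print "length: $len, last character code: $last_code\n";
print "$greeting\n";
```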