PerlMonks  

perlman:perlop2

by gods (Initiate)
on Aug 25, 1999 at 06:09 UTC ( [id://378]=perlman )

perlop2

Current Perl documentation can be found at perldoc.perl.org.

Here is our local, out-dated (pre-5.6) version:

Gory details of parsing quoted constructs

When presented with something that may have several different interpretations, Perl uses the DWIM principle (Do What I Mean, not what I wrote) to pick the most probable interpretation of the source. This strategy is so successful that Perl users usually do not suspect that what they write is ambiguous. However, from time to time Perl's ideas differ from what the author meant.

The goal of this section is to clarify how Perl interprets quoted constructs. The most frequent reason to want the details discussed here is hairy regular expressions. However, the first steps of parsing are the same for all Perl quoting operators, so they are discussed together.

Some of the passes discussed below are performed concurrently, but since the results are the same, we consider them one by one. For different quoting constructs Perl performs a different number of passes, from one to five, but they are always performed in the same order.

Finding the end

The first pass is finding the end of the quoted construct, be it the multichar ender "\nEOF\n" of a <<EOF construct, the / which terminates a qq/ construct, the ] which terminates a qq[ construct, or the > which terminates a fileglob started with <.

When searching for a multichar terminator, no skipping is performed. When searching for a one-char non-matching delimiter, such as /, the combinations \\ and \/ are skipped. When searching for a one-char matching delimiter, such as ], the combinations \\, \] and \[ are skipped, and nested [, ] are skipped as well.

For 3-part constructs, s/// etc., the search is repeated once more.

During this search no attention is paid to the semantics of the construct, thus

    "$hash{"$foo/$bar"}"

or

    m/ 
      bar       #  This is not a comment, this slash / terminated m//!
     /x

do not form legal quoted expressions. Note that since the slash which terminated m// was followed by a SPACE, this is not m//x, thus # was interpreted as a literal #.
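
Because the terminator search ignores what the construct means, the first example can be checked with a string eval. This is only a sketch: %hash, $foo and $bar are made-up names, and the workaround shown (building the key in a separate statement) is one of several possible ones.

```perl
use strict;
use warnings;

# Hypothetical data, used only for illustration.
my %hash = ('a/b' => 'found');
my ($foo, $bar) = ('a', 'b');

# The inner double quote ends the string early, so this source text
# does not compile. A single-quoted heredoc keeps it uninterpreted
# until eval sees it.
my $broken = <<'CODE';
my %hash; my ($foo, $bar) = ('a', 'b');
my $s = "$hash{"$foo/$bar"}";
1;
CODE
my $compiled = eval $broken;    # undef; the syntax error is in $@
my $error    = $@;

# Building the key first avoids nesting the same delimiter.
my $key   = "$foo/$bar";
my $value = "$hash{$key}";

print defined $compiled ? "compiled\n" : "did not compile\n";
print "$value\n";
```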

Removal of backslashes before delimiters

During the second pass the text between the starting delimiter and the ending delimiter is copied to a safe location, and the \ is removed from combinations consisting of \ and delimiter(s) (both starting and ending delimiter if they differ).

The removal does not happen for multi-char delimiters.

Note that the combination \\ is left as it was!

Starting from this step no information about the delimiter(s) is used in the parsing.
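
The effect of this pass is easiest to see with q//, where almost nothing else happens to the text afterwards. A small sketch (the variable names are arbitrary):

```perl
use strict;
use warnings;

my $paren  = q(a\)b);     # backslash before the closing delimiter is removed
my $slash  = q/a\/b/;     # same for a non-matching delimiter
my $nested = q(a(b)c);    # balanced nested delimiters need no backslash
my $other  = q(a\nb);     # n is not a delimiter: both characters are kept

print "$paren $slash $nested $other\n";
```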

Interpolation

The next step is interpolation in the obtained delimiter-independent text. There are four different cases.

<<'EOF', m'', s''', tr///, y///

No interpolation is performed.

'', q//

The only interpolation is removal of \ from pairs \\.

"", ``, qq//, qx//, <file*glob>

\Q, \U, \u, \L, \l (possibly paired with \E) are converted to corresponding Perl constructs, thus "$foo\Qbaz$bar" is converted to

   $foo . (quotemeta("baz" . $bar));

Other combinations of \ with following chars are substituted with appropriate expansions.

Interpolated scalars and arrays are converted to join and . Perl constructs, thus "'@arr'" becomes

  "'" . (join $", @arr) . "'";

Since all three above steps are performed simultaneously left-to-right, there is no way to insert a literal $ or @ inside a \Q\E pair: it cannot be protected by \, since any \ (except in \E) is interpreted as a literal inside \Q\E, and any $ is interpreted as starting an interpolated scalar.

Note also that the interpolating code needs to make decision where the interpolated scalar ends, say, whether "a $b -> {c}" means

  "a " . $b . " -> {c}";

or

  "a " . $b -> {c};

Most of the time the decision is to take the longest possible text which does not include spaces between components and contains matching braces/brackets.
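
The conversions described for this case can be compared directly against their hand-written equivalents. The following is a sketch with made-up values; it only illustrates the rules stated above.

```perl
use strict;
use warnings;

my ($foo, $bar) = ('x.', 'y*');
my @arr = (1, 2, 3);

# \Q applies quotemeta to everything after it, interpolated or not.
my $quoted   = "$foo\Qbaz$bar";
my $expanded = $foo . quotemeta("baz" . $bar);

# Arrays are joined with $" (a single space by default).
my $listed = "'@arr'";
my $joined = "'" . join($", @arr) . "'";

# With no spaces, ->{c} is part of the interpolated expression.
my $ref   = { c => 'value' };
my $deref = "a $ref->{c}";

# With spaces around ->, only the scalar itself is interpolated.
my $name  = 'b';
my $plain = "a $name -> {c}";

print "$quoted\n$listed\n$deref\n$plain\n";
```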

?RE?, /RE/, m/RE/, s/RE/foo/

Processing of \Q, \U, \u, \L, \l and interpolation happens (almost) as with qq// constructs, but the substitution of \ followed by other chars is not performed! Moreover, inside (?{BLOCK}) no processing is performed at all.

Interpolation has several quirks: $|, $( and $) are not interpolated, and constructs $var[SOMETHING] are voted (by several different estimators) to be an array element or $var followed by a RE alternative. This is the place where the notation ${arr[$bar]} comes in handy: /${arr[0-9]}/ is interpreted as an array element -9, not as a regular expression from variable $arr followed by a digit, which is the interpretation of /$arr[0-9]/.
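
When the guess matters, braces can force either reading. A short sketch, assuming a scalar $arr and an array @arr that both exist (the names and values are made up):

```perl
use strict;
use warnings;

my $arr = 'x';
my @arr = ('a' .. 'j');    # ten elements, so $arr[-9] is 'b'

# Braces around the name only: $arr followed by the class [0-9].
my $class_match = 'x5' =~ /^${arr}[0-9]$/ ? 1 : 0;

# Braces around name and subscript: element 0-9, that is $arr[-9].
my $elem_match = 'b' =~ /^${arr[0-9]}$/ ? 1 : 0;

print "class: $class_match, element: $elem_match\n";
```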

Note that the absence of processing of \\ creates specific restrictions on the post-processed text: if the delimiter is /, one cannot get the combination \/ into the result of this step: / will finish the regular expression, \/ will be stripped to / on the previous step, and \\/ will be left as is. Since / is equivalent to \/ inside a regular expression, this does not matter unless the delimiter is a special character for the RE engine, as in s*foo*bar*, m[foo], or ?foo?.

This step is the last one for all the constructs except regular expressions, which are processed further.

Interpolation of regular expressions

All the previous steps are performed during the compilation of Perl code; this one happens at run time (though it may be optimized to be calculated at compile time if appropriate). After all the preprocessing performed above (and possibly after evaluation if catenation, joining, up/down-casing and quotemeta()ing are involved) the resulting string is passed to the RE engine for compilation.

Whatever happens in the RE engine is better discussed in the perlre manpage, but for the sake of continuity let us do it here.

This is the first step where the presence of the //x switch is relevant. The RE engine scans the string left-to-right, and converts it to a finite automaton.

Backslashed chars are either substituted by corresponding literal strings, or generate special nodes of the finite automaton. Characters which are special to the RE engine generate corresponding nodes. (?#...) comments are ignored. All the rest is either converted to literal strings to match, or is ignored (as is whitespace and #-style comments if //x is present).

Note that the parsing of the construct [...] is performed using completely different rules from the rest of the regular expression. Similarly, the (?{...}) is only checked for matching braces.
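
A small sketch of what the RE engine ignores, using qr{} with made-up patterns. Note that the #-comments must not contain the delimiter, for the reason given under "Finding the end".

```perl
use strict;
use warnings;

# Under /x, unescaped whitespace and #-comments are dropped by the
# RE engine; (?#...) comments are dropped with or without /x.
my $range = qr{
    ^ (\d+)      # first number
    \s* - \s*    # separator, optional spaces
    (\d+) $      # second number
}x;

my $inline = qr{^(\d+)(?#no x flag needed here)-(\d+)$};

my @parts = '10 - 20' =~ $range;
my @more  = '3-4'     =~ $inline;

# Inside [...] other rules apply: the space still counts under /x.
my $space_in_class = 'a b' =~ /a[ ]b/x ? 1 : 0;

print "@parts | @more | $space_in_class\n";
```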

Optimization of regular expressions

This step is listed for completeness only. Since it does not change semantics, details of this step are not documented and are subject to change.


I/O Operators

There are several I/O operators you should know about. A string enclosed by backticks (grave accents) first undergoes variable substitution just like a double quoted string. It is then interpreted as a command, and the output of that command is the value of the pseudo-literal, like in a shell. In scalar context, a single string consisting of all the output is returned. In list context, a list of values is returned, one for each line of output. (You can set $/ to use a different line terminator.) The command is executed each time the pseudo-literal is evaluated. The status value of the command is returned in $? (see the perlvar manpage for the interpretation of $?). Unlike in csh, no translation is done on the return data--newlines remain newlines. Unlike in any of the shells, single quotes do not hide variable names in the command from interpretation. To pass a $ through to the shell you need to hide it with a backslash. The generalized form of backticks is qx//. (Because backticks always undergo shell expansion as well, see the perlsec manpage for security concerns.)
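
The following sketch shows the two contexts and the status value. It assumes a POSIX-like system where sh and echo are available; the commands themselves are arbitrary.

```perl
use strict;
use warnings;

# Assumes a POSIX-like shell with echo available.
my $greeting = 'hello';
my $scalar = `echo $greeting`;        # Perl interpolates $greeting first
my @lines  = `echo one; echo two`;    # list context: one element per line
my $count  = @lines;

my $ignored = `sh -c 'exit 3'`;       # the exit status lands in $?
my $exit    = $? >> 8;

print $scalar, @lines, "exit status: $exit\n";
```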

Evaluating a filehandle in angle brackets yields the next line from that file (newline, if any, included), or undef at end of file. Ordinarily you must assign that value to a variable, but there is one situation where an automatic assignment happens. If and ONLY if the input symbol is the only thing inside the conditional of a while or for(;;) loop, the value is automatically assigned to the variable $_. In these loop constructs, the assigned value (whether assignment is automatic or explicit) is then tested to see if it is defined. The defined test avoids problems where a line has a string value that would be treated as false by Perl, e.g. "" or "0" with no trailing newline. (This may seem like an odd thing to you, but you'll use the construct in almost every Perl script you write.) Anyway, the following lines are equivalent to each other:

    while (defined($_ = <STDIN>)) { print; }
    while ($_ = <STDIN>) { print; }
    while (<STDIN>) { print; }
    for (;<STDIN>;) { print; }
    print while defined($_ = <STDIN>);
    print while ($_ = <STDIN>);
    print while <STDIN>;

and this also behaves similarly, but avoids the use of $_ :

    while (my $line = <STDIN>) { print $line }    

If you really mean such values to terminate the loop they should be tested for explicitly:

    while (($_ = <STDIN>) ne '0') { ... }
    while (<STDIN>) { last unless $_; ... }

In other boolean contexts, <filehandle> without explicit defined test or comparison will solicit a warning if -w is in effect.
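
To see the defined test at work without typing at a terminal, the sketch below reads from an in-memory filehandle standing in for STDIN. In-memory filehandles need Perl 5.8 or later, which is newer than the version documented here, and the data is made up.

```perl
use strict;
use warnings;

# The last "line" is the string "0" with no trailing newline.
my $data = "first\n0";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my @seen;
while (my $line = <$fh>) {    # implicit defined() test: "0" does not end the loop
    chomp $line;
    push @seen, $line;
}
close $fh;

# Explicit test, for when such a value really should end the loop.
open($fh, '<', \$data) or die "cannot reopen in-memory file: $!";
my @until_zero;
while (defined(my $line = <$fh>)) {
    chomp $line;
    last if $line eq '0';
    push @until_zero, $line;
}
close $fh;

print "seen: @seen\nuntil zero: @until_zero\n";
```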

The filehandles STDIN, STDOUT, and STDERR are predefined. (The filehandles stdin, stdout, and stderr will also work except in packages, where they would be interpreted as local identifiers rather than global.) Additional filehandles may be created with the open() function. See open() for details on this.

If a <FILEHANDLE> is used in a context that is looking for a list, a list consisting of all the input lines is returned, one line per list element. It's easy to make a LARGE data space this way, so use with care.

The null filehandle <> is special and can be used to emulate the behavior of sed and awk. Input from <> comes either from standard input, or from each file listed on the command line. Here's how it works: the first time <> is evaluated, the @ARGV array is checked, and if it is empty, $ARGV[0] is set to "-", which when opened gives you standard input. The @ARGV array is then processed as a list of filenames. The loop

    while (<>) {
        ...                     # code for each line
    }

is equivalent to the following Perl-like pseudo code:

    unshift(@ARGV, '-') unless @ARGV;
    while ($ARGV = shift) {
        open(ARGV, $ARGV);
        while (<ARGV>) {
            ...         # code for each line
        }
    }

except that it isn't so cumbersome to say, and will actually work. It really does shift array @ARGV and put the current filename into variable $ARGV. It also uses filehandle ARGV internally--<> is just a synonym for <ARGV>, which is magical. (The pseudo code above doesn't work because it treats <ARGV> as non-magical.)

You can modify @ARGV before the first <> as long as the array ends up containing the list of filenames you really want. Line numbers ($.) continue as if the input were one big happy file. (But see example under eof for how to reset line numbers on each file.)
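
A self-contained sketch of this behavior follows. It writes two temporary files with File::Temp (an assumption made so the example can run anywhere) and then lets <> read them; write_temp is a hypothetical helper, not part of Perl.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to a temporary file, return its name.
sub write_temp {
    my ($text) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} $text;
    close $fh or die "cannot close $name: $!";
    return $name;
}

my @names = (write_temp("a\nb\n"), write_temp("c\n"));
my (@seen, @files);
{
    local @ARGV = @names;          # set before the first <>
    while (my $line = <>) {
        chomp $line;
        push @seen, "$.:$line";    # $. keeps counting across files
        push @files, $ARGV;        # name of the file being read
    }
}

print "@seen\n";
```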

If you want to set @ARGV to your own list of files, go right ahead. This sets @ARGV to all plain text files if no @ARGV was given:

    @ARGV = grep { -f && -T } glob('*') unless @ARGV;

You can even set them to pipe commands. For example, this automatically filters compressed arguments through gzip:

    @ARGV = map { /\.(gz|Z)$/ ? "gzip -dc < $_ |" : $_ } @ARGV;

If you want to pass switches into your script, you can use one of the Getopt modules (Getopt::Std or Getopt::Long) or put a loop on the front like this:

    while ($_ = $ARGV[0], /^-/) {
        shift;
        last if /^--$/;
        if (/^-D(.*)/) { $debug = $1 }
        if (/^-v/)     { $verbose++  }
        # ...           # other switches
    }

    while (<>) {
        # ...           # code for each line
    }

The <> symbol will return undef for end-of-file only once. If you call it again after this it will assume you are processing another @ARGV list, and if you haven't set @ARGV, will input from STDIN.

If the string inside the angle brackets is a reference to a scalar variable (e.g., <$foo>), then that variable contains the name of the filehandle to input from, or its typeglob, or a reference to the same. For example:

    $fh = \*STDIN;
    $line = <$fh>;

If what's within the angle brackets is neither a filehandle nor a simple scalar variable containing a filehandle name, typeglob, or typeglob reference, it is interpreted as a filename pattern to be globbed, and either a list of filenames or the next filename in the list is returned, depending on context. This distinction is determined on syntactic grounds alone. That means <$x> is always a readline from an indirect handle, but <$hash{key}> is always a glob. That's because $x is a simple scalar variable, but $hash{key} is not--it's a hash element.
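
The syntactic distinction can be sketched as follows. The in-memory filehandle (Perl 5.8 or later) and the file name are made up for the example; the name contains no wildcard characters, so the glob hands it back unchanged.

```perl
use strict;
use warnings;

my $text = "line one\nline two\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";

# Simple scalar inside <>: a readline on the handle held in $fh.
my $first = <$fh>;
close $fh;

# Hash element inside <>: a glob on the pattern stored in the hash.
my %pattern = (key => 'no_wildcards_here.txt');    # made-up name
my @globbed = <$pattern{key}>;

print $first, "@globbed\n";
```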

One level of double-quote interpretation is done first, but you can't say <$foo> because that's an indirect filehandle as explained in the previous paragraph. (In older versions of Perl, programmers would insert curly brackets to force interpretation as a filename glob: <${foo}>. These days, it's considered cleaner to call the internal function directly as glob($foo), which is probably the right way to have done it in the first place.) Example:

    while (<*.c>) {
        chmod 0644, $_;
    }

is equivalent to

    open(FOO, "echo *.c | tr -s ' \t\r\f' '\\012\\012\\012\\012'|");
    while (<FOO>) {
        chop;
        chmod 0644, $_;
    }

In fact, it's currently implemented that way. (Which means it will not work on filenames with spaces in them unless you have csh(1) on your machine.) Of course, the shortest way to do the above is:

    chmod 0644, <*.c>;

Because globbing invokes a shell, it's often faster to call readdir() yourself and do your own grep() on the filenames. Furthermore, due to its current implementation of using a shell, the glob() routine may get "Arg list too long" errors (unless you've installed tcsh(1L) as /bin/csh).

A glob evaluates its (embedded) argument only when it is starting a new list. All values must be read before it will start over. In a list context this isn't important, because you automatically get them all anyway. In scalar context, however, the operator returns the next value each time it is called, or an undef value if you've just run out. As for filehandles, an automatic defined is generated when the glob occurs in the test part of a while or for, because legal glob returns (e.g. a file called 0) would otherwise terminate the loop. Again, undef is returned only once. So if you're expecting a single value from a glob, it is much better to say

    ($file) = <blurch*>;

than

    $file = <blurch*>;

because the latter will alternate between returning a filename and returning FALSE.
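
The alternation is easy to reproduce. In this sketch the pattern is a made-up name without wildcards, so each glob yields exactly one value:

```perl
use strict;
use warnings;

my @scalar_results;
for my $pass (1 .. 4) {
    # Scalar context: this one glob operator acts as an iterator.
    my $file = glob('made_up_name.txt');
    push @scalar_results, defined $file ? $file : 'undef';
}

my @list_results;
for my $pass (1 .. 4) {
    # List assignment: the whole list is produced on every pass.
    my ($file) = glob('made_up_name.txt');
    push @list_results, defined $file ? $file : 'undef';
}

print "scalar: @scalar_results\nlist:   @list_results\n";
```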

If you're trying to do variable interpolation, it's definitely better to use the glob() function, because the older notation can cause people to become confused with the indirect filehandle notation.

    @files = glob("$dir/*.[ch]");
    @files = glob($files[$i]);


Constant Folding

Like C, Perl does a certain amount of expression evaluation at compile time, whenever it determines that all arguments to an operator are static and have no side effects. In particular, string concatenation happens at compile time between literals that don't do variable substitution. Backslash interpretation also happens at compile time. You can say

    'Now is the time for all' . "\n" .
        'good men to come to.'

and this all reduces to one string internally. Likewise, if you say

    foreach $file (@filenames) {
        if (-s $file > 5 + 100 * 2**16) {  }
    }

the compiler will precompute the number that expression represents so that the interpreter won't have to.
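
One way to observe the folding is to ask B::Deparse, a module shipped with later Perl releases, to turn compiled code back into source text. This is only a sketch; the exact text Deparse prints can vary between versions.

```perl
use strict;
use warnings;
use B::Deparse;

# Deparse turns the compiled op tree back into source text, so a
# folded constant shows up as a single literal.
my $deparse = B::Deparse->new;
my $number  = $deparse->coderef2text(sub { return 5 + 100 * 2**16 });
my $string  = $deparse->coderef2text(
    sub { return 'Now is the time' . "\n" . 'for all' }
);

print $number, "\n", $string, "\n";
```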


Bitwise String Operators

Bitstrings of any size may be manipulated by the bitwise operators (~ | & ^).

If the operands to a binary bitwise op are strings of different sizes, the or and xor ops (| and ^) will act as if the shorter operand had additional zero bits on the right, while the and op (&) will act as if the longer operand were truncated to the length of the shorter.

    # ASCII-based examples 
    print "j p \n" ^ " a h";            # prints "JAPH\n"
    print "JA" | "  ph\n";              # prints "japh\n"
    print "japh\nJunk" & '_____';       # prints "JAPH\n"
    print 'p N$' ^ " E<H\n";            # prints "Perl\n"

If you are intending to manipulate bitstrings, you should be certain that you're supplying bitstrings: If an operand is a number, that will imply a numeric bitwise operation. You may explicitly show which type of operation you intend by using "" or 0+, as in the examples below.

    $foo =  150  |  105 ;       # yields 255  (0x96 | 0x69 is 0xFF)
    $foo = '150' |  105 ;       # yields 255
    $foo =  150  | '105';       # yields 255
    $foo = '150' | '105';       # yields string '155' (under ASCII)

    $baz = 0+$foo & 0+$bar;     # both ops explicitly numeric
    $biz = "$foo" ^ "$bar";     # both ops explicitly stringy


Integer Arithmetic

By default Perl assumes that it must do most of its arithmetic in floating point. But by saying

    use integer;

you may tell the compiler that it's okay to use integer operations from here to the end of the enclosing BLOCK. An inner BLOCK may countermand this by saying

    no integer;

which lasts until the end of that BLOCK.

The bitwise operators ("&", "|", "^", "~", "<<", and ">>") always produce integral results. (But see also Bitwise String Operators.) However, use integer still has meaning for them. By default, their results are interpreted as unsigned integers. However, if use integer is in effect, their results are interpreted as signed integers. For example, ~0 usually evaluates to a large integral value. However, use integer; ~0 is -1 on twos-complement machines.
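
A short sketch of the scoping and of the signed/unsigned difference (the variable names are arbitrary):

```perl
use strict;
use warnings;

my $float_div = 10 / 3;    # floating point by default
my $unsigned  = ~0;        # all bits set, read as unsigned

my ($int_div, $signed);
{
    use integer;           # in effect until the end of this block
    $int_div = 10 / 3;     # integer division
    $signed  = ~0;         # all bits set, read as signed
}

my $after = 10 / 3;        # floating point again outside the block

print "$float_div $int_div $unsigned $signed $after\n";
```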


Floating-point Arithmetic

While use integer provides integer-only arithmetic, there is no similar way to provide rounding or truncation at a certain number of decimal places. For rounding to a certain number of digits, sprintf() or printf() is usually the easiest route.
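
For instance, with arbitrary sample values:

```perl
use strict;
use warnings;

my $two_places   = sprintf('%.2f', 3.14159);    # two decimal places
my $three_places = sprintf('%.3f', 2.71828);    # three decimal places
my $whole        = sprintf('%.0f', 7.6);        # nearest whole number

print "$two_places $three_places $whole\n";
```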

Floating-point numbers are only approximations to what a mathematician would call real numbers. There are infinitely more reals than floats, so some corners must be cut. For example:

    printf "%.20g\n", 123456789123456789;
    #        produces 123456789123456784

Testing for exact floating-point equality or inequality is not a good idea. Here's a (relatively expensive) work-around to compare whether two floating-point numbers are equal to a particular number of decimal places. See Knuth, volume II, for a more robust treatment of this topic.

    sub fp_equal {
        my ($X, $Y, $POINTS) = @_;
        my ($tX, $tY);
        $tX = sprintf("%.${POINTS}g", $X);
        $tY = sprintf("%.${POINTS}g", $Y);
        return $tX eq $tY;
    }
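
A usage sketch follows; the routine is repeated so that the snippet runs on its own, and the sample values are made up. On machines with IEEE-754 doubles the direct comparison of 0.1 + 0.2 with 0.3 is typically false, while the comparison to ten significant digits succeeds.

```perl
use strict;
use warnings;

# Same routine as above, repeated so this snippet is self-contained.
sub fp_equal {
    my ($X, $Y, $POINTS) = @_;
    my $tX = sprintf("%.${POINTS}g", $X);
    my $tY = sprintf("%.${POINTS}g", $Y);
    return $tX eq $tY;
}

my $sum     = 0.1 + 0.2;
my $exact   = ($sum == 0.3)           ? 1 : 0;    # typically 0 with IEEE-754 doubles
my $rounded = fp_equal($sum, 0.3, 10) ? 1 : 0;    # both format as "0.3"
my $differs = fp_equal(1.0, 1.1, 10)  ? 1 : 0;    # "1" versus "1.1"

print "exact: $exact, rounded: $rounded, differs: $differs\n";
```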

The POSIX module (part of the standard perl distribution) implements ceil(), floor(), and a number of other mathematical and trigonometric functions. The Math::Complex module (part of the standard perl distribution) defines a number of mathematical functions that can also work on real numbers. Math::Complex is not as efficient as POSIX, but POSIX can't work with complex numbers.
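
A brief sketch of ceil() and floor() next to the builtin int(), with arbitrary inputs:

```perl
use strict;
use warnings;
use POSIX qw(ceil floor);

# ceil() rounds toward positive infinity, floor() toward negative infinity.
my @rounded   = (ceil(1.2), floor(1.8), ceil(-1.2), floor(-1.8));
my $truncated = int(-1.8);    # int() truncates toward zero

print "@rounded | $truncated\n";
```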

Rounding in financial applications can have serious implications, and the rounding method used should be specified precisely. In these cases, it probably pays not to trust whichever system rounding is being used by Perl, but to instead implement the rounding function you need yourself.


Bigger Numbers

The standard Math::BigInt and Math::BigFloat modules provide variable precision arithmetic and overloaded operators. At the cost of some space and considerable speed, they avoid the normal pitfalls associated with limited-precision representations.

    use Math::BigInt;
    $x = Math::BigInt->new('123456789123456789');
    print $x * $x;

    # prints +15241578780673678515622620750190521
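
The sketch below repeats the example alongside the native floating-point product and adds a Math::BigFloat comparison. Newer releases of Math::BigInt print positive numbers without the leading "+", so the sign is stripped before comparing; the variable names are arbitrary.

```perl
use strict;
use warnings;
use Math::BigInt;
use Math::BigFloat;

my $x      = Math::BigInt->new('123456789123456789');
my $square = $x * $x;
(my $digits = "$square") =~ s/^\+//;    # drop the "+" older releases print

# The same product in native arithmetic is rounded to a float.
my $native = 123456789123456789 * 123456789123456789;

# Decimal fractions are held exactly by Math::BigFloat.
my $sum      = Math::BigFloat->new('0.1') + Math::BigFloat->new('0.2');
my $is_exact = $sum == Math::BigFloat->new('0.3') ? 1 : 0;

print "$digits\n$native\n$sum\n";
```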

