
This meditation explains why writing pos($_) = pos($_); in a Perl program can actually make sense.

The following code tokenizes the AMSrefs format. It is an excerpt from actual code I wrote, modified slightly here to run standalone. (You don't need to know what AMSrefs is to understand this meditation, but it's a text format for describing bibliographic references in scientific publications, similar to BibTeX.)

use warnings; use strict;

# example input
$_ = q(
\bib{ref0}{article}{
        author={Y. Bartal},
        volume={37},
        pages={184},
        date={1996},
        issn={0272-5428},
}
);

# tokenizer rules
our @toktab = (
    [qr/\\bib(?![A-Za-z])/, "bib"],
    [qr/\\(?:[A-Za-z]+|.)/s, "text"],
    [qr/\%.*\n\s*/, "comment"],
    [qr/\=/, "equal"],
    [qr/\{/, "begin"],
    [qr/\}/, "end"],
    [qr/\s+/, "space"],
    [qr/[A-Za-z0-9_\-\.]+/, "word"],
    [qr/[^\\\%\=\{\}\sA-Za-z0-9_\-\.]/, "text"],
);

# tokens buffer
our(@tokfd);

# tokenize amsrefs input
TOK: while (1) {
    for my $tokrul (@toktab) {
        my($re, $id) = @$tokrul;
        if (/\G($re)/gc) {
            push @tokfd, [$id, $1];
            next TOK;
        }
    }
    pos($_) = pos($_); # <--- line 39
    if (/\G./sgc) { # <--- line 40
        die "internal error: amsref reader tokenizer cannot match input line: ($_) at" . pos($_);
    } elsif (/\G\z/gc) { # <--- line 42
        last;
    } else { # <--- line 44
        die "internal error: amsref reader tokenizer really cannot match input line: ($_) " . pos($_);
    }
}

# dump tokens for debugging
for my $t (@tokfd) {
    my($i, $c) = @$t; 
    $c =~ s/\n/\\n/g;
    printf qq(%-8s "%s"\n), $i, $c;
}

__END__

As you can see, line 39 has the expression pos($_) = pos($_). This doesn't seem to do anything useful, so I'll explain here why I added it to the code.

If you run this code, you see that it dumps the type and content of each token it finds. When writing the code, I tested it exactly this way: I ran it on some example input and made it print the tokens it got. I thought that if I made an error in one of the tokenizing rules, it would be easy to find by examining the output. If a rule matched too much, I would see tokens in the output where there's no such token; if it matched too little, I would see either other tokens matching that part of the text or, at worst, the fallback error on line 41 kicking in because no rule matched.

Indeed, I did make a mistake in one of the regexes in @toktab, one that let the regex accidentally match the empty string. I'm not sure what the exact mistake was, but let's assume that I wrote [qr/[A-Za-z0-9_\-\.]*/, "word"], instead of [qr/[A-Za-z0-9_\-\.]+/, "word"],. Per what I said above, one would think that this mistake is easy to recognize: we'd get lots of extra tokens of type word with the empty string as content. Indeed, if you change the regex this way in the code above, that is the result you get.

That's not the output I saw at the time, though. If you both introduce this mistake in the regex and remove line 39 from the above code, you get the following code; run it and you'll see what I got.

use warnings; use strict;

# example input
$_ = q(
\bib{ref0}{article}{
        author={Y. Bartal},
        volume={37},
        pages={184},
        date={1996},
        issn={0272-5428},
}
);

# tokenizer rules
our @toktab = (
    [qr/\\bib(?![A-Za-z])/, "bib"],
    [qr/\\(?:[A-Za-z]+|.)/s, "text"],
    [qr/\%.*\n\s*/, "comment"],
    [qr/\=/, "equal"],
    [qr/\{/, "begin"],
    [qr/\}/, "end"],
    [qr/\s+/, "space"],
    [qr/[A-Za-z0-9_\-\.]*/, "word"],
    [qr/[^\\\%\=\{\}\sA-Za-z0-9_\-\.]/, "text"],
);

# tokens buffer
our(@tokfd);

# tokenize amsrefs input
TOK: while (1) {
    for my $tokrul (@toktab) {
        my($re, $id) = @$tokrul;
        if (/\G($re)/gc) {
            push @tokfd, [$id, $1];
            next TOK;
        }
    }
    #pos($_) = pos($_); # <--- line 39
    if (/\G./sgc) { # <--- line 40
        die "internal error: amsref reader tokenizer cannot match input line: ($_) at" . pos($_);
    } elsif (/\G\z/gc) { # <--- line 42
        last;
    } else { # <--- line 44
        die "internal error: amsref reader tokenizer really cannot match input line: ($_) " . pos($_);
    }
}

# dump tokens for debugging
for my $t (@tokfd) {
    my($i, $c) = @$t; 
    $c =~ s/\n/\\n/g;
    printf qq(%-8s "%s"\n), $i, $c;
}

__END__

You get no tokens, only the error message from line 45. Now that's clearly impossible. No matter how badly I messed up the rules in @toktab, I thought, the code could never reach that line: if no rule matched, then either there were some characters left in the input, in which case the match on line 40 would succeed and you'd get the error from line 41; or there were no characters left, in which case line 42 would match and the loop would exit.

So at this point I ask you, dear reader: explain, if you can, how the code could ever reach line 45, even though I just proved that impossible.

The answer is the following. Once the buggy rule for word tokens matches the empty string at the end of the input, perl's rule against repeatedly matching the empty string kicks in, and the other regex in line 42 can't match the empty string at the end of input. If you don't know what this rule is, it's described in the section Repeated Patterns Matching a Zero-length Substring in perlre. (Without that rule, the buggy tokenizer wouldn't even reach the end of input; instead, it would repeatedly extract empty word tokens at the first possible place.)
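Here's a minimal sketch of that rule in isolation. The string and variable names are made up for this illustration and aren't part of the tokenizer; the two patterns merely stand in for the buggy word rule and the end-of-input check on line 42.

```perl
use strict; use warnings;

# Hypothetical two-character input, with the match position placed at its end.
my $s = "ab";
pos($s) = length $s;

# Stand-in for the buggy word rule: it can match nothing, so at the end
# of the input it succeeds with a zero-length match.
my $first = ($s =~ /\G[a-z]*/gc) ? "matched" : "failed";

# Stand-in for line 42: a different regex that also needs a zero-length
# match at the same position. The "just matched zero-length" state is
# attached to the string, not to the regex, so this one is refused.
my $second = ($s =~ /\G\z/gc) ? "matched" : "failed";

print "first:  $first\n";    # first:  matched
print "second: $second\n";   # second: failed
```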

That section of the manual also gives the solution to this problem.

The additional state of being matched with zero-length is associated with the matched string, and is reset by each assignment to pos().

That's why I added the statement pos() = pos(); to line 39 of the code. This way, even if one of the tokenizer rules is wrong and can match the empty string at the end of the input string, line 42 will still match the same empty string again, so line 45 can truly never be reached and we get informative output.

Note that line 39 is only executed after all the rules in @toktab have failed to match, so it won't cause the buggy rule to match the same empty string over and over forever. Once line 39 is executed, we'll leave the loop one way or another.
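To see the reset on its own, here's a small sketch under the same kind of made-up setup (a hypothetical string $s instead of $_, and stand-in patterns rather than the real rules). It tries the end-of-input check once before and once after the self-assignment.

```perl
use strict; use warnings;

# Hypothetical input; position starts at the end of the string.
my $s = "ab";
pos($s) = length $s;

# A zero-length match at the end of input, as the buggy word rule would make.
my $zero = ($s =~ /\G[a-z]*/gc) ? 1 : 0;

# The end-of-input check is refused while the zero-length state is set.
my $before = ($s =~ /\G\z/gc) ? 1 : 0;

# Assigning to pos() clears that state; the position itself is unchanged.
pos($s) = pos($s);

# Now the same check is allowed to match the empty string again.
my $after = ($s =~ /\G\z/gc) ? 1 : 0;

print "zero-length match: $zero, before reset: $before, after reset: $after\n";
# prints: zero-length match: 1, before reset: 0, after reset: 1
```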
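As a sanity check of that claim, here's a cut-down sketch of the loop with a deliberately buggy rule and the reset in place. The two-rule table and the input "a,b" are invented for this example and are much simpler than the real @toktab; the point is only that the loop terminates and the empty word tokens show up in the dump.

```perl
use strict; use warnings;

# Hypothetical miniature rule table; the word rule uses * by mistake.
my @rules = (
    [qr/[a-z]*/, "word"],   # buggy: can match the empty string
    [qr/[^a-z]/, "text"],
);

my $in = "a,b";             # made-up input
my @tok;

TOK: while (1) {
    for my $r (@rules) {
        my ($re, $id) = @$r;
        if ($in =~ /\G($re)/gc) {
            push @tok, [$id, $1];
            next TOK;
        }
    }
    # Only reached when every rule failed at the current position.
    pos($in) = pos($in);
    last if $in =~ /\G\z/gc;
    die "cannot tokenize at position " . pos($in);
}

# Dump the tokens; the empty word tokens make the bug visible.
printf qq(%-5s "%s"\n), @$_ for @tok;
```

With this input I'd expect five tokens: the words "a" and "b", the text ",", and two empty word tokens, one before the comma and one at the end of the input.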

In reply to The story of a strange line of code: pos($_) = pos($_); by ambrus
