PerlMonks  

The story of a strange line of code: pos($_) = pos($_);

by ambrus (Abbot)
on Mar 09, 2010 at 23:37 UTC ( [id://827649]=perlmeditation )

This meditation answers the question of why writing pos($_) = pos($_); in a Perl program could ever make sense.

The following code tokenizes the AMSrefs format. It is an excerpt from actual code I wrote, modified slightly here to run standalone. (You don't need to know what AMSrefs is to understand this meditation, but it's a text format for describing bibliographic references in scientific publications, similar to BibTeX.)

use warnings; use strict;

# example input
$_ = q(
\bib{ref0}{article}{
    author={Y. Bartal},
    volume={37},
    pages={184},
    date={1996},
    issn={0272-5428},
}
);

# tokenizer rules
our @toktab = (
    [qr/\\bib(?![A-Za-z])/, "bib"],
    [qr/\\(?:[A-Za-z]+|.)/s, "text"],
    [qr/\%.*\n\s*/, "comment"],
    [qr/\=/, "equal"],
    [qr/\{/, "begin"],
    [qr/\}/, "end"],
    [qr/\s+/, "space"],
    [qr/[A-Za-z0-9_\-\.]+/, "word"],
    [qr/[^\\\%\=\{\}\sA-Za-z0-9_\-\.]/, "text"],
);

# tokens buffer
our(@tokfd);

# tokenize amsrefs input
TOK: while (1) {
    for my $tokrul (@toktab) {
        my($re, $id) = @$tokrul;
        if (/\G($re)/gc) {
            push @tokfd, [$id, $1];
            next TOK;
        }
    }
    pos($_) = pos($_); # <--- line 39
    if (/\G./sgc) { # <--- line 40
        die "internal error: amsref reader tokenizer cannot match input line: ($_) at" . pos($_);
    } elsif (/\G\z/gc) { # <--- line 42
        last;
    } else { # <--- line 44
        die "internal error: amsref reader tokenizer really cannot match input line: ($_) " . pos($_);
    }
}

# dump tokens for debugging
for my $t (@tokfd) {
    my($i, $c) = @$t;
    $c =~ s/\n/\\n/g;
    printf qq(%-8s "%s"\n), $i, $c;
}

__END__

As you can see, line 39 has the expression pos($_) = pos($_). This doesn't seem to do anything useful, so I'll explain here why I added it to the code.

If you run this code, you'll see that it dumps the type and content of each token it finds. While writing the code, I tested it exactly this way: I ran it on some example input and made it print the tokens it found. I assumed that if I made an error in one of the tokenizing rules, it would be easy to spot by examining the output. If a rule matched too much, I would see tokens in the output where there should be none; if it matched too little, I would see either other tokens matching that part of the text or, at worst, the fallback error on line 41 kicking in because no rule matched.

Sure enough, I made a mistake in one of the regexes in @toktab, one that let the regex accidentally match the empty string. I don't remember the exact mistake, but let's assume that I wrote [qr/[A-Za-z0-9_\-\.]*/, "word"], instead of [qr/[A-Za-z0-9_\-\.]+/, "word"],. Given what I said above, one would think that this mistake is easy to recognize: we'd get lots of extra tokens of type word with the empty string as content. Indeed, if you change the regex this way in the code above, that is the result you get.

That's not the output I saw at the time, though. If you both introduce this mistake in the regex and remove line 39 from the code above, you get the following code; run it and you'll see what I got.

use warnings; use strict;

# example input
$_ = q(
\bib{ref0}{article}{
    author={Y. Bartal},
    volume={37},
    pages={184},
    date={1996},
    issn={0272-5428},
}
);

# tokenizer rules
our @toktab = (
    [qr/\\bib(?![A-Za-z])/, "bib"],
    [qr/\\(?:[A-Za-z]+|.)/s, "text"],
    [qr/\%.*\n\s*/, "comment"],
    [qr/\=/, "equal"],
    [qr/\{/, "begin"],
    [qr/\}/, "end"],
    [qr/\s+/, "space"],
    [qr/[A-Za-z0-9_\-\.]*/, "word"],
    [qr/[^\\\%\=\{\}\sA-Za-z0-9_\-\.]/, "text"],
);

# tokens buffer
our(@tokfd);

# tokenize amsrefs input
TOK: while (1) {
    for my $tokrul (@toktab) {
        my($re, $id) = @$tokrul;
        if (/\G($re)/gc) {
            push @tokfd, [$id, $1];
            next TOK;
        }
    }
    #pos($_) = pos($_); # <--- line 39
    if (/\G./sgc) { # <--- line 40
        die "internal error: amsref reader tokenizer cannot match input line: ($_) at" . pos($_);
    } elsif (/\G\z/gc) { # <--- line 42
        last;
    } else { # <--- line 44
        die "internal error: amsref reader tokenizer really cannot match input line: ($_) " . pos($_);
    }
}

# dump tokens for debugging
for my $t (@tokfd) {
    my($i, $c) = @$t;
    $c =~ s/\n/\\n/g;
    printf qq(%-8s "%s"\n), $i, $c;
}

__END__

You get no tokens, only the error message from line 45. Now that's clearly impossible. No matter how I messed up the rules in @toktab, I thought, the code could never reach that line: if no rule matched, then either there were some characters left in the input, in which case the match on line 40 would succeed and you'd get the error from line 41, or there were no characters left, in which case line 42 would match and the loop would exit.

So at this point I ask you, dear reader: explain, if you can, how the code could ever reach line 45, even though I have just proved that to be impossible.

The answer is the following. Once the buggy rule for word tokens matches the empty string at the end of the input, Perl's rule against repeatedly matching the empty string kicks in, and the other regex on line 42 can't match the empty string at the end of the input. If you don't know what this rule is, it's described in the section Repeated Patterns Matching a Zero-length Substring in perlre. (Without that rule, the buggy tokenizer wouldn't even reach the end of the input; instead, it would repeatedly extract empty word tokens at the first possible place.)
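The effect can be reproduced in isolation. The following is only a minimal sketch, not part of the original tokenizer: the input string "ab" and the helper name match_twice_at_end are made up for illustration. It performs an empty match at the end of a string and then tries the end-of-input check at the same position.

```perl
use strict; use warnings;

# Hypothetical helper: returns [first, second], where each entry is
# 1 if the corresponding match succeeded and 0 otherwise.
sub match_twice_at_end {
    my $s = "ab";                    # made-up input
    pos($s) = length $s;             # start matching at the end of the string
    # Stand-in for the buggy "word" rule: matches the empty string here.
    my $first  = ($s =~ /\G[a-z]*/gc) ? 1 : 0;
    # Same check as line 42 of the tokenizer, at the same position.
    my $second = ($s =~ /\G\z/gc) ? 1 : 0;
    return [$first, $second];
}

my ($first, $second) = @{ match_twice_at_end() };
print "first=$first second=$second\n";   # prints "first=1 second=0"
```

The second match fails even though the position really is the end of the string, because it would be a second zero-length match at the same position.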

That section from the manual also tells the solution to this problem.

The additional state of being matched with zero-length is associated with the matched string, and is reset by each assignment to pos().

That's why I added the statement pos($_) = pos($_); on line 39 of the code. This way, even if one of the tokenizer rules is wrong and can match the empty string at the end of the input string, line 42 will still match the same empty string again, so line 45 can truly never be reached and we get informative output.

Note that line 39 is executed only after all the rules in @toktab have failed to match, so it won't cause the buggy rule to match the same empty string infinitely many times. Once line 39 is executed, we leave the loop one way or another.
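As a small illustration of the reset (a sketch under the same assumptions as any toy example: the string "ab" and the function name end_check_after_empty_match are hypothetical and not from the tokenizer), the end-of-input check can be tried with and without the assignment to pos:

```perl
use strict; use warnings;

# Hypothetical helper: does /\G\z/gc succeed after an empty match at
# the end of the string? $reset decides whether pos is reassigned first.
sub end_check_after_empty_match {
    my ($reset) = @_;
    my $s = "ab";                          # made-up input
    pos($s) = length $s;                   # position at end of string
    my $empty = ($s =~ /\G[a-z]*/gc);      # empty match sets the zero-length state
    pos($s) = pos($s) if $reset;           # assigning to pos() clears that state
    return ($s =~ /\G\z/gc) ? 1 : 0;
}

print "without reset: ", end_check_after_empty_match(0), "\n";  # prints 0
print "with reset:    ", end_check_after_empty_match(1), "\n";  # prints 1
```

The position itself is unchanged by the assignment; only the "last match was zero-length" state attached to the string is cleared.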
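To see why the placement matters, here is a hedged sketch with a made-up two-rule tokenizer (the rules, the input ",," and the name count_tokens are assumptions for illustration only). It counts tokens when pos is reassigned before every pass over the rules, compared with never reassigning it while rules are still being tried. A cap of 100 passes stands in for what would otherwise be an endless loop.

```perl
use strict; use warnings;

# Hypothetical tokenizer with a buggy first rule that can match empty.
sub count_tokens {
    my ($reset_each_pass) = @_;
    my $s = ",,";                              # made-up input
    my $tokens = 0;
    PASS: for my $pass (1 .. 100) {            # cap instead of while (1)
        pos($s) = pos($s) if $reset_each_pass; # the wrong place for the reset
        for my $re (qr/[a-z]*/, qr/,/) {       # buggy "word" rule, then "text"
            if ($s =~ /\G$re/gc) {
                $tokens++;
                next PASS;
            }
        }
        last;                                  # no rule matched: stop
    }
    return $tokens;
}

print count_tokens(0), "\n";   # prints 5: three empty words and two commas
print count_tokens(1), "\n";   # prints 100: the empty match repeats until the cap
```

Without the reset inside the loop, the zero-length rule forces progress and the loop ends on its own; resetting on every pass would let the buggy rule match the same empty string again and again, which is exactly what line 39's position after the rule loop avoids.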

Replies are listed 'Best First'.
Re: The story of a strange line of code: pos($_) = pos($_);
by rubasov (Friar) on Mar 10, 2010 at 02:06 UTC
    For the sake of TIMTOWTDI I've tried to rewrite your code a little, by moving much of your explicit looping logic into the regex, letting the regex engine do the dirty work. Here it is:
    use strict;
    use warnings;

    $_ = q(
    \bib{ref0}{article}{
        author={Y. Bartal},
        volume={37},
        pages={184},
        date={1996},
        issn={0272-5428},
    }
    );

    my @tokfd;

    my $tokre = qr{
          (?<bib>     \\bib(?![A-Za-z]) )
        | (?<text>    (?s: \\(?:[A-Za-z]+|.) ) )
        | (?<comment> \%.*\n\s* )
        | (?<equal>   \= )
        | (?<begin>   \{ )
        | (?<end>     \} )
        | (?<space>   \s+ )
        | (?<word>    [A-Za-z0-9_\-\.]+ )
        | (?<text>    [^\\\%\=\{\}\sA-Za-z0-9_\-\.] )
    }x;

    push @tokfd, [ keys %+, values %+ ] while /\G$tokre/gc;

    die "internal error: amsref reader tokenizer cannot match input line: ($_) at" . pos($_)
        if ( $+[0] != length );

    for my $t (@tokfd) {
        my ( $i, $c ) = @$t;
        $c =~ s/\n/\\n/g;
        printf qq(%-8s "%s"\n), $i, $c;
    }
    I've used regex branches instead of your for loop, and moved the matching into the while condition to eliminate the explicit loop control and to avoid the repeated zero-length matches. I've replaced the AoA with named captures.

    As far as I can tell it produces the same output as yours, but I think it's a little more concise. It is also easy to see in the output when you accidentally make a branch matching the null string.

    I hope it is to your liking.
Re: The story of a strange line of code: pos($_) = pos($_);
by ikegami (Patriarch) on Mar 10, 2010 at 01:53 UTC

    let's assume that I wrote [qr/[A-Za-z0-9_\-\.]*/, "word"], instead of [qr/[A-Za-z0-9_\-\.]+/, "word"]

    Couldn't that give you an infinite loop? (Your particular set of rules and code layout is immune, but that's not always going to be the case.) If so, resetting pos() is making things worse. An error you used to catch (with a misleading error message) now fails badly.

    While it's an interesting tidbit, the need for the construct in a tokeniser is a strong indicator of an error elsewhere (as is the case here).

      No, it can't give an infinite loop, exactly because of the empty match rule.

      As I noted in the very last paragraph of the post, resetting pos is not making this worse, because we don't run the main regexes from @toktab again after we reset pos.

Re: The story of a strange line of code: pos($_) = pos($_);
by LanX (Saint) on Mar 10, 2010 at 12:31 UTC
    Hi ambrus

    Yes, combining \G and pos() is buggy. See also [bugs?] perldoc perlre, \G and pos() ...¹

    Actually I wanted to send a bug report ... :-(

    Cheers Rolf

    UPDATEs: ¹) pos()=pos() is a workaround I used, indicating an implementation problem!

      His problem has nothing to do with pos. pos was his solution.
        and pos($_) = pos($_); was my solution when dealing with \G.

        But I have to admit that I'm not understanding everything ambrus is talking about.

        Cheers Rolf

        UPDATE: corrected typo! (thx ikegami)
