I learned some kinky things this morning while chatting with an acquaintance who does research in code generation for a local, well-known chip manufacturer. He and his colleagues have been struggling with the collision between traditional ideas of code optimization and the reality of modern, heavily pipelined processors. They've been trying to characterize optimizations, but have been stymied by seemingly bizarre behavior in modern chips. They've isolated one situation where a single loop can exhibit one of four distinct performance profiles depending on the sequence of instructions executed before the loop is entered. The way the processor pipeline is filled on the way into the loop sets up one of four stable states, each of which performs differently. They've found another case where adding an instruction to a loop improves performance by 20%.
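I don't have their test case, but a sketch of the shape of such an experiment might look like the C below: time the identical loop while varying only the handful of instructions executed on the way in. Everything here (the nop lead-ins, the loop body, the iteration count) is my own invention for illustration, not their code; whether it exhibits distinct stable states depends entirely on the chip, the compiler, and the phase of the moon.

```c
/* A minimal sketch, assuming a POSIX system and a GCC-compatible
 * compiler (for the inline "nop" asm). The loop under measurement is
 * identical on every call; only the instruction sequence executed
 * just before entering it differs. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000L

static double time_loop(int lead_in_nops)
{
    struct timespec start, end;
    volatile long sum = 0;    /* volatile keeps the loop from being optimized away */

    /* Vary the instruction stream executed on the way into the loop
     * (deliberate switch fallthrough: 0 to 3 extra nops). */
    switch (lead_in_nops) {
    case 3: __asm__ volatile ("nop");
    case 2: __asm__ volatile ("nop");
    case 1: __asm__ volatile ("nop");
    default: break;
    }

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < ITERATIONS; i++)
        sum += i & 1;         /* the loop under measurement */
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec)
         + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    for (int nops = 0; nops < 4; nops++)
        printf("%d lead-in nop(s): %.3f s\n", nops, time_loop(nops));
    return 0;
}
```

Note that the loop sits at the same address on every call; only the executed entry path changes, which is exactly the kind of difference that can leave the pipeline and branch predictor in different states.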
What does this mean? It suggests that once we've optimized far enough, once we've dealt with the high-level issues and are down to micro-optimizations, we're increasingly likely to encounter strange and perhaps counter-intuitive performance differences between nearly identical sequences of code. This might not be as much of an issue with Perl, at least until Perl6, but the possibility of bizarre low-level performance behavior is still there, as is the possibility that code that performs one way in a test harness will perform differently in the wild.