I've done a fair bit of assembly programming and am perfectly aware of issues of cache locality, but that issue alone doesn't make his example any more relevant.
There's also the opposite case where the inner loops don't fit into L1, which turns out an order of magnitude more costly than fetching data from the slower memory tiers. It's harder to hit that limit of course, but as most CPUs have much smaller caches for code than data it's a real possibility nevertheless, considering the appeal is that we should routinely trade speed for data size.
It's rarely quite as easy as "stop trading memory for speed!" or "lookup tables make things faster". Top performance, if at all needed in the first place, cannot be achieved any other way than by benchmarking using a low-level profiler to find out exactly where the bottlenecks are. And several bottlenecks usually interdepend so you often have to be careful not to make an "improvement" that actually worsens the overall situation.
Makeshifts last the longest.