Fascinating work, marioroy!
What I really like is that your code is mostly standard C++ that would appear
to just work on almost any modern hardware. Is that right?
I remember desperately grovelling around with all sorts of system-specific hacks in The 10**21 Problem series --
such as prefetching, TLB tuning, and Intel intrinsics
(e.g. search for _mm_ in The 10**21 Problem (Part 4)) to get the performance I needed.
So it seems like a dream to write standard C++ that automatically performs well on all modern hardware.
For example, with NVIDIA's nvc++ compiler, would your C++ code automatically scale when run on a beast
GPGPU machine with, say, six high-end NVIDIA graphics cards?