You're missing some rather fundamental points here. On some CPU architectures, an access to memory that isn't in L1 or L2 cache can stall for up to 200 cycles, and with large data sets that becomes a significant overhead. Significant enough that in some cases it's actually faster to use a more compact representation in memory (and thus get a higher density of real data in cache), plus extra code to decode that data, than it is to run simple code over relatively loosely packed data.
For example, on most RISC architectures it's much more efficient to operate on 32-bit or 64-bit data elements than on 8-bit ones, because a byte access generally costs you a mask and shift operation, done either internally by the hardware or, often, externally in extra instructions. But if using only 8-bit data means a larger portion of your data set fits in cache, with fewer spills to main memory (or even to L2 cache), then the extra cycles spent decoding your data come to fewer than the cycles you would otherwise spend stalled waiting on the memory bus.
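To make the trade-off concrete, here is a rough sketch in C (my choice of language, not something from the thread). It assumes a 64-byte cache line, which is common but by no means universal, and the names lines_touched, get_packed and sum_packed are made up for illustration. It shows the explicit shift and mask you pay per element when four 8-bit values are packed into each 32-bit word, and how many cache lines a sequential scan has to pull in for each layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; check your own CPU, this is not universal. */
#define CACHE_LINE_BYTES 64

/* Cache lines a sequential scan of n elements must bring in
 * (assuming the array starts on a cache line boundary). */
static size_t lines_touched(size_t n, size_t elem_size)
{
    return (n * elem_size + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Fetch the i-th 8-bit value from an array of 32-bit words that each
 * hold four values, lowest byte first. This is the "external" shift
 * and mask: extra instructions, but a quarter of the memory traffic. */
static uint32_t get_packed(const uint32_t *words, size_t i)
{
    assert(words != NULL);
    return (words[i >> 2] >> ((i & 3) * 8)) & 0xFFu;
}

/* Sum n packed 8-bit values, decoding each one on the fly. */
static uint64_t sum_packed(const uint32_t *words, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += get_packed(words, i);
    return total;
}
```

Under that 64-byte assumption, a million values stored one per 32-bit word span 62500 cache lines, while the same million values packed as bytes span 15625. Whether the saved misses actually outweigh the extra shift and mask per element depends on your CPU and your access pattern, so measure before you commit to either layout.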