Some further analysis.
Consider the last hashing step. Xor with q8 was a ±127 adjustment at most. Multiplied by hp=1000003, the difference is less than 2**27. It is therefore unlikely that a change in q8 would carry into sign of d8. Now, hp % 1001 == 4, which is like a shift. In a sense, all the magic bits have a weight of a power of two.
> perl -e 'print join q( ), map 1000003*(1<<$_) % 1001 % 4, (0..7)'
0 0 0 0 0 0 0 0
What this means is q8 can be stepped without altering 2 lowest result bits, provided d8 does not change sign and
d >= 4*step && d < 1001-4*step.
// C D M values are all 0 (modulo 4).
int step_cdm[1001] = { 0,0,0,0, 0,1,1,1, 0,2,2,2, 0,3, ... , 0,2,2,2,
+0,1,1,1, 0,0,0,0, 0 };
int step_cdm2[2002] = {...}; // (above)x2
#define step_ok(v) (2U* ((v) & ~127)*hp < -2U* 127*hp) // *2 to disc
+ard sign
...
step = step_cdm2[d9 % 1001 + 1001];
if (step && step_ok(d7)) { q8 += step - 1; continue; } // loop q8
if (d != 500) continue;
...
Note that if this heuristic is enabled, it may leave no time for prefetching. Try without lookups. Perhaps some other modulus might have better karma, a modulus that reduces hp to odd value (e.g. 1000003 % 1039 == 485)?
Update: Actually, scratch that. I forgot q8 selects q9, which of course toggles the low bits. There might be a way to set further restrictions on allowed q8's, but this is getting pretty convoluted.