In your diagram above, draw two vertical pencil-lines, 64 bits apart, with at least 64 bits to the left of the leftmost line. (In other words, this is the second quadword in the value being sought.)
You can now see, within those two lines, all 64 of the possible quadword values, one for each bit-alignment. If that ≥128-bit value occurs in the buffer at all, then one of those 64 quadwords must occur in the buffer aligned on a quadword boundary: every occurrence of the needle necessarily contains one of them, although you don’t know in advance which one. The mere occurrence of one of those values does not guarantee success, but the odds of a chance collision are only about 64 in 2^64.
Therefore, send the microprocessor on 64 blisteringly-fast searches through the entire buffer, looking in turn for each of those values. The first pass touches every page, so the remaining 63 passes run entirely from resident memory and never take a page fault.
Then, search byte-wise for a pre-shifted version of what the value would be at that alignment, again using fast instructions.
Only then must you test the first and last quadwords of the value, with the appropriate shifting and masking, and you do this at a point when success is already all but assured.
If you are going for “raw speed” on a modern microprocessor, you want to give it work that it can pipeline. A mispredicted conditional branch of any sort flushes the pipeline. Even though my strategy makes the microprocessor blaze through the entire buffer 64 times in a row, (a) we know that there is enough RAM to do that without faulting, (b) each time a value is found, success is already likely, and (c) the pipeline stays full. You spend the “pedantic bit-twiddling” only to confirm what a much-faster algorithm is already almost-certain of.