If you're working with 3-byte groups, that sounds suspiciously like RGB data to me, and even if it isn't, there is a crazy yet (I believe) highly efficient way of manipulating such data if you have the hardware (and software) to do it:
Using OpenGL and a chroma-key filter, you can "draw" the two strings over each other, let the massively parallel graphics hardware combine them, and then read the resulting "image" back. It's not certain, though, that the parallelism you gain by offloading the work to the GPU outweighs the cost of transferring the data over the bus and the result back again, so it is worth benchmarking against the C versions you already have. See http://www.gpgpu.org/ for more information on the concept.
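To make the idea concrete, here is a minimal CPU-side sketch in C of what a chroma-key overlay computes on 3-byte groups. It is an illustration under assumptions, not the GPU implementation itself: the function name `chroma_key_overlay`, the parameter names, and the rule "a group in the top string equal to the key colour is transparent" are all hypothetical, since the exact merge rule depends on your data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: overlays `top` onto `bottom` in 3-byte groups.
 * Assumption: any group in `top` that equals `key` counts as transparent,
 * so the corresponding group from `bottom` shows through. */
void chroma_key_overlay(const unsigned char *bottom,
                        const unsigned char *top,
                        const unsigned char key[3],
                        unsigned char *out,
                        size_t n_bytes)
{
    assert(n_bytes % 3 == 0);  /* whole 3-byte groups only */

    for (size_t i = 0; i < n_bytes; i += 3) {
        /* Pick the source group: bottom where top matches the key colour. */
        const unsigned char *src =
            (memcmp(top + i, key, 3) == 0) ? bottom + i : top + i;
        memcpy(out + i, src, 3);
    }
}
```

Each iteration is independent of the others, which is what makes the operation a candidate for the GPU, where the same per-group test would run for many "pixels" at once. A loop like this is also a reasonable baseline to time the OpenGL round trip against.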