It's more complicated than you need, but there is some inline assembly in Math::Prime::Util. See mulmod.h and montmath.h. It's basically just C code, with the assembly wrapped in preprocessor conditionals for the different platforms.
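To give an idea of the pattern, here is a minimal sketch of a modular multiply with an x86-64 assembly path and plain C fallbacks. This is not copied from mulmod.h: the name mulmod64 is made up for illustration, and the guards assume gcc or clang on x86-64 for the asm path and a compiler providing unsigned __int128 for the second path.

```c
#include <stdint.h>

/* Hypothetical helper: returns (a * b) mod n for n > 0.
   The preprocessor picks the fastest path the platform supports. */
#if defined(__GNUC__) && defined(__x86_64__)

/* gcc and clang on x86-64: mulq leaves the 128-bit product in rdx:rax,
   and divq divides rdx:rax by n, leaving the remainder in rdx. */
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t quot, rem;
    a %= n;   /* with a < n the quotient fits in 64 bits, so divq cannot fault */
    b %= n;
    __asm__ ("mulq %3\n\t"
             "divq %4"
             : "=a"(quot), "=&d"(rem)
             : "a"(a), "r"(b), "r"(n)
             : "cc");
    (void)quot;   /* the quotient is not needed */
    return rem;
}

#elif defined(__SIZEOF_INT128__)

/* No asm, but the compiler has a 128-bit integer type. */
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t)(((unsigned __int128)a * b) % n);
}

#else

/* Portable fallback: double-and-add, never overflowing 64 bits. */
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t r = 0;
    a %= n;
    b %= n;
    while (b) {
        if (b & 1)
            r = (r >= n - a) ? r - (n - a) : r + a;   /* r = (r + a) mod n */
        b >>= 1;
        if (b)
            a = (a >= n - a) ? a - (n - a) : a + a;   /* a = (2 * a) mod n */
    }
    return r;
}

#endif
```

All three paths should give the same answers, which is worth checking on every platform you build on, since only one of them gets compiled on any given machine.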
You can use Devel::CheckLib or similar packages for build time configuration as well.
For testing, I recommend a Linux / Unix x86 machine that matches your target, an x86 machine that doesn't (e.g. 32-bit x86), a Mac (especially since Macs use clang, which behaves slightly differently from gcc, including different preprocessor defines), and Windows. If you're ambitious, add Cygwin and a non-x86 machine.
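As a rough illustration of why those machines behave differently, here is a small sketch that reports what the preprocessor sees on each one. The function names are made up for this example; the macros are the usual predefined ones. Note that clang also defines __GNUC__, so __clang__ has to be checked first.

```c
#include <limits.h>

/* Compiler family as seen by the preprocessor. clang defines __GNUC__
   for gcc compatibility, so test __clang__ before __GNUC__. */
static const char *compiler_family(void)
{
#if defined(__clang__)
    return "clang";
#elif defined(__GNUC__)
    return "gcc or compatible";
#elif defined(_MSC_VER)
    return "msvc";
#else
    return "unknown";
#endif
}

/* Target architecture: 64-bit x86, 32-bit x86, or something else. */
static const char *target_arch(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    return "x86-64";
#elif defined(__i386__) || defined(_M_IX86)
    return "x86-32";
#else
    return "non-x86";
#endif
}

/* Target OS. Cygwin is checked before _WIN32 so it is reported separately. */
static const char *target_os(void)
{
#if defined(__CYGWIN__)
    return "cygwin";
#elif defined(_WIN32)
    return "windows";
#elif defined(__APPLE__)
    return "mac";
#elif defined(__linux__)
    return "linux";
#else
    return "other";
#endif
}

/* Pointer width in bits, to spot a 32-bit build on a 64-bit machine. */
static int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}
```

Printing these four values at the start of a test run makes it obvious which code path a given smoke tester actually exercised.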