if we let 56 processes run in parallel, the overall process takes much longer to execute than when we have 8 to 16 processes, and we strongly believe that this is due to context switches and memory usage.
Having to swap pages out and back in surely can make things run much, much longer. 56 processes talking to databases isn't really going to add enough overhead from context switching to cause the total run-time to be "much longer", IME. To get context switching to be even a moderate source of overhead, you have to force it to happen extra often by passing tiny messages back and forth way more frequently than your OS's time slice.
There are also other resources that can have a significant impact on performance if you try to over-parallelize. Different forms of caching (file system, database, L1, L2) can become much less effective if you have too many simultaneously competing uses for them (having to swap out/in pages of VM is just a specific instance of this general problem).
To a lesser extent, you can also lose performance for disk I/O by having reads be done in a less-sequential pattern. But I bet that is mostly "in the noise" compared to reduction in cache efficiency leading to things having to be read more than once each.
Poor lock granularity can also lead to greatly reduced performance when you over-parallelize, but I doubt that applies in your situation.