I would suggest using a combination of Parallel::ForkManager and a queue.
Use the fork manager to spawn a fixed number of worker processes, then let each of them consume work requests from a queue until it is empty. Or, if the requirement is simpler, let each worker decrement a shared counter and keep working until the count reaches zero.
In general, the number of units of work to be performed and the number of parallel workers assigned to perform them are, and should always remain, separate. The number of workers controls what IBM used to call the multiprogramming level of the system, and that “knob” should be independent of the workload volume.
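As a rough illustration of the simpler shared-counter variant, here is a minimal sketch. It rests on a few assumptions of mine: forked processes do not share Perl variables, so the "shared variable" is kept in a temporary file guarded by flock(); the names run_pool and claim_unit are hypothetical helpers, not part of Parallel::ForkManager; and the actual work is left as a placeholder comment. Note that the worker count and the unit count are two independent arguments.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_SET);
use File::Temp qw(tempfile);
use Parallel::ForkManager;

# Hypothetical helper: claim one unit of work by decrementing the counter
# stored in $file. Returns the value seen before the decrement, or 0 when
# no work is left. flock() keeps the read-modify-write atomic across processes.
sub claim_unit {
    my ($file) = @_;
    open my $fh, '+<', $file or die "open $file: $!";
    flock($fh, LOCK_EX) or die "flock $file: $!";
    my $remaining = <$fh>;
    $remaining = 0 unless defined $remaining;
    chomp $remaining;
    if ($remaining > 0) {
        seek($fh, 0, SEEK_SET) or die "seek $file: $!";
        truncate($fh, 0) or die "truncate $file: $!";
        print {$fh} $remaining - 1;
    }
    close $fh or die "close $file: $!";    # flushes, then releases the lock
    return $remaining;
}

# Hypothetical helper: run $workers processes that together perform $units
# units of work. Returns the total number of units the workers report done.
sub run_pool {
    my ($workers, $units) = @_;

    # The shared counter lives in a temp file (an assumption of this sketch).
    my ($tmp, $file) = tempfile();
    print {$tmp} $units;
    close $tmp or die "close $file: $!";

    my $done = 0;
    my $pm = Parallel::ForkManager->new($workers);

    # Runs in the parent each time a child exits; $data is the reference
    # the child passed to finish().
    $pm->run_on_finish(sub {
        my ($pid, $exit, $ident, $signal, $core, $data) = @_;
        $done += $$data if defined $data;
    });

    for my $id (1 .. $workers) {
        $pm->start and next;    # parent moves on; child falls through
        my $count = 0;
        while (claim_unit($file) > 0) {
            # ... the real work for one unit would go here ...
            $count++;
        }
        $pm->finish(0, \$count);    # exit code 0, report units done
    }
    $pm->wait_all_children;
    unlink $file;
    return $done;
}

printf "units completed: %d\n", run_pool(4, 20);
```

Changing the first argument of run_pool adjusts the multiprogramming level without touching the workload, and vice versa. For work items that carry real data rather than a bare count, a proper queue would replace the counter file.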