Preliminary profiling confirms that CHI is indeed the bottleneck. 95% of the time is spent outside Data::Throttler_CHI itself, and CHI::* code alone accounts for roughly 73% or more; the rest of the overhead comes from serialization, logging, and so on. So reducing the number of cache item retrievals from CHI looks like the primary strategy for a speed-up. In other words, we need to cache CHI itself, thereby defeating its purpose in the first place :-)
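To make the "fewer retrievals" idea concrete, here is a minimal sketch in core Perl. Both classes are hypothetical and exist only for illustration: `CountingStore` stands in for an expensive backend (it is not the CHI API) and merely counts how often `get` is called, while `ReadThrough` keeps values in a per-process hash in front of it. In a real throttler such a layer would hide updates made by other processes, which is exactly the purpose-defeating trade-off joked about above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an expensive cache backend (NOT the CHI API).
# It only counts how many times get() is called.
package CountingStore;
sub new { my ($class) = @_; return bless { data => {}, gets => 0 }, $class }
sub get { my ($self, $key) = @_; $self->{gets}++; return $self->{data}{$key} }
sub set { my ($self, $key, $val) = @_; $self->{data}{$key} = $val; return $val }

# Hypothetical per-process layer: answers reads from a local hash when it can,
# and writes through to the backend so the stored value stays up to date.
package ReadThrough;
sub new { my ($class, $store) = @_; return bless { store => $store, memo => {} }, $class }
sub get {
    my ($self, $key) = @_;
    $self->{memo}{$key} = $self->{store}->get($key)
        unless exists $self->{memo}{$key};
    return $self->{memo}{$key};
}
sub set {
    my ($self, $key, $val) = @_;
    $self->{memo}{$key} = $val;               # keep the local copy current
    return $self->{store}->set($key, $val);   # write through to the backend
}

package main;

my $store = CountingStore->new;
my $layer = ReadThrough->new($store);

$layer->set(hits => 0);
$layer->set(hits => $layer->get('hits') + 1) for 1 .. 1000;

# Every read after the first set() is served from the local hash.
printf "value=%d backend_gets=%d\n", $layer->get('hits'), $store->{gets};
```

All 1000 increments are written to the backend, but none of them has to read from it, which is the kind of saving the profile suggests would matter most.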
On the other hand, I managed to speed up Data::Throttler itself significantly simply by replacing Log::Log4perl with Log::ger. The Data::Throttler code logs a lot, so by removing the logging statements (e.g. using Log::ger::Plugin::OptAway) we can cut the running time to about one third!
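Why removing log statements helps so much can be illustrated without any logging framework. The sketch below does not use the Log::ger or Log::Log4perl API; it relies only on core Perl's `use constant`, which lets perl drop a guarded statement at compile time, and compares that against a logger that checks its level at run time on every call. The names `DEBUG`, `log_runtime`, `work_runtime`, and `work_optimized` are made up for this example, and the numbers it prints will vary by machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustration only: hypothetical names, not the Log::ger or Log::Log4perl API.
use constant DEBUG => 0;    # compile-time flag; "... if DEBUG" statements are folded away

my $level = 0;              # run-time log level, checked on every call
my @sink;                   # where log lines would end up

# A logger that is always called and decides at run time whether to log.
sub log_runtime { push @sink, $_[0] if $level > 0 }

sub work_runtime {
    my $n = 0;
    for my $i (1 .. 100) {
        log_runtime("step $i");          # sub call + string interpolation each time
        $n += $i;
    }
    return $n;
}

sub work_optimized {
    my $n = 0;
    for my $i (1 .. 100) {
        push @sink, "step $i" if DEBUG;  # removed at compile time because DEBUG is 0
        $n += $i;
    }
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    runtime_check  => \&work_runtime,
    optimized_away => \&work_optimized,
});
```

Code that logs on nearly every step pays for the call and the message construction even when nothing is written, so a logger that can make disabled statements disappear entirely has a large advantage in that situation.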