One of the fastest general Life algorithms I know of is vlife, used by Adam P. Goucher in apgmera. It supports arbitrary (outer-totalistic) rules; the generic bits are written in C++ (and live in includes/vlife.h), while the rule-specific parts are in assembly, using SSE2/AVX1/AVX2 (whatever is available), generated by a Python script. It's a very clever algorithm using very clever data structures that probably achieves close to the maximum of what you can eke out of a modern CPU, speed-wise, at least without resorting to even newer instruction sets.
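For reference, here is what "outer-totalistic" means in practice: a cell's next state depends only on its current state and the number of live neighbours among the eight surrounding cells. The following is a deliberately naive sketch of that idea in Python, not vlife's algorithm (which works on tiles with SIMD instructions). The names parse_rule and step are made up for illustration, and B0 rules (where empty space comes alive) are not handled.

```python
from collections import Counter


def parse_rule(rule):
    """Parse a 'B3/S23'-style rulestring into (birth, survival) sets.

    Hypothetical helper for this sketch; assumes the B part comes first.
    """
    birth_part, survival_part = rule.upper().split("/")
    birth = {int(ch) for ch in birth_part.lstrip("B")}
    survival = {int(ch) for ch in survival_part.lstrip("S")}
    return birth, survival


def step(live, birth, survival):
    """Advance a set of live (x, y) cells by one generation."""
    # Count, for every cell adjacent to a live cell, how many live
    # neighbours it has in its 8-cell Moore neighbourhood.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Live cells with zero neighbours never show up in counts, so add
    # them explicitly; Counter returns 0 for missing keys.
    candidates = set(counts) | set(live)
    return {
        cell
        for cell in candidates
        if counts[cell] in (survival if cell in live else birth)
    }


if __name__ == "__main__":
    birth, survival = parse_rule("B3/S23")  # Conway's Life
    blinker = {(0, 1), (1, 1), (2, 1)}
    print(sorted(step(blinker, birth, survival)))
```

A set-of-cells representation like this is easy to check against, but it touches every neighbour of every live cell individually, which is exactly the kind of per-cell work that a bit-parallel implementation avoids.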
It includes a benchmark in which a well-known methuselah (Lidka) is run for 30k generations, about as long as it takes to stabilize. To try it, run make vlifetest and then the resulting vlifetest binary. It would be interesting to compare this to your C++ and Perl implementations, as well as CPAN's offerings. On my machine (which is a couple of years old and only supports AVX1), the average time across the benchmark's 50 iterations is 135.72 ms:
$ ./vlifetest
VersaTile size: 336
Instruction set AVX1 supported.
Population count: 1623
Tiles processed: 596194
[...]
Lidka + 30k in 135.72 ms.
$
EDIT: on the same machine, tbench1.cpp takes about a minute to run Lidka to completion:
$ ./tbench1 lidka_106.lif 30000
cell count at start = 13
run benchmark for 30000 ticks
cell count at end = 1623
time taken 60 secs
$
The benchmark of the Perl implementation is still running.
EDIT 2: Perl ended up being faster than I thought, but at more than 40 minutes, it's still pretty slow compared to the C++ version, let alone vlife:
$ perl tbench1.pl lidka_106.lif 30000
cell count at start = 13
run benchmark for 30000 ticks
cell count at end = 1623
time taken: 2441 secs
$
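To put the three timings side by side, here is a small Python sketch that turns them into per-generation costs and slowdown factors relative to vlife. The numbers are just the ones quoted above; the labels and the slowdown helper are made up for illustration. Keep in mind that the vlife figure is an average over 50 iterations while the other two are single runs (and the C++ one is only reported to the nearest second), so the ratios are rough.

```python
# Wall-clock times in seconds for running Lidka for 30000 generations,
# copied from the benchmark output quoted in this post.
GENERATIONS = 30_000
timings_s = {
    "vlife (AVX1)": 135.72e-3,   # average of 50 iterations
    "tbench1 (C++)": 60.0,       # single run, 1-second resolution
    "tbench1.pl (Perl)": 2441.0, # single run
}

baseline = timings_s["vlife (AVX1)"]


def slowdown(seconds, reference=baseline):
    """How many times slower a run is than the reference run."""
    return seconds / reference


if __name__ == "__main__":
    for name, seconds in timings_s.items():
        per_gen_us = seconds / GENERATIONS * 1e6  # microseconds per generation
        print(f"{name:18s} {seconds:10.3f} s "
              f"{per_gen_us:12.2f} us/gen {slowdown(seconds):10.0f}x")
```

By this arithmetic the C++ version comes out somewhere around 440 times slower than vlife, and the Perl version about 40 times slower than the C++ one.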
EDIT 3: apgmera has since seen a new major release (4.x), and the algorithmic guts now live in a separate repo, lifelib. The author informs me that it's actually about 8% faster than the vlife code in 3.x, too.