|Perl: the Markov chain saw|
the sorry state of Perl unit testing frameworkby bulk88 (Priest)
|on Apr 07, 2015 at 06:51 UTC||Need Help??|
Updated: Test::Tiny was benchmarked and analyzed, Test::More alpha release 1.301001_101 with new backend tried
Soon I will be going to QAH 2015 in Berlin. My number one goal is to get parallel testing for TAP::Harness working on Win32 Perl but this post isn't about parallel testing. TAP::Harness stymied me many times trying to touch the codebase. It is militantly OOP, and obfuscated with its own implementation of method dispatch, and is written in a "declarative language" similar to a makefile. It has layers of faux-abstraction, that superficially purports to support pluggability, yet doesn't allow anything but the current implementation to fit the abstraction layers. I belive it is a pedagogical exercise in a ivory tower, which in plain terms, it is a homework assignment for appealing to a professor's ego that shows off all the skills itemized on your syllabus.
Researching the bloated design of TAP::Harness also brings me to investigate the other side of the TAP connection, Test::Simple/Test::More. It is equally inefficient I discovered.
All these tests were done a 2.6 ghz 2 core machine, running 32 bit WinXP. The primary tool I use for benchmarking in this post is called timeit.exe. It is a simples times() like benchmark tool that asks the NT kernel for its counters after the process exits. The resolution of these counters is 15.625 ms but the workloads I use to benchmark take seconds or minutes to complete, so 15.625 ms resolution isn't an issue.
The workload I use is always 1 million tests, the tests are run by GenTAP.pm, which is from http://github.com/bulk88/Win32-APipe. GenTAP.pm and fastprint.t, fasttinyok.t and fastok.t should be portable and run on any Perl platform if you want to try reproducing these benchmarks yourself.
I refrained from using nytprof in this write up since nytprof has overhead, and questioning individual subs, and lines of code, is pointless, if slowness is a conscious systemic design rule, not a couple bad drive-by patches over the years. The output is redirected to a file, since this way there is no overhead of the Win32 console in writing to STDOUT.
Test::Morefastok.t calls Test::More's ok() from version 1.001014, 1 million times in a loop with each test always passing, and a randomly randomized test name.
which leads to 0.000119 seconds to do 1 ok() call, or 0.1 milliseconds. This is signifigant, for each 10000 tests, 1 second of overhead. It also means, you can't run more than 10000 tests per second on per core no matter what you do. How often do you run "make test" and wait for it to finish and it feels like filling a gas tank, how slow is travis or whatever CI solution you use?
If your unit testing consists of code generated permutations of parameters to your module, 10,000 tests or 100,000 is very easily reachable on 1 software project/module. Some very popular CPAN modules do this style of code generated permutations of tests.
Now what could be the fastest possible TAP generation? "type file_of_tap.txt" or "cat file_of_tap.txt" is cheating. Under the same conditions, which is the best case scenario for the overhead of test to compare Test::More against, is a simple "print "ok ".(++$counter)." - $testname\n";" in a loop instead of "ok(1, $testname)", which is what fastprint.t does.
0.0000021 seconds per the DIY ok().
0.000119/0.0000021 57x more time. FIFTY SEVEN times more CPU. To summarize, if you use Test::More, you might as well imagine there is an gig ethernet cable and a UDP socket between your TAP emitter and TAP consumer.
Now about memory use of Test::More. I modified fastok.t as such
before 3,828KB, after 397,828 KB, peak 397,840 KB.
(397840-3828)/1000000=0.394012 KB per test emitted. 400 bytes per test. What on earth is in those 400 bytes? My test name passed to T::M::ok() is always 42 bytes long. Lets round that to the next 16 byte win32 malloc boundary, 48+12(perl's win32 ithread malloc wrapper in vmem.h)+16(sv head)+8(svpv body)+4(SV * somewhere else)=88 bytes for storing the test name. Where did the other 300 bytes go? Why is Test::More saving the names of passing tests? showing off your skills in LOC per hour for your CV? writing job-for-life unmaintable code? The TAP parser is responsible for maintain TAP state, not the TAP emitter. A TAP emitter should have no memory increase between sucessive calls to ok().
Test::More has no competitors on CPAN except for Test::Tiny which makes no attempt at API compatibility but has a similar ok() sub. So using fasttinyok.t, which calls Test::Tiny's ok() sub 1 million times, I get
Test::Tiny's ok() is (5.140/2.109=2.437) 2.4x slower than my ideal DIY ok() implementation. which compared to Test::More's 57x slower, 2.4x is a rounding error. Test::Tiny is a real working CPAN module remember.
About Test::Tiny's memory usage, using the same breakpoint positions, before 3,008KB, after 3,028KB, 3028-3008=20 KB. 20000/1000000=0.02 bytes per test, which means unmeasurable small. Basically no increase in memory usage, unlike the 100s of MBs seen with Test::More.
Even a drop in replacement for Test::More that is 10x slower than the ideal DIY ok() implementation above, or in other words 4x slower than Test::Tiny, is still 5x times faster than Test::More. Just about anything is faster than a sloth pulling a wagon. Something needs to be done about Test::More, the entire perl community relies on it, and it is unworkably slow. Either a drop in replacement, or replacing all of the internals of Test::More with a simplied architecture.
I was told to try a alpha release (1.301001101) of Test::More which included a new backend which hoped to improve its performance. I will therefore benchmark it.
The results are bad. 10 seconds more old stable 1.001014 Test::More, or 9% slower.
TAP::HarnessOnto the TAP consumer, TAP::Harness. For the next example, remember fastprint.t takes 2 seconds of CPU to print its 1 million tests. I dont think fastprint.t's process time is included by timeit.exe tool but with the numbers shown, 2 seconds is a rounding error if it is included.
(60+6)/1000000 = 0.000066 seconds for TAP::Harness to process 1 TAP test. It is better than Test::More, with TAP::Harness taking, for parsing 1 test, 55% of the time it takes, for Test::More to emit 1 test.
Now about the memory usage of TAP::Harness. Checking the process memory with Task Manager with a breakpoint right before "test_harness" sub is called, shows 5,868 KB, Windows OS shows the process peaked at 106,368 KB, and a breakpoint right after test_harness sub, shows 96,636 KB. There are 2 memory problems here that need to be broken down.
Problem 1, TAP::Harness uses about 100 bytes of memory for each *passing* test ((106368000-5868000)/1000000). The internal state of tests results isn't a sparse array or linked list of failed tests, or a vec() or C bitfield, heck, it isn't even a @array with undef/uninit slices for successful tests, which would be 4 bytes per test on a 32bit OS. It is 100 bytes per passing test. What is 100 bytes? 100/4 is 25 pointers/slice members. For reference, each scalar you create, on 32 bit OSes, is 4 slice members. And I will guess it uses a blessed hash with 1 or 2 hash keys for each test without even looking at TAP::Harness's implementation.
Problem 2, in the breakpoint, after test_harness() was executed, memory dropped from 106,368 KB to 96,636 KB. Only about 10MB. What is inside the 96,636-5,868=90,768KB?
Here is the console log with the breakpoints (the "Press any key to continue . . ." lines) to show where memory was sampled.
was already printed, so what do those 90 MB contain? Why is TAP::Harness holding onto state after printing the final line? surely this can't all be malloc fragmentation preventing a release of memory? Or was TAP::Harness written to leak memory since "the process will exit soon, dont waste CPU freeing anything", perhaps a legitimate reason? I doubt that was the intention of the person who designed TAP::Harness's OOP API.
In combination, TAP::Harness+Test::More take 0.185 ms per ok() call, of overhead. is() which is more common than ok(), will probably take even more than 0.185ms, so 0.185 ms per test, is the current best case scenario using the existing unit testing framework.
Test::MoreRewrite Test::Simple and Test::More back into standalone modules like they were 20 years ago, and remove their usage of Test::Builder would be best, but that requires Test::More's authors and maintainers to agree that the code is deeply flawed and agree to replacing it.
Summarizing some code in Test::Builder, do we really need to ever implement this, anywhere?
Since I expect signifigant protests from the peoples whoses CVs depend on protecting their precious snowflakes (see this incredulous post https://github.com/Test-More/test-more/issues/252 from 2012, leaking memory is by design and will never change by the then author), a drop in replacement for Test::More under a different namespace and patching dists away from T::M/T::B, and removing T::M/T::B from Perl 5 core is probably the easiest way forward.
TAP::HarnessRewriting TAP::Harness from scratch is probably the only solution, since a couple 3rd party modules are crazy enough to integrate themselves, like TAP::Harness::Archive. The typical "make test" has no use for TAP::Harness OOP bloat, with the only 2 options being TEST_VERBOSE on or off, and parallel or not.
I have done nytprof-ing of TAP::Harness, but nothing is fixable there without admitting all the design rules are a list of what not to do.
A simple design for a new harness would be, a TAP source class (usually a process class) that returns a stream class. The stream class returns a string name() (filename of .t, or a http:// URL of TAP or a disk path of TAP), and returns multi-KB blocks and eventually undef as EOS indicator, just 2 methods. For passing tests store nothing (undef), or store a "sparse" range of passing tests unblessed objects, store failed tests and unknown lines and diag in a linked list for dumping/summing at the end. Rewrite the parser in XS to quickly search for newline and "not ok" and "ok" token, maybe use BM algo. Even for a PP version, index and substr for a 1 pass parser through the block. For passed tests, store nothing. If a TAP stream has all passing tests, and reaches the end of stream, all the passing tests are represented by 1 hash with 2 keys (start range, end range). This is a long shot, but ideally, pipes shouldn't even be used between a TAP consumer and TAP emmitter. Future Test::More can communicate through shared memory through XS to future TAP::Harness, with the IPC buffer maybe looking like a stream, or the TAP "parsing" (no TAP is generated) is "done" (there is no TAP, just an array of test records) on the client side in Test::More, which also gives the benefit of "out of sequence" TAP being impossible due to perl thread safety in Test::More::XS.
Why is TAP::Harness's design flawed? I saw all of this being done with stepping and nytprof inside TAP::Harness.
For archival reasons, since Win32-APipe is a temporary name, and will be deleted soon. GenTAP.pm.