Counting Primes
It took me some time, on and off, to get the OpenMP demonstrations to perform similarly to Perl MCE + Inline::C. When only counting prime numbers, primes1.c now performs like algorithm3.pl. Likewise, primes3.c and primes4.codon perform like the primesieve binary and primesieve.pl.
Testing was done on a 32-core machine.
# Algorithm3
$ ./bin/algorithm3.pl 1e12
Primes found: 37607912018
Seconds: 14.711
$ ./demos/primes1.gcc 1e12
Primes found: 37607912018
Seconds: 14.499
$ ./demos/primes1.clang 1e12
Primes found: 37607912018
Seconds: 14.587
$ ./demos/primes1.nvc 1e12
Primes found: 37607912018
Seconds: 14.858
$ ./demos/primes2 1e12
Primes found: 37607912018
Seconds: 20.204
# Primesieve
$ /usr/local/bin/primesieve 1e12
Sieve size = 256 KiB
Threads = 64
100%
Seconds: 5.597
Primes: 37607912018
$ ./bin/primesieve.pl 1e12
Primes found: 37607912018
Seconds: 5.707
$ ./demos/primes3.gcc 1e12
Primes found: 37607912018
Seconds: 5.696
$ ./demos/primes3.clang 1e12
Primes found: 37607912018
Seconds: 5.767
$ ./demos/primes3.nvc 1e12
Primes found: 37607912018
Seconds: 5.841
$ ./demos/primes4 1e12
Primes found: 37607912018
Seconds: 5.719
Printing Primes
Outputting prime numbers is another story. With MCE, workers write their output to /dev/shm in parallel and pass the chunk_id to the manager process, which emits the chunks in order. This is very fast. The C and Codon demonstrations write directly to STDOUT in order, so threads have to wait their turn.
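To illustrate the "threads wait their turn" approach, here is a minimal sketch in C using OpenMP's ordered construct. It is an assumption about one way to get in-order output, not a copy of what primes1.c or primes3.c do. The names is_prime, format_chunk, and print_primes_ordered are hypothetical, and trial division stands in for the real sieve to keep the example short.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: trial division, a stand-in for the real sieve. */
static int is_prime(unsigned long n) {
    if (n < 2) return 0;
    for (unsigned long d = 2; d <= n / d; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Format the primes in [lo, hi] into buf, one per line; returns bytes used. */
static size_t format_chunk(unsigned long lo, unsigned long hi,
                           char *buf, size_t cap) {
    size_t len = 0;
    for (unsigned long n = lo; n <= hi; n++) {
        if (is_prime(n)) {
            int w = snprintf(buf + len, cap - len, "%lu\n", n);
            len += (size_t)w;
        }
    }
    return len;
}

/* Write all primes <= limit to out in ascending order.
 * Chunks are computed in parallel; only the write is serialized. */
static void print_primes_ordered(unsigned long limit, unsigned long chunk,
                                 FILE *out) {
    assert(chunk > 0);
    long nchunks = (long)((limit + chunk) / chunk);  /* covers 0..limit */

    #pragma omp parallel for ordered schedule(static, 1)
    for (long id = 0; id < nchunks; id++) {
        unsigned long lo = (unsigned long)id * chunk;
        unsigned long hi = lo + chunk - 1;
        if (hi > limit) hi = limit;

        /* At most 20 digits plus a newline per number, plus the NUL. */
        size_t cap = chunk * 21 + 1;
        char *buf = malloc(cap);
        if (!buf) abort();  /* kept simple for the sketch */

        size_t len = format_chunk(lo, hi, buf, cap);  /* parallel part */

        #pragma omp ordered
        {
            /* Threads reach this block in chunk order, one at a time. */
            fwrite(buf, 1, len, out);
        }
        free(buf);
    }
}
```

Calling print_primes_ordered(100, 16, stdout) prints the primes up to 100, one per line. Build with -fopenmp to run the chunks in parallel; without it the pragmas are ignored and the loop runs serially with the same output. How threads behave while waiting at the ordered block is up to the OpenMP runtime.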
The saddest moment was watching OpenMP burn power on threads that are merely waiting. I created issue tickets for LLVM OpenMP and NVIDIA HPC OpenMP. IMHO, only GCC OpenMP passes in this regard. This is the reason the GCC builds ran faster than the Clang and NVIDIA nvc builds.
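For contrast, waiting for a turn does not have to keep a core busy. The following is a minimal sketch, using POSIX threads rather than OpenMP, of a turn gate where waiting threads sleep on a condition variable until their chunk id comes up. It is not what the demos do and is not a claim about how any OpenMP runtime is implemented. The names turnstile, worker_arg, worker, and run_workers are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical gate that admits chunk ids strictly in ascending order. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    long            next;    /* id of the chunk allowed to write now */
} turnstile;

/* Sleep (no spinning) until it is this chunk's turn. */
static void turnstile_wait(turnstile *t, long id) {
    pthread_mutex_lock(&t->lock);
    while (t->next != id)
        pthread_cond_wait(&t->changed, &t->lock);
    pthread_mutex_unlock(&t->lock);
}

/* Hand the turn to the next chunk id and wake the sleepers. */
static void turnstile_done(turnstile *t) {
    pthread_mutex_lock(&t->lock);
    t->next++;
    pthread_cond_broadcast(&t->changed);
    pthread_mutex_unlock(&t->lock);
}

typedef struct {
    turnstile *gate;
    long       id;
    long      *order;   /* shared record of which id wrote when */
    long      *count;   /* number of entries recorded so far */
} worker_arg;

static void *worker(void *p) {
    worker_arg *a = p;
    /* The parallel work for chunk a->id would happen here. */
    turnstile_wait(a->gate, a->id);
    a->order[(*a->count)++] = a->id;   /* stands in for writing the chunk */
    turnstile_done(a->gate);
    return NULL;
}

/* Start n workers in reverse id order; returns how many ids were recorded. */
static long run_workers(long n, long *order) {
    assert(n > 0);
    turnstile gate;
    long count = 0;
    pthread_t  *tid = malloc((size_t)n * sizeof *tid);
    worker_arg *arg = malloc((size_t)n * sizeof *arg);
    if (!tid || !arg) abort();   /* kept simple for the sketch */

    pthread_mutex_init(&gate.lock, NULL);
    pthread_cond_init(&gate.changed, NULL);
    gate.next = 0;

    for (long i = n - 1; i >= 0; i--) {
        arg[i] = (worker_arg){ &gate, i, order, &count };
        if (pthread_create(&tid[i], NULL, worker, &arg[i]) != 0) abort();
    }
    for (long i = 0; i < n; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&gate.lock);
    pthread_cond_destroy(&gate.changed);
    free(tid);
    free(arg);
    return count;
}
```

Even though the workers are started in reverse order, run_workers(8, order) fills order with 0 through 7, because each worker sleeps until the gate reaches its id. Build with -pthread.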
Output size for 1e10 is 4.6 GB. Be sure to redirect it to a command (e.g. cksum) or to /dev/null.
# Algorithm3
$ ./bin/algorithm3.pl 1e10 -p >/dev/null
Seconds: 0.743
$ ./demos/primes1.gcc 1e10 -p >/dev/null
Seconds: 10.249
$ ./demos/primes1.clang 1e10 -p >/dev/null
Seconds: 12.696
$ ./demos/primes1.nvc 1e10 -p >/dev/null
Seconds: 14.326
$ ./demos/primes2 1e10 -p >/dev/null
Seconds: 12.369
# Primesieve
# the primesieve binary uses one core when -p is given
$ time /usr/local/bin/primesieve 1e10 -p >/dev/null
Seconds: 14.379
$ ./bin/primesieve.pl 1e10 -p >/dev/null
Seconds: 0.680
$ ./demos/primes3.gcc 1e10 -p >/dev/null
Seconds: 7.145
$ ./demos/primes3.clang 1e10 -p >/dev/null
Seconds: 8.826
$ ./demos/primes3.nvc 1e10 -p >/dev/null
Seconds: 11.249
$ ./demos/primes4 1e10 -p >/dev/null
Seconds: 8.597