Excellent work as always from marioroy. Much appreciated.
While struggling to learn OpenMP, I stumbled upon Intel's oneAPI Threading Building Blocks (aka oneTBB).
For some reason, I found this library easier to understand, so I decided to give it a try.
After downloading the oneapi-tbb-2021.7.0-lin.tgz release package from oneTBB 2021.7.0,
unpacking it under my Ubuntu $HOME dir, and running:
. $HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/env/vars.sh
to set the oneTBB variables, I was up and running.
This is the command I used to compile C++ programs:
g++ -o llil3vec-tbb -std=c++20 -Wall -O3 \
  -I "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/include" \
  -L "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/lib" \
  llil3vec-tbb.cpp -l tbb
Update: Much later, I hit a problem with locales:
$ locale -a
C
C.utf8
POSIX
Since en_US.UTF8 is not in that list, constructing std::locale with that name throws.
I fixed the crashing par1.cpp (in the pgatram dir) by changing:
// std::cout.imbue(std::locale{"en_US.UTF8"});
std::cout.imbue(std::locale{"C.utf8"});
in the sample code from transform_reduce.
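A more portable variant of that fix is to try several locale names and fall back to the classic "C" locale when none is installed. The following is only a minimal sketch: first_available_locale is a hypothetical helper name, not part of par1.cpp or the transform_reduce sample, and it relies on the std::locale constructor throwing std::runtime_error for an unknown name.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from par1.cpp): return the first locale in
// `candidates` that can be constructed on this system. If none of them
// is installed, fall back to the classic "C" locale, which always exists.
inline std::locale first_available_locale(const std::vector<std::string>& candidates)
{
    for (const auto& name : candidates) {
        try {
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // This locale is not installed; try the next candidate.
        }
    }
    return std::locale::classic();
}
```

It could then be used as std::cout.imbue(first_available_locale({"en_US.UTF8", "C.utf8"})); so the program no longer crashes on a machine where only some of the names are available.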
What attracted me to TBB was how easy it was to try updating a std::map from multiple threads
with minimal changes, simply by changing from:
using map_str_int_type = std::map<str_type, llil_int_type>;
to:
using map_str_int_type = tbb::concurrent_map<str_type, llil_int_type>;
While that worked, it ran a little bit slower, presumably due to the locking overhead
associated with updating the tbb::concurrent_map variable (hash_ret[word] -= count)
from multiple threads.
To avoid crashes, I further needed to break down the get_properties
function so that each thread operated on a different input file
(see get_properties_one_file function below).
I was able to get a minor speedup (similar to what I saw with OpenMP) by using this library without
any locking on a vector, as shown in the sample code below:
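The lock-free idea can be sketched with the standard library alone. Note that this is not the TBB code from this node: it swaps in std::thread for TBB's scheduler, and parse_lines and collect_without_locks are hypothetical names, with parse_lines standing in for get_properties_one_file. Each thread writes only to its own slot of locvec, so no mutex is needed, and the slots are concatenated after all threads have joined.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using str_int_type     = std::pair<std::string, long long>;
using vec_str_int_type = std::vector<str_int_type>;

// Hypothetical stand-in for get_properties_one_file: parse "word\tcount"
// lines. Assumes every count is a valid integer; lines without a tab are skipped.
inline vec_str_int_type parse_lines(const std::vector<std::string>& lines)
{
    vec_str_int_type out;
    for (const auto& line : lines) {
        const auto tab = line.find('\t');
        if (tab == std::string::npos) continue;
        out.emplace_back(line.substr(0, tab), std::stoll(line.substr(tab + 1)));
    }
    return out;
}

// One thread per input "file"; thread i writes only to locvec[i].
inline vec_str_int_type collect_without_locks(
    const std::vector<std::vector<std::string>>& files)
{
    std::vector<vec_str_int_type> locvec(files.size());
    std::vector<std::thread> workers;
    workers.reserve(files.size());
    for (std::size_t i = 0; i < files.size(); ++i) {
        workers.emplace_back([&files, &locvec, i] {
            locvec[i] = parse_lines(files[i]);
        });
    }
    for (auto& t : workers) t.join();

    // Single-threaded merge, in input order, after all workers are done.
    vec_str_int_type vec_ret;
    for (auto& part : locvec) {
        vec_ret.insert(vec_ret.end(),
                       std::make_move_iterator(part.begin()),
                       std::make_move_iterator(part.end()));
    }
    return vec_ret;
}
```

Because the merge happens in slot order, the result is deterministic regardless of which thread finishes first.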
Timings of OpenMP vs oneTBB on my machine
OpenMP version:
$ time ./llil4vec_p big1.txt big2.txt big3.txt >vec.tmp
llil2vec (fixed string length=6) start
get_properties time : 0.853868 secs
emplace set sort time : 0.972116 secs
write stdout time : 0.875946 secs
total time : 2.70229 secs
real 0m3.041s
user 0m5.553s
sys 0m0.745s
oneTBB version:
$ time ./llil3vec-tbb big1.txt big2.txt big3.txt >f.tmp
llil3vec-tbb (fixed string length=6) start
use TBB
get_properties CPU time : 3.30477 secs
emplace set sort CPU time : 0.825981 secs
write stdout CPU time : 0.866964 secs
total CPU time : 4.99808 secs
total wall clock time : 3 secs
real 0m2.890s
user 0m4.687s
sys 0m0.646s
The real time of 2.890s reported by the Linux time command for the TBB version
compares favourably with the 3.041s of the OpenMP version.
Apart from performing better on modern CPU caches, std::vector seems to also outperform std::map
in multi-threaded programs, due to the locking overhead of updating a global map from multiple threads.
Update: Timings for llil3vec-tbb-a.cpp below, built with clang++ and fast_io, are slightly faster:
$ time ./llil3vec-tbb-a big1.txt big2.txt big3.txt >f.tmp
llil3vec-tbb-a (fixed string length=6) start
use TBB
get_properties CPU time : 2.9722 secs
emplace set sort CPU time : 0.61156 secs
write stdout CPU time : 0.828023 secs
total CPU time : 4.41257 secs
total wall clock time : 3 secs
real 0m2.552s
user 0m4.304s
sys 0m0.438s
Updated Simpler Version llil3vec-tbb-a.cpp
Update: built with:
clang++ -o llil3vec-tbb-a -std=c++20 -Wall -O3 \
  -I "$HOME/llil/cmdlinux/fast_io/include" \
  -I "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/include" \
  -L "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/lib" \
  llil3vec-tbb-a.cpp -l tbb
Oh, and see llil4vec-tbb.cpp in Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) by marioroy for a cleaner way to merge the
local array locvec[i] into vec_ret, via a scoped lock and a mutex,
thus eliminating the ugly locvec[MAX_INPUT_FILES_L] array.
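For readers who have not seen the pattern, merging a thread-local vector into a shared one under a scoped lock looks roughly like this. It is a sketch only, not marioroy's llil4vec-tbb.cpp: it uses std::thread instead of TBB, collect_with_mutex is a hypothetical name, and copying per_file[i] stands in for parsing one input file.

```cpp
#include <cstddef>
#include <iterator>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using str_int_type     = std::pair<std::string, long long>;
using vec_str_int_type = std::vector<str_int_type>;

// Hypothetical sketch: each thread builds a private vector, then appends
// it to the shared vec_ret while holding a mutex. No fixed-size array of
// per-file vectors is needed.
inline vec_str_int_type collect_with_mutex(
    const std::vector<vec_str_int_type>& per_file)
{
    vec_str_int_type vec_ret;
    std::mutex vec_mutex;
    std::vector<std::thread> workers;
    workers.reserve(per_file.size());

    for (std::size_t i = 0; i < per_file.size(); ++i) {
        workers.emplace_back([&per_file, &vec_ret, &vec_mutex, i] {
            // Stand-in for parsing one input file into a local vector.
            vec_str_int_type local = per_file[i];

            // The lock is held only for the append, not for the parsing.
            std::scoped_lock<std::mutex> lock(vec_mutex);
            vec_ret.insert(vec_ret.end(),
                           std::make_move_iterator(local.begin()),
                           std::make_move_iterator(local.end()));
        });
    }
    for (auto& t : workers) t.join();
    return vec_ret;
}
```

The order of the merged chunks depends on which thread takes the lock first, which does not matter here because the vector is sorted afterwards anyway.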
Updated timings for llil4vec-tbb on my machine can be found here.
Updated: Added simpler llil3vec-tbb-a.cpp version.