Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re^6: Rosetta Code: Long List is Long (faster - vec - OpenMP)

by marioroy (Prior)
on Jan 11, 2023 at 16:51 UTC ( [id://11149530]=note: print w/replies, xml ) Need Help??


in reply to Re^5: Rosetta Code: Long List is Long (faster - vec - fast_io)
in thread Rosetta Code: Long List is Long

So far I haven't found any statistically significant speed-ups from employing this library ...

After closer look, it was the -std=c++20 language mode that enables faster vectors by 0.2 ~ 0.4 seconds versus -std=c++11.

Update 1: Support variable length words.
Update 2: Enable parallel sort. See Using Parallel Mode.

I tried OpenMP. Unfortunately, strtok is not thread-safe e.g. strtok(NULL, "\n") causing segfault. So I factored out strtok. The OpenMP result improved by 0.1 seconds. That's because the actual reading is already fast. It takes 2 threads minimally to run faster than non-OpenMP results due to populating vec_rec from local copies.

Building:

clang++ -o llil4vec -std=c++11 -Wall -O3 llil4vec.cpp clang++ -o llil4vec -std=c++20 -Wall -O3 llil4vec.cpp # faster # enable parallel via -fopenmp clang++ -o llil4vec-omp -std=c++11 -fopenmp -Wall -O3 llil4vec.cpp clang++ -o llil4vec-omp -std=c++20 -fopenmp -Wall -O3 llil4vec.cpp # +faster

Running - Real time results:

$ time ./llil4vec big1.txt big2.txt big3.txt >out.txt std:c++11: 2.901 secs std:c++20: 2.850 secs $ time OMP_NUM_THREADS=3 ./llil4vec-omp big1.txt big2.txt big3.txt >ou +t.txt std:c++11: 2.308 secs std:c++20: 2.267 secs

llil4vec.cpp modification, OpenMP-aware:

#if defined(_OPENMP) #include <omp.h> #include <parallel/algorithm> #else #include <algorithm> #endif ... static void get_properties( int nfiles, // in: the number of input files char* fname[], // in: the input file names vec_int_str_type& vec_ret) // out: a vector of properties { #if defined(_OPENMP) omp_set_dynamic(0); omp_set_max_active_levels(1); #pragma omp parallel { vec_int_str_type vec_loc; // thread local copy #pragma omp for nowait schedule(static,1) #endif for (int i = 0; i < nfiles; ++i) { char line[MAX_LINE_LEN_L+1]; char* found; llil_int_type count; FILE* fh = ::fopen(fname[i], "r"); if (fh == NULL) { std::cerr << "Error opening '" << fname[i] << "' : errno=" + << errno << "\n"; continue; } while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) { found = ::strchr(line, '\t'); count = ::atoll( &line[found - line + 1] ); line[found - line] = '\0'; // word #ifdef MAX_STR_LEN_L str_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0' } +}; ::memcpy( fixword.data(), line, found - line ); #if defined(_OPENMP) vec_loc.emplace_back( -count, fixword ); #else vec_ret.emplace_back( -count, fixword ); #endif #else #if defined(_OPENMP) vec_loc.emplace_back( -count, line ); #else vec_ret.emplace_back( -count, line ); #endif #endif } ::fclose(fh); } #if defined(_OPENMP) #pragma omp critical vec_ret.insert(vec_ret.end(), vec_loc.begin(), vec_loc.end()); } #endif // Needs to be sorted by word for later sum of adjacent count field +s to work #if defined(_OPENMP) __gnu_parallel::sort( vec_ret.begin(), vec_ret.end(), [](const int_str_type& left, const int_str_type& right) { return + left.second < right.second; } ); #else std::sort( vec_ret.begin(), vec_ret.end(), [](const int_str_type& left, const int_str_type& right) { return + left.second < right.second; } ); #endif }

Replies are listed 'Best First'.
Re^7: Rosetta Code: Long List is Long (faster - llil4vec - OpenMP code)
by marioroy (Prior) on Jan 12, 2023 at 09:55 UTC

    Helpful, OpenMP Little Book which introduces OpenMP C/C++ concurrency programming.

    OpenMP spawns a thread per logical core by default. Therefore, I set the desired number of threads via the OMP_NUM_THREADS environment variable (... or NUM_THREADS ...) to minimize extra threads creation and reaping.

    $ NUM_THREADS=3 ./llil4vec big1.txt big2.txt big3.txt >out.txt llil4vec (fixed string length=12) start use OpenMP use boost sort get properties 0.131 secs sort properties 0.191 secs vector reduce 0.036 secs vector stable sort 0.267 secs write stdout 0.191 secs total time 0.818 secs

    llil4vec.cc:

      Which is more efficient w.r.t. CPU utilization between g++ and clang++? Let's find out with 30 input files and pay close attention to user times. g++ wins for parallel sort as lesser user time is better considering real times are similar.

      cp big1.txt big4.txt cp big2.txt big5.txt cp big3.txt big6.txt cp big1.txt big7.txt cp big2.txt big8.txt cp big3.txt big9.txt cp big1.txt biga.txt cp big2.txt bigb.txt cp big3.txt bigc.txt cp big1.txt bigd.txt cp big2.txt bige.txt cp big3.txt bigf.txt cp big1.txt bigg.txt cp big2.txt bigh.txt cp big3.txt bigi.txt cp big1.txt bigj.txt cp big2.txt bigk.txt cp big3.txt bigl.txt cp big1.txt bigm.txt cp big2.txt bign.txt cp big3.txt bigo.txt cp big1.txt bigp.txt cp big2.txt bigq.txt cp big3.txt bigr.txt cp big1.txt bigs.txt cp big2.txt bigt.txt cp big3.txt bigu.txt

      g++

      $ g++ -o llil4vec-omp -std=c++20 -fopenmp -Wall -O3 llil4vec.cpp -I./f +ast_io/include $ time OMP_NUM_THREADS=30 ./llil4vec-omp big?.txt >out.txt llil4vec (fixed string length=6) start use OpenMP get properties time : 0.937948 secs sort properties time : 0.494592 secs emplace set sort time : 1.02668 secs write stdout time : 0.50569 secs total time : 2.96497 secs real 0m3.271s <-- total wall time, performs similarly user 0m16.823s <-- nothing unusual, fans seem quiet sys 0m4.188s

      clang++

      $ clang++ -o llil4vec-omp -std=c++20 -fopenmp -Wall -O3 llil4vec.cpp - +I./fast_io/include $ time OMP_NUM_THREADS=30 ./llil4vec-omp big?.txt >out.txt llil4vec (fixed string length=6) start use OpenMP get properties time : 1.07873 secs sort properties time : 0.717261 secs emplace set sort time : 0.606499 secs write stdout time : 0.437287 secs total time : 2.83984 secs real 0m3.201s <-- total wall time, performs similarly user 0m45.889s <-- system fans noticeably louder sys 0m5.934s

      Next, I tried J Script with AVX2 enabled for comparison. 30 threads, specified in the script.

      $ time ~/j904/bin/jconsole llil4_p.ijs big?.txt out.txt Read and parse input: 2.43996 Classify, sum, sort: 4.63914 Format and write output: 0.848965 Total time: 7.92807 real 0m7.934s <-- total wall time user 0m24.782s <-- this too, system fans roaring sys 0m26.764s <-- because user and sys combined

      I find it interesting for clang++'s user time to be nearly 3x that of g++.

      Excellent work as always from marioroy. Much appreciated.

      While struggling to learn OpenMP I stumbled upon Intel's OneAPI Threading Building Blocks (aka oneTBB). For some reason, I found this library easier to understand, so decided to give it a try. After downloading the oneapi-tbb-2021.7.0-lin.tgz release package from oneTBB 2021.7.0 and unpacking it under my Ubuntu $HOME dir and running:

      . $HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/env/vars.sh
      to set the oneTBB variables, I was up and running. This is the command I used to compile C++ programs:
      g++ -o llil3vec-tbb -std=c++20 -Wall -O3 -I "$HOME/local-oneapi-tbb/on +eapi-tbb-2021.7.0/include" -L "$HOME/local-oneapi-tbb/oneapi-tbb-2021 +.7.0/lib" llil3vec-tbb.cpp -l tbb

      Update: Much later, I hit a problem with locales:

      $ locale -a C C.utf8 POSIX
      Fixed crashing par1.cpp in pgatram dir by changing:
      // std::cout.imbue(std::locale{"en_US.UTF8"}); std::cout.imbue(std::locale{"C.utf8"});
      in sample code from transform_reduce.

      What attracted to me to TBB was the ease of trying out updating a std::map from multiple threads with minimal changes simply by changing from:

      using map_str_int_type = std::map<str_type, llil_int_type>;
      to:
      using map_str_int_type = tbb::concurrent_map<str_type, llil_int_type>;

      While that worked, it ran a little bit slower, presumably due to the locking overhead associated with updating the tbb::concurrent_map variable (hash_ret[word] -= count) from multiple threads. To avoid crashes, I further needed to break down the get_properties function so that each thread operated on a different input file (see get_properties_one_file function below).

      I was able to get a minor speedup (similar to what I saw with OpenMP) by using this library without any locking on a vector, as shown in the sample code below:

      Timings of OpenMP vs OneTbb on my machine

      OpenMp version: $ time ./llil4vec_p big1.txt big2.txt big3.txt >vec.tmp llil2vec (fixed string length=6) start get_properties time : 0.853868 secs emplace set sort time : 0.972116 secs write stdout time : 0.875946 secs total time : 2.70229 secs real 0m3.041s user 0m5.553s sys 0m0.745s

      OneTbb version: $ time ./llil3vec-tbb big1.txt big2.txt big3.txt >f.tmp llil3vec-tbb (fixed string length=6) start use TBB get_properties CPU time : 3.30477 secs emplace set sort CPU time : 0.825981 secs write stdout CPU time : 0.866964 secs total CPU time : 4.99808 secs total wall clock time : 3 secs real 0m2.890s user 0m4.687s sys 0m0.646s

      The real time reported by the Linux time command when running the tbb version of 2.890s compares favourably with 3.041s of the OpenMP version.

      Apart from performing better on modern CPU caches, std::vector seems to also outperform std::map in multi-threaded programs, due to the locking overhead of updating a global map from multiple threads.

      Update: Timings for llil3vec-tbb-a.cpp below, built with clang++ and fast_io are slightly faster:

      $ time ./llil3vec-tbb-a big1.txt big2.txt big3.txt >f.tmp llil3vec-tbb-a (fixed string length=6) start use TBB get_properties CPU time : 2.9722 secs emplace set sort CPU time : 0.61156 secs write stdout CPU time : 0.828023 secs total CPU time : 4.41257 secs total wall clock time : 3 secs real 0m2.552s user 0m4.304s sys 0m0.438s

      Updated Simpler Version llil3vec-tbb-a.cpp

      Update: built with:

      clang++ -o llil3vec-tbb-a -std=c++20 -Wall -O3 -I "$HOME/llil/cmdlinux +/fast_io/include" -I "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/incl +ude" -L "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/lib" llil3vec-tbb +-a.cpp -l tbb

      Oh, and see llil4vec-tbb.cpp in Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) by marioroy for a cleaner way to merge the local array locvec[i] into vec_ret, via a scoped lock and a mutex, thus eliminating the ugly locvec[MAX_INPUT_FILES_L] array. Updated timings for llil4vec-tbb on my machine can be found here.

      Updated: Added simpler llil3vec-tbb-a.cpp version.

        Thank you, eyepopslikeamosquito. How nice to have a TBB implementation! OS vendors may have the TBB development package. I'm running Clear Linux and the TBB runtime libraries are installed with the OS. All I needed was the TBB dev bundle.

        Aloha eyepopslikeamosquito,

        I made an update to your simpler llil3vec-tbb-a.cpp version. I captured results and posted them here. The llil4vec OpenMP impl can be found here, for comparison.

        Before posting, I ran with 121 MB input files (many of them). I saw a 5 ~ 6 second gap in favor of OpenMP; e.g. 1 core running for a while between parallel_for and parallel_sort. This is resolved by having threads merge immediately after processing a file, similarly to the OpenMP implementation. Now, TBB performs well for small, big, or many input files.

        llil4vec-tbb.cc

Re^7: Rosetta Code: Long List is Long (faster - vec - OpenMP)
by eyepopslikeamosquito (Archbishop) on Jan 11, 2023 at 20:59 UTC

    Very interesting! This will help me get started with OpenMP, which I've been meaning to do for a while now.

    Unfortunately, strtok is not thread-safe e.g. strtok(NULL, "\n") causing segfault. So I factored out strtok.

    Ha ha, that I used strtok at all is an indication of how desperate I was, given I singled this function out for a special dishonourable mention at On Interfaces and APIs. :)

    To round out this section, notice that the ANSI C strtok function has a shocker of an interface: it surprises by writing NULLs into the input string being tokenized; the first call has different semantics to subsequent calls (distinguished by special-case logic on the value of its first parameter); and it stores state between calls so that only one sequence of calls can be active at a time (bad enough in a single-threaded environment, but so intolerable in multi-threaded and signal handler (reentrant) environments that the POSIX threads committee invented a replacement strtok_r function).

      It was weird experiencing the segfault due to strtok(NULL, "\n").

      Great! What is strtok_r() function in C language? I compared strtok_r with context against strchr. After testing, I updated the OpenMP demonstration to find the tab character within the string. Now passing: limited length and variable length words, including OpenMP. Update 2 enables parallel sort.

      # 15 input files, 1 thread, clang++ -std=c++20 strtok_r 9.655 seconds strchr 8.713 seconds
      while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) { found = ::strchr(line, '\t'); count = ::atoll( &line[found - line + 1] ); line[found - line] = '\0'; // word ... }

      Also, I replaced strlen(...) with found - line.

      ::memcpy( fixword.data(), line, found - line );

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11149530]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (6)
As of 2024-04-19 14:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found