Recently, I tried NVIDIA's nvc++ compiler and noticed a compile-time regression (more than 40 seconds). That's a long time. I reached out to NVIDIA, but there has been no resolution from their side. Some time later, I saw another regression when upgrading GCC from version 13 to 14. It seems that std::mutex regressed.
That spawned a chain reaction. I appended the March and April 2024 events to my summary page.
A spinlock mutex class resolved both issues and further improved performance (a sketch follows the links below). I reached out to Greg, the author of the phmap C++ library, because the LLiL challenge is interesting. He shared tips plus the string_cnt struct, which accommodates fixed-length and variable-length keys.
Clang performance regression due to GCC 14
phmap: Ability to reset the inner sub map
string_cnt struct commit by Greg
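For reference, here is a minimal sketch of what such a spinlock mutex class can look like, built on std::atomic_flag. It is an illustration only, not necessarily the exact class used in the demonstrations; anything providing lock()/unlock() works with std::lock_guard or std::scoped_lock.
// Minimal spinlock sketch using std::atomic_flag (illustrative only; not
// necessarily the exact spinlock_mutex used in the llil demonstrations).
#include <atomic>

class spinlock_mutex {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Busy-wait until the flag is acquired.
        while (flag.test_and_set(std::memory_order_acquire))
            ;   // spin
    }
    void unlock() { flag.clear(std::memory_order_release); }
};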
eyepopslikeamosquito's llil3vec.cpp (scroll down the page) inspired making the vector version llil4vec fast, though it gobbles memory for unlimited-length strings. That inspiration is the reason for the effort to make the map variants catch up while keeping memory consumption low. There are several map variants. The llil4hmap, llil4emh, and llil4umap demonstrations compute the key hash value only once and store it with the key (see the sketch after the list below).
llil4map sub maps managed by the C++ library, using phmap::parallel_flat_hash_map
llil4map2 memory efficient version, vector of pointers to the map key-value pairs
llil4hmap sub maps managed by the application, using phmap::flat_hash_map
llil4emh sub maps managed by the application, using emhash7::HashMap
llil4umap sub maps managed by the application, using std::unordered_map
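To illustrate the hash-once idea with assumed names (not the actual llil4hmap/llil4emh code), the key type can carry its precomputed hash and the map's hasher simply returns the stored value:
// Illustrative sketch: store the hash with the key so it is computed once.
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

struct hashed_key {
    std::string key;
    std::size_t hash_value;

    explicit hashed_key(std::string k)
        : key(std::move(k)), hash_value(std::hash<std::string>{}(key)) {}

    bool operator==(const hashed_key& other) const { return key == other.key; }
};

struct reuse_hash {
    std::size_t operator()(const hashed_key& k) const { return k.hash_value; }
};

int main() {
    std::unordered_map<hashed_key, int, reuse_hash> count_map;
    count_map[hashed_key("alpha")] += 1;   // hash computed in the constructor
    count_map[hashed_key("alpha")] += 1;   // reused here, not recomputed
    return count_map.size() == 1 ? 0 : 1;
}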
Greg Popovitch added a clever new type, string_cnt, supporting fixed-length and long strings without recompiling. See the enhanced llil4map found in his examples folder, which is also memory efficient. Note: I beautified the include directives in my demonstrations, matching your version for consistency.
I glanced through the long thread by eyepopslikeamosquito. At the time, we were thrilled to complete in less than 10 seconds, and even more so below 6 seconds. Fast forward to early 2024: the llil4map demonstration completes in about 0.3 seconds, thanks to linear scaling (all levels run in parallel). The llil4vec example may run faster, but it stops improving beyond a certain number of CPU cores.
We reached sub-second territory processing three input files.
Limit 12 CPU Cores
$ NUM_THREADS=12 ./llil4map big{1,2,3}.txt | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 0.311 secs
map to vector 0.055 secs
vector stable sort 0.095 secs
write stdout 0.036 secs
total time 0.500 secs
count lines 10545600
count unique 10367603
2956888413 93308427
$ NUM_THREADS=12 ./llil4vec big{1,2,3}.txt | cksum
llil4vec (fixed string length=12) start
use OpenMP
use boost sort
get properties 0.170 secs
sort properties 0.061 secs
vector reduce 0.017 secs
vector stable sort 0.065 secs
write stdout 0.047 secs
total time 0.362 secs
count lines 10545600
count unique 10367603
2956888413 93308427
No Limit
$ ./llil4map big{1,2,3}.txt | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 0.101 secs
map to vector 0.052 secs
vector stable sort 0.115 secs
write stdout 0.029 secs
total time 0.298 secs
count lines 10545600
count unique 10367603
2956888413 93308427
$ ./llil4vec big{1,2,3}.txt | cksum
llil4vec (fixed string length=12) start
use OpenMP
use boost sort
get properties 0.203 secs
sort properties 0.088 secs
vector reduce 0.024 secs
vector stable sort 0.103 secs
write stdout 0.029 secs
total time 0.449 secs
count lines 10545600
count unique 10367603
2956888413 93308427
Many thanks, eyepopslikeamosquito, for being there. I promise no more messages for a while. You inspired C++ in me. Next up is Taichi Lang.
Greg Popovitch shared a tip for releasing memory immediately during "map to vector". Thank you. That was also helpful for llil4umap.
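As an illustration of the idea (not the exact tip), entries can be moved out with C++17 node handles and the map then swapped with an empty one to release its memory right away, shown here with std::unordered_map as in llil4umap:
// Illustrative sketch: move entries out of a sub map during "map to vector",
// then release the map's memory immediately via swap-with-empty.
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::unordered_map<std::string, int> sub_map{{"alpha", 2}, {"beta", 5}};
    std::vector<std::pair<std::string, int>> propvec;
    propvec.reserve(sub_map.size());

    while (!sub_map.empty()) {
        auto node = sub_map.extract(sub_map.begin());    // C++17 node handle
        propvec.emplace_back(std::move(node.key()), node.mapped());
    }
    // The bucket array is still allocated; swapping with a temporary empty
    // map frees it now instead of when sub_map goes out of scope.
    std::unordered_map<std::string, int>().swap(sub_map);
    return propvec.size() == 2 ? 0 : 1;
}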
Blessings and grace,
Re: [OT] The Long List is Long resurrected
by eyepopslikeamosquito (Archbishop) on Apr 07, 2024 at 12:04 UTC
Fascinating work marioroy!
What I really like is that your code is mostly standard C++ that would appear to just work on just about any modern hardware. Is that right?
I remember desperately grovelling around with all sorts of system-specific hacks in The 10**21 Problem series -- such as pre-fetching, TLB, and Intel intrinsics (e.g. search for _mm_ in The 10**21 Problem (Part 4)) to get the performance I needed. So it seems like a dream to write standard C++ that automatically performs on all modern hardware.
For example, with NVIDIA's nvc++ compiler, would your C++ code automatically scale when run on a beast GPGPU machine with, say, six high end NVIDIA graphics cards?
> What I really like is that your code is mostly standard C++ that would appear to just work on just about any modern hardware. Is that right?
Replacing std::mutex with spinlock_mutex resolved the issue with nvc++ taking over 40 seconds to build llil4hmap.cc and llil4emh.cc. It is peculiar, so I reached out to NVIDIA about it. All is well otherwise; the binary performs similarly to one built with clang++.
> Would your C++ code automatically scale when run on a beast GPGPU machine with, say, six high end NVIDIA graphics cards?
Regarding llil4map/vec, the first step is replacing the OpenMP directives with C++17 (or higher) parallel and vector concurrency via execution policies. Get properties may not run on the GPU, but there is no reason why sorting cannot. However, the GPU may lack sufficient memory capacity for the LLiL challenge.
From the article: "The GPU version also uses the parallel execution policy, but is compiled with nvc++ and the -stdpar compiler option."
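A minimal sketch of what that replacement can look like (illustrative code, not the actual llil4map/vec sources). With g++/clang++ and libstdc++ it typically needs TBB (-ltbb); compiled with nvc++ and -stdpar, parallel algorithms such as this sort may be offloaded to the GPU.
// Illustrative sketch: sort key/count pairs with a C++17 parallel execution
// policy (count descending, key ascending, as in the LLiL output).
#include <algorithm>
#include <execution>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<std::string, int>> props = {
        {"beta", 3}, {"alpha", 7}, {"gamma", 1}};

    std::sort(std::execution::par, props.begin(), props.end(),
              [](const auto& a, const auto& b) {
                  return a.second != b.second ? a.second > b.second
                                              : a.first < b.first;
              });
    return props.front().first == "alpha" ? 0 : 1;
}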
Accelerating Standard C++ with GPUs Using stdpar
Accelerating Python on GPUs with nvc++ and Cython
NVIDIA HPC Standard Language Parallelism, C++ PDF document
See also C++ Parallel Algorithms.
The NVIDIA HPC SDK is currently version 23.9, released recently.
Re: [OT] The Long List is Long resurrected
by marioroy (Prior) on Apr 16, 2024 at 04:51 UTC
The journey ended wonderfully.
Below, the gap between total time and real time is the time spent freeing dynamically allocated memory at process exit, i.e. returning it to the OS. C++ strings are dynamically allocated when longer than the Small/Short String Optimization (SSO) limit; 16 here.
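As a quick aside, the SSO limit of a given standard library can be checked with a tiny standalone program (unrelated to the llil programs themselves):
// Print the inline (SSO) capacity of std::string for this standard library;
// keys longer than this are heap allocated.
#include <iostream>
#include <string>

int main() {
    std::cout << "SSO capacity: " << std::string().capacity() << '\n';
    // typically 15 with libstdc++ and 22 with libc++ (implementation details)
}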
llil4map
$ time ./llil4map long* long* long* | cksum
llil4map start
use OpenMP ( old ) ( new )
memory consumption 36.6 GB 17.5 GB
use boost sort
get properties 12.967s 11.836s
map to vector 3.244s 1.292s
vector stable sort 10.265s 11.025s
write stdout 1.410s 1.364s
total time 27.887s 25.519s
count lines 970195200
count unique 295755152
29612263 5038456270
real time 0m43.973s 0m27.461s
user 25m08.532s 20m02.479s
sys 0m55.602s 0m43.250s
llil4vec
$ time ./llil4vec long* long* long* | cksum
llil4vec start
use OpenMP ( old ) (mem-fix) ( new )
memory consumption 89.6 GB 62.4 GB 33.0 GB
use boost sort
get properties 48.133s 32.659s 12.993s
sort properties 39.684s 39.399s 33.421s
vector reduce 87.982s 75.644s 35.568s
vector stable sort 25.990s 31.133s 21.987s
write stdout 3.707s 3.206s 3.693s
total time 205.498s 182.042s 107.664s
count lines 970195200
count unique 295755152
29612263 5038456270
real time 3m57.228s 3m38.118s 1m50.092s
user 73m27.931s 70m13.776s 45m32.094s
sys 2m42.941s 2m29.584s 2m20.917s
Discoveries
I found the issue causing the vector variant to consume huge memory: the std::make_move_iterator path was making copies, not moves.
#pragma omp parallel for schedule(static, 1) reduction(+:num_lines)
for (int i = 0; i < nfiles; ++i) {
vec_str_int_type locvec;
num_lines += get_properties(fname[i], locvec);
#pragma omp critical
{
// Append local vector to propvec (consumes more memory)
// propvec.insert(
// propvec.end(),
// std::make_move_iterator(locvec.begin()),
// std::make_move_iterator(locvec.end())
// );
// Append local vector to propvec (faster)
auto it = locvec.begin();
auto it2 = locvec.end();
for (; it != it2; ++it)
propvec.emplace_back(std::move(*it));
}
}
Previously, vector reduce took more than 85 seconds. I refactored reduce_vec to move elements. Like before, this reduces the vector in place. The new version (see the Gist URL below) runs three stages to reclaim memory in parallel.
// Reduce a vector range (tally adjacent count fields of duplicate key names)
static void reduce_vec(auto& vec)
{
if (vec.size() > 0) {
auto it1 = vec.begin(); auto itr = it1; auto itw = it1;
auto it2 = vec.end();
auto curr = itr;
for (++itr; itr != it2; ++itr) {
if (itr->first == curr->first) {
curr->second += itr->second;
}
else {
if (itw != curr)
*itw = std::move(*curr);
++itw;
curr = itr;
}
}
if (itw != curr)
*itw = std::move(*curr);
vec.resize(std::distance(it1, ++itw));
}
}
Completion
The final versions llil4map_next.cc and llil4vec_next.cc live in a GitHub Gist, with a README file and everything needed to run.
Gregory Popovitch got me back on track. He provided a new type in string_cnt.h. I made two additional variants:
string_cnt.h - original + key length stored with count
string_cnt_f.h - fixed-length-only version
string_cnt_u.h - union std::array
The map/vec programs support small and long strings automatically (without recompiling) using string_cnt.h or string_cnt_u.h. The latter uses a union containing a std::array alongside the heap data. For small strings, better performance is achieved using the array.
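For a feel of the layout, here is a rough sketch of the union idea with hypothetical names and sizes (not Greg's actual string_cnt_u.h):
// Hypothetical sketch of a union-based key/count record: small keys live in
// the inline std::array, long keys on the heap; the length selects the member.
#include <array>
#include <cstdint>
#include <cstring>

struct key_cnt {
    static constexpr uint32_t INLINE_CAP = 20;   // assumed inline capacity
    union {
        std::array<char, INLINE_CAP> buf;        // small keys stored in place
        char* ptr;                               // long keys allocated on heap
    };
    uint32_t len = 0;
    uint32_t cnt = 0;

    void set(const char* s, uint32_t n) {
        len = n;
        if (n <= INLINE_CAP) {
            std::memcpy(buf.data(), s, n);       // no heap allocation
        } else {
            ptr = new char[n];
            std::memcpy(ptr, s, n);
        }
    }
    const char* data() const { return (len <= INLINE_CAP) ? buf.data() : ptr; }
    void free_mem() { if (len > INLINE_CAP) delete[] ptr; }
};
Small keys avoid a heap allocation entirely, which is where the array pays off.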
Finally, I decreased the gap between total time and real time by reclaiming memory faster.
// Free dynamically allocated memory in parallel.
#ifndef FIXED_LENGTH_ONLY
#pragma omp parallel for schedule(dynamic, 300000) num_threads(nthds)
for (size_t i = 0; i < propvec.size(); i++)
propvec[i].free();
#endif
I learned a lot during this journey; the challenge involved so much learning.
I made an attempt at dynamic buffer allocation in string_cnt_ub.h. The updated Gist on GitHub does block allocations instead of millions of tiny allocations (one per long string). The times were taken on an AMD Ryzen Threadripper 3970X machine with the input files already in the FS cache, i.e. results captured from the 2nd run. Also, Linux transparent hugepages is set to always and the L1 Stream HW Prefetcher is disabled in BIOS > Advanced > AMD CBS > CPU Common Options > Prefetcher settings.
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
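To give a feel for the block-allocation idea, here is a hypothetical arena sketch (not the actual string_cnt_ub.h): long keys are copied into large shared buffers so each one no longer needs its own heap allocation.
// Hypothetical arena sketch: copy long keys into 1 MiB blocks instead of
// giving each one its own tiny heap allocation (keys assumed smaller than
// the block size).
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>
#include <vector>

class string_arena {
    static constexpr std::size_t BLOCK_SIZE = 1 << 20;   // assumed block size
    std::vector<std::unique_ptr<char[]>> blocks_;
    std::size_t used_ = BLOCK_SIZE;                       // force first block

public:
    // Copy a key into the current block; grab a new block when full.
    const char* store(std::string_view s) {
        std::size_t need = s.size() + 1;                  // keep a trailing NUL
        if (used_ + need > BLOCK_SIZE) {
            blocks_.emplace_back(std::make_unique<char[]>(BLOCK_SIZE));
            used_ = 0;
        }
        char* dst = blocks_.back().get() + used_;
        std::memcpy(dst, s.data(), s.size());
        dst[s.size()] = '\0';
        used_ += need;
        return dst;
    }
};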
2024-07 update: (1) Increased the buffer size in string_cnt_ub.h. (2) Enhanced get_properties in llil4map_buf2.cc with a per-thread batch_size implementation, inspired by Gregory Popovitch's WIP llil.cc variant involving queues.
llil4map
llil4map start
use OpenMP ( old ) ( next ) ( buf ) (2024-07)
memory consumption 36.6 GB 17.5 GB 16.0 GB 15.0 GB
use boost sort
get properties 12.967s 11.527s 10.402s 7.705s
map to vector 3.244s 1.252s 1.211s 0.743s
vector stable sort 10.265s 8.708s 8.161s 6.546s
write stdout 1.410s 1.388s 1.386s 2.273s
total time 27.887s 22.877s 21.163s 17.268s
count lines 970195200
count unique 295755152
29612263 5038456270
real time 0m43.973s 0m24.872s 0m21.909s 0m17.683s
user 25m08.532s 22m58.120s 20m56.492s 16m48.601s
sys 0m55.602s 0m48.783s 0m11.551s 0m08.647s
Note: 2024-07 llil4map_buf2.cc
The times above are from binaries compiled using clang++ 17.0.6. Check g++ as well. For long strings, get properties and output may run faster using g++ 14.1.1.
$ ./llil4map_buf2 long* long* long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 7.322 secs
map to vector 0.763 secs
vector stable sort 6.558 secs
write stdout 1.893 secs
total time 16.538 secs
count lines 970195200
count unique 295755152
29612263 5038456270
llil4vec
llil4vec start
use OpenMP ( old ) ( next ) ( buf ) (2024-07)
memory consumption 62.4 GB 33.0 GB 29.2 GB 26.8 GB
use boost sort
get properties 32.659s 12.993s 12.186s 7.182s
sort properties 39.399s 33.421s 22.713s 20.518s
vector reduce 75.644s 35.568s 33.045s 23.868s
vector stable sort 31.133s 21.987s 19.845s 14.556s
write stdout 3.206s 3.693s 1.423s 3.108s
total time 182.042s 107.664s 89.214s 69.235s
count lines 970195200
count unique 295755152
29612263 5038456270
real time 3m38.118s 1m50.092s 1m30.368s 1m09.791s
user 70m13.776s 45m32.094s 47m28.779s 38m18.296s
sys 2m29.584s 2m20.917s 1m10.917s 0m59.318s
Ditto, g++ 14.1.1 results.
$ ./llil4vec_buf long* long* long* | cksum
llil4vec start
use OpenMP
use boost sort
get properties 7.758 secs
sort properties 19.951 secs
vector reduce 19.847 secs
vector stable sort 13.695 secs
write stdout 1.985 secs
total time 63.237 secs
count lines 970195200
count unique 295755152
29612263 5038456270
I updated Re^2: [OT] The Long List is Long resurrected with 2024-07 results, improving performance and reducing memory consumption.
Updates:
Integrated per-thread batch_size logic, improving get_properties in a new variant named llil4map_buf2.cc. The effect is fewer mutex locks/waits while populating the map (see the sketch after this list). Thanks, Gregory Popovitch.
Increased the buffer size in string_cnt_ub.h, improving sorting, vector reduce, and vector stable sort.
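A rough sketch of the batching idea with illustrative names (not the actual llil4map_buf2.cc): each thread collects key/count pairs in a local batch and takes the shared lock once per batch rather than once per key.
// Illustrative per-thread batching sketch: the shared map's mutex is taken
// once per batch instead of once per key.
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr std::size_t batch_size = 2048;   // assumed batch size

void flush_batch(std::vector<std::pair<std::string, long long>>& batch,
                 std::unordered_map<std::string, long long>& shared_map,
                 std::mutex& map_mutex)
{
    std::lock_guard<std::mutex> lock(map_mutex);   // one lock per batch
    for (auto& kv : batch)
        shared_map[std::move(kv.first)] += kv.second;
    batch.clear();
}

// Called from each worker thread (OpenMP setup omitted for brevity):
// append to the thread-local batch and flush when it reaches batch_size.
void add_key(std::vector<std::pair<std::string, long long>>& batch,
             std::unordered_map<std::string, long long>& shared_map,
             std::mutex& map_mutex, std::string key, long long count)
{
    batch.emplace_back(std::move(key), count);
    if (batch.size() >= batch_size)
        flush_batch(batch, shared_map, map_mutex);
}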
Re: [OT] The Long List is Long resurrected
by cavac (Prior) on Aug 09, 2024 at 14:17 UTC
I finally had a few minutes of time to think about the LLIL problem and looked into PostgreSQL for a solution. For timing reference, I use the data generation and Perl reference solution from eyepopslikeamosquito's post Rosetta Code: Long List is Long. With the 3 files, the reference script takes nearly 90 seconds on my puny work laptop:
$ time perl llil.pl big1.txt big2.txt big3.txt > big.tmp
llil start
get_properties : 6 secs
sort + output : 78 secs
total : 84 secs
real 1m28,236s
user 1m27,327s
sys 0m0,816s
LLiL is treated as a one-time problem, so my PostgreSQL solution is certainly not ideal. But if you think of it as a problem of "these are logs and they grow over time", it is something to consider.
The working directory needs to be readable/writable by the PostgreSQL server process. Then we can bulk import/export the data with the COPY command. Here is the SQL script:
CREATE TABLE llil_raw (
name text,
count bigint
);
-- INDEX makes it *way* slower
--CREATE INDEX llil_raw_idx ON llil_raw(name);
COPY llil_raw (name, count) FROM '/home/cavac/src/long_list_is_long/big1.txt' ( FORMAT TEXT);
COPY llil_raw (name, count) FROM '/home/cavac/src/long_list_is_long/big2.txt' ( FORMAT TEXT);
COPY llil_raw (name, count) FROM '/home/cavac/src/long_list_is_long/big3.txt' ( FORMAT TEXT);
CREATE TABLE llil_result (
name text,
count bigint
);
INSERT INTO llil_result (name, count)
    SELECT name, sum(count) AS total FROM llil_raw GROUP BY name ORDER BY total;
COPY llil_result (name, count) TO '/home/cavac/src/long_list_is_long/result.txt' ( FORMAT TEXT);
DROP TABLE llil_result;
DROP TABLE llil_raw;
And the result:
$ time psql -U Earthrise_Server -d Test_DB -f llil.sql
CREATE TABLE
COPY 3515200
COPY 3515200
COPY 3515200
CREATE TABLE
INSERT 0 10367359
COPY 10367359
DROP TABLE
DROP TABLE
real 0m19,675s
user 0m0,022s
sys 0m0,009s
Yes, it's a lot slower than the optimized one-time solutions. But in practice, as I imagine it, the data is probably produced over time; a database could keep the running aggregate in llil_result up to date on every insert, without having to parse a gazillion files every time. And it would be far more flexible as soon as you need filtering, or want to match/mash it with other data.
I also didn't do any optimization or parallelization (inherited subtables with exclusive primary key ranges and/or other partitioning tricks).
What I DID learn in this experiment is that COPY operations can be blazingly fast for larger datasets, especially compared to INSERTs. (Most of the time spent here was the INSERT INTO ... SELECT FROM generation of the result table.)
INSERT INTO llil_result (name, count)
    SELECT name, sum(count) AS total FROM llil_raw GROUP BY name ORDER BY total desc, name;
Results from Linux installed on a USB 3.0 drive.
$ /usr/bin/postgres -D /var/lib/pgsql/data -h localhost
$ time psql -d test_db -f llil.sql
CREATE TABLE
COPY 3515200
COPY 3515200
COPY 3515200
CREATE TABLE
INSERT 0 10367603
COPY 10367603
DROP TABLE
DROP TABLE
real 0m52.508s
user 0m0.001s
sys 0m0.001s
Results from the DB residing in memory, /tmp/data location.
$ /usr/bin/postgres -D /tmp/data -h localhost
$ time psql -d test_db -f llil.sql
CREATE TABLE
COPY 3515200
COPY 3515200
COPY 3515200
CREATE TABLE
INSERT 0 10367603
COPY 10367603
DROP TABLE
DROP TABLE
real 0m49.396s
user 0m0.001s
sys 0m0.002s
See also the SQLite solution. The results there were captured on Clear Linux; here, on Ubuntu Linux 24.04. The Sort::Packed module is used for sorting the output.
$ perl -I ~/perl5/lib/perl5/ llilsql.pl --threads=1 big{1,2,3}.txt | cksum
Tie::Hash::DBD SQLite database - start
fixed string length=12, threads=1, maps=32
get properties : 41.352 secs
pack properties : 5.897 secs
sort packed data : 1.164 secs
write stdout : 3.602 secs
total time : 52.017 secs
count lines : 10545600
count unique : 10367603
2956888413 93308427
$ perl -I ~/perl5/lib/perl5/ llilsql.pl --threads=8 big{1,2,3}.txt | cksum
Tie::Hash::DBD SQLite database - start
fixed string length=12, threads=8, maps=32
get properties : 5.572 secs
pack properties : 0.856 secs
sort packed data : 1.173 secs
write stdout : 0.719 secs
total time : 8.323 secs
count lines : 10545600
count unique : 10367603
2956888413 93308427
$ perl -I ~/perl5/lib/perl5/ llilsql.pl --threads=16 --maps=64 big{1,2,3}.txt | cksum
Tie::Hash::DBD SQLite database - start
fixed string length=12, threads=16, maps=64
get properties : 3.140 secs
pack properties : 0.533 secs
sort packed data : 1.175 secs
write stdout : 0.395 secs
total time : 5.247 secs
count lines : 10545600
count unique : 10367603
2956888413 93308427