I made an attempt at dynamic buffer allocation in string_cnt_ub.h. The updated Gist on GitHub performs block allocations instead of millions of tiny allocations (one per long string). The timings were taken on an AMD Ryzen Threadripper 3970X machine with the input files already in the FS cache, i.e. the results shown are from the second run. Also, Linux transparent hugepages is set to always, and the L1 Stream HW Prefetcher is disabled in BIOS > Advanced > AMD CBS > CPU Common Options > Prefetcher settings.
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
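To illustrate the idea of block allocations versus one allocation per long string, here is a minimal sketch of a bump allocator. It is not the code from string_cnt_ub.h; the class name StringArena, the store method and the 1 MiB default block size are all made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>
#include <vector>

// Hypothetical bump allocator: hands out string storage from large blocks
// so that each long string does not need its own heap allocation.
class StringArena {
public:
    explicit StringArena(std::size_t block_size = 1 << 20)
        : block_size_(block_size) {
        assert(block_size_ > 0);
    }

    // Copy `s` into the current block (starting a new block when it does
    // not fit) and return a view of the stored copy. Views stay valid for
    // the lifetime of the arena because blocks are never moved or freed
    // before the arena itself is destroyed.
    std::string_view store(std::string_view s) {
        if (s.size() > remaining_) {
            // A string larger than block_size_ gets a block of its own size.
            std::size_t n = s.size() > block_size_ ? s.size() : block_size_;
            blocks_.push_back(std::make_unique<char[]>(n));
            cursor_ = blocks_.back().get();
            remaining_ = n;
        }
        char* dst = cursor_;
        if (!s.empty()) std::memcpy(dst, s.data(), s.size());
        cursor_ += s.size();
        remaining_ -= s.size();
        return std::string_view(dst, s.size());
    }

    // Number of heap allocations made so far.
    std::size_t block_count() const { return blocks_.size(); }

private:
    std::size_t block_size_;
    std::size_t remaining_ = 0;   // bytes left in the current block
    char* cursor_ = nullptr;      // next free byte in the current block
    std::vector<std::unique_ptr<char[]>> blocks_;
};
```

With this sketch, storing millions of long keys costs one heap allocation per block rather than one per key. The trade-off is that the unused tail of a block is wasted whenever the next string does not fit.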
2024-07 update: (1) Increased the buffer size in string_cnt_ub.h. (2) Enhanced get_properties in llil4map_buf2.cc with a per-thread batch_size implementation, inspired by Gregory Popovitch's WIP llil.cc variant involving queues.
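The per-thread batching idea can be sketched roughly as follows. This is not the code from llil4map_buf2.cc: the function names, the Entry and CountMap types, and the use of a plain std::unordered_map guarded by an OpenMP critical section are assumptions made for illustration. Each thread collects entries in a private batch and takes the lock once per batch instead of once per entry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using CountMap = std::unordered_map<std::string, long long>;
using Entry = std::pair<std::string, long long>;

// Merge one thread-local batch into the shared map, then empty the batch.
// The caller must hold the lock that guards `shared`.
inline void flush_batch(std::vector<Entry>& batch, CountMap& shared) {
    for (auto& e : batch) shared[std::move(e.first)] += e.second;
    batch.clear();
}

// Hypothetical stand-in for get_properties: sums the counts per key.
// Without OpenMP enabled the pragmas are ignored and the loop runs
// serially, producing the same result.
inline CountMap count_batched(const std::vector<Entry>& input,
                              std::size_t batch_size) {
    assert(batch_size > 0);
    CountMap shared;
    #pragma omp parallel
    {
        std::vector<Entry> batch;   // private to this thread
        batch.reserve(batch_size);
        #pragma omp for nowait
        for (long long i = 0; i < static_cast<long long>(input.size()); ++i) {
            batch.push_back(input[static_cast<std::size_t>(i)]);
            if (batch.size() >= batch_size) {
                #pragma omp critical(shared_map)
                flush_batch(batch, shared);
            }
        }
        // Flush whatever is left once this thread runs out of input.
        #pragma omp critical(shared_map)
        flush_batch(batch, shared);
    }
    return shared;
}
```

A larger batch_size means fewer lock acquisitions but more memory held per thread, so the value is something to tune by measurement.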
llil4map
llil4map start
use OpenMP              ( old )      ( next )     ( buf )      (2024-07)
memory consumption      36.6 GB      17.5 GB      16.0 GB      15.0 GB
use boost sort
get properties          12.967s      11.527s      10.402s      7.705s
map to vector           3.244s       1.252s       1.211s       0.743s
vector stable sort      10.265s      8.708s       8.161s       6.546s
write stdout            1.410s       1.388s       1.386s       2.273s
total time              27.887s      22.877s      21.163s      17.268s
count lines             970195200
count unique            295755152
29612263 5038456270
real time               0m43.973s    0m24.872s    0m21.909s    0m17.683s
user                    25m08.532s   22m58.120s   20m56.492s   16m48.601s
sys                     0m55.602s    0m48.783s    0m11.551s    0m08.647s
Note: the 2024-07 column is from llil4map_buf2.cc.
The times above are from binaries compiled with clang++ 17.0.6. It is worth checking g++ as well: for long strings, get properties and write stdout may run faster with g++ 14.1.1.
$ ./llil4map_buf2 long* long* long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 7.322 secs
map to vector 0.763 secs
vector stable sort 6.558 secs
write stdout 1.893 secs
total time 16.538 secs
count lines 970195200
count unique 295755152
29612263 5038456270
llil4vec
llil4vec start
use OpenMP              ( old )      ( next )     ( buf )      (2024-07)
memory consumption      62.4 GB      33.0 GB      29.2 GB      26.8 GB
use boost sort
get properties          32.659s      12.993s      12.186s      7.182s
sort properties         39.399s      33.421s      22.713s      20.518s
vector reduce           75.644s      35.568s      33.045s      23.868s
vector stable sort      31.133s      21.987s      19.845s      14.556s
write stdout            3.206s       3.693s       1.423s       3.108s
total time              182.042s     107.664s     89.214s      69.235s
count lines             970195200
count unique            295755152
29612263 5038456270
real time               3m38.118s    1m50.092s    1m30.368s    1m09.791s
user                    70m13.776s   45m32.094s   47m28.779s   38m18.296s
sys                     2m29.584s    2m20.917s    1m10.917s    0m59.318s
Likewise, here are the g++ 14.1.1 results.
$ ./llil4vec_buf long* long* long* | cksum
llil4vec start
use OpenMP
use boost sort
get properties 7.758 secs
sort properties 19.951 secs
vector reduce 19.847 secs
vector stable sort 13.695 secs
write stdout 1.985 secs
total time 63.237 secs
count lines 970195200
count unique 295755152
29612263 5038456270
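For readers unfamiliar with the llil4vec phase names, the sketch below shows one plausible reading of "sort properties", "vector reduce" and "vector stable sort". It is an assumption based on the phase names alone, not the actual llil4vec code: entries are sorted by key, runs of equal keys are collapsed by summing their counts, and the result is stable-sorted by count in descending order. The function names and the Entry type are hypothetical, and std::sort and std::stable_sort stand in for the boost sort routines the program reports using.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, long long>;

// Assumed meaning of "sort properties" + "vector reduce": sort by key,
// then collapse each run of equal keys into a single entry whose count
// is the sum of the run. Works in place.
inline void sort_and_reduce(std::vector<Entry>& v) {
    std::sort(v.begin(), v.end(),
              [](const Entry& a, const Entry& b) { return a.first < b.first; });
    std::size_t out = 0;   // number of reduced entries written so far
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (out > 0 && v[out - 1].first == v[i].first) {
            v[out - 1].second += v[i].second;
        } else {
            if (out != i) v[out] = std::move(v[i]);
            ++out;
        }
    }
    v.erase(v.begin() + static_cast<std::ptrdiff_t>(out), v.end());
}

// Assumed final ordering: highest count first. std::stable_sort keeps
// the key order from the previous step among entries with equal counts.
inline void order_by_count(std::vector<Entry>& v) {
    std::stable_sort(v.begin(), v.end(),
        [](const Entry& a, const Entry& b) { return a.second > b.second; });
}
```

Under this reading, the vector variant keeps every input record in memory until the reduce step runs, which would be consistent with llil4vec showing higher memory consumption than llil4map in the tables above.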