Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Re: [OT] The Long List is Long resurrected

by marioroy (Prior)
on Apr 16, 2024 at 04:51 UTC ( [id://11158879]=note: print w/replies, xml ) Need Help??


in reply to [OT] The Long List is Long resurrected

The journey ended wonderfully.

Below, the gap between total time and real time is the GC time by the OS i.e. freeing dynamically allocated memory. C++ strings are dynamically allocated if greater than the small / Short String Optimization (SSO) limit; 16.

llil4map

$ time ./llil4map long* long* long* | cksum llil4map start use OpenMP ( old ) ( new ) memory consumption 36.6 GB 17.5 GB use boost sort get properties 12.967s 11.836s map to vector 3.244s 1.292s vector stable sort 10.265s 11.025s write stdout 1.410s 1.364s total time 27.887s 25.519s count lines 970195200 count unique 295755152 29612263 5038456270 real time 0m43.973s 0m27.461s user 25m08.532s 20m02.479s sys 0m55.602s 0m43.250s

llil4vec

$ time ./llil4vec long* long* long* | cksum llil4vec start use OpenMP ( old ) (mem-fix) ( new ) memory consumption 89.6 GB 62.4 GB 33.0 GB use boost sort get properties 48.133s 32.659s 12.993s sort properties 39.684s 39.399s 33.421s vector reduce 87.982s 75.644s 35.568s vector stable sort 25.990s 31.133s 21.987s write stdout 3.707s 3.206s 3.693s total time 205.498s 182.042s 107.664s count lines 970195200 count unique 295755152 29612263 5038456270 real time 3m57.228s 3m38.118s 1m50.092s user 73m27.931s 70m13.776s 45m32.094s sys 2m42.941s 2m29.584s 2m20.917s

Discoveries

I found the issue with the vector variant consuming huge memory. The std::make_move_iterator is making a copy, not move.

#pragma omp parallel for schedule(static, 1) reduction(+:num_lines) for (int i = 0; i < nfiles; ++i) { vec_str_int_type locvec; num_lines += get_properties(fname[i], locvec); #pragma omp critical { // Append local vector to propvec (consumes more memory) // propvec.insert( // propvec.end(), // std::make_move_iterator(locvec.begin()), // std::make_move_iterator(locvec.end()) // ); // Append local vector to propvec (faster) auto it = locvec.begin(); auto it2 = locvec.end(); for (; it != it2; ++it) propvec.emplace_back(std::move(*it)); } }

Previously, vector reduce took greater than 85 seconds. I refactored reduce_vec to do a move. Like before, this reduces the vector in-place. The new version (see Gist URL below) does three stages to reclaim memory in parallel.

// Reduce a vector range (tally adjacent count fields of duplicate key + names) static void reduce_vec(auto& vec) { if (vec.size() > 0) { auto it1 = vec.begin(); auto itr = it1; auto itw = it1; auto it2 = vec.end(); auto curr = itr; for (++itr; itr != it2; ++itr) { if (itr->first == curr->first) { curr->second += itr->second; } else { if (itw != curr) *itw = std::move(*curr); ++itw; curr = itr; } } if (itw != curr) *itw = std::move(*curr); vec.resize(std::distance(it1, ++itw)); } }

Completion

The final versions llil4map_next.cc and llil4vec_next.cc live in a GitHub Gist, with a README file and everything needed to run.

Gregory Popovitch got me back on track. He provided a new type string_cnt.h. I made two additional variants.

string_cnt.h - original + key length stored with count string_cnt_f.h - fixed-length-only version string_cnt_u.h - union std::array

The map/vec programs support small and long strings automatically (without recompiling), using string_cnt.h or string_cnt_u.h. The latter involves a union containing a std::array alongside the data. For small strings, better performance is reached using the array.

Finally, I decreased the gap between total time and real time by reclaiming memory faster.

// Free dynamically allocated memory in parallel. #ifndef FIXED_LENGTH_ONLY #pragma omp parallel for schedule(dynamic, 300000) num_threads(nthd +s) for (size_t i = 0; i < propvec.size(); i++) propvec[i].free(); #endif

I learned a lot during this journey. The challenge involved so much learning.

Replies are listed 'Best First'.
Re^2: [OT] The Long List is Long resurrected
by marioroy (Prior) on Apr 23, 2024 at 14:56 UTC

    I made an attempt at dynamic buffer allocation in string_cnt_ub.h. The updated Gist at GitHub does block allocations versus millions of tiny allocations (per each long string). The times were taken on an AMD Ryzen Threadripper 3970X machine with the input files already in FS cache i.e. captured results from the 2nd run. Also, Linux transparent hugepages is set to always and L1 Stream HW Prefetcher disabled in BIOS > Advanced > AMD CBS > CPU Common Options > Prefetcher settings.

    $ cat /sys/kernel/mm/transparent_hugepage/enabled [always] madvise never

    2024-07 update: (1) Increased buffer size in string_cnt_ub.h. (2) Enhanced get_properties in llil4map_buf2.cc. Per thread batch_size implementation, inspired by Gregory Popovitch's WIP llil.cc variant involving queues.

    llil4map

    llil4map start use OpenMP ( old ) ( next ) ( buf ) (2024-07) memory consumption 36.6 GB 17.5 GB 16.0 GB 15.0 GB use boost sort get properties 12.967s 11.527s 10.402s 7.705s map to vector 3.244s 1.252s 1.211s 0.743s vector stable sort 10.265s 8.708s 8.161s 6.546s write stdout 1.410s 1.388s 1.386s 2.273s total time 27.887s 22.877s 21.163s 17.268s count lines 970195200 count unique 295755152 29612263 5038456270 real time 0m43.973s 0m24.872s 0m21.909s 0m17.683s user 25m08.532s 22m58.120s 20m56.492s 16m48.601s sys 0m55.602s 0m48.783s 0m11.551s 0m08.647s Note: 2024-07 llil4map_buf2.cc

    The times above are from binaries compiled using clang++ 17.0.6. Check also, g++. For long strings, get properties and output may run faster using g++ 14.1.1.

    $ ./llil4map_buf2 long* long* long* | cksum llil4map start use OpenMP use boost sort get properties 7.322 secs map to vector 0.763 secs vector stable sort 6.558 secs write stdout 1.893 secs total time 16.538 secs count lines 970195200 count unique 295755152 29612263 5038456270

    llil4vec

    llil4vec start use OpenMP ( old ) ( next ) ( buf ) (2024-07) memory consumption 62.4 GB 33.0 GB 29.2 GB 26.8 GB use boost sort get properties 32.659s 12.993s 12.186s 7.182s sort properties 39.399s 33.421s 22.713s 20.518s vector reduce 75.644s 35.568s 33.045s 23.868s vector stable sort 31.133s 21.987s 19.845s 14.556s write stdout 3.206s 3.693s 1.423s 3.108s total time 182.042s 107.664s 89.214s 69.235s count lines 970195200 count unique 295755152 29612263 5038456270 real time 3m38.118s 1m50.092s 1m30.368s 1m09.791s user 70m13.776s 45m32.094s 47m28.779s 38m18.296s sys 2m29.584s 2m20.917s 1m10.917s 0m59.318s

    Ditto, g++ 14.1.1 results.

    $ ./llil4vec_buf long* long* long* | cksum llil4vec start use OpenMP use boost sort get properties 7.758 secs sort properties 19.951 secs vector reduce 19.847 secs vector stable sort 13.695 secs write stdout 1.985 secs total time 63.237 secs count lines 970195200 count unique 295755152 29612263 5038456270

      I updated Re^2: [OT] The Long List is Long resurrected with 2024-07 results, improving performance and reducing memory consumption.

      Updates:

      Integrated per thread batch_size logic improving get_properties in a new variant named llil4map_buf2.cc. The effect is lesser mutex/waits populating the map. Thanks, Gregory Popovitch.

      Increased the buffer size in string_cnt_ub.h improving sorting, vector reduce, and vector stable sort.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11158879]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (4)
As of 2025-02-14 14:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found