
Re^2: Rosetta Code: Long List is Long (faster - vec)

by eyepopslikeamosquito (Archbishop)
on Jan 09, 2023 at 06:02 UTC ( [id://11149443] )


in reply to Re: Rosetta Code: Long List is Long (faster)
in thread Rosetta Code: Long List is Long

The J script runs in ~5.7 sec using ~650 MB to produce exactly the same output, which makes it the fastest of all solutions so far.

The good news is that I've now found two different C++ versions that are faster. The bad news is that they're much uglier than my original llil2grt.cpp (which ran in about 6.2 seconds on Ubuntu). After getting nowhere trying to speed up llil2grt.cpp, I reluctantly decided that the simple and elegant hash_ret[word] -= count had to go.

First the timings on Ubuntu. Because there's quite a bit of natural variation between runs, I did three runs of each:

llil2vec start
get_properties CPU time : 3.06313 secs
emplace set sort CPU time : 0.923435 secs
write stdout CPU time : 1.392 secs
total CPU time : 5.37868 secs
total wall clock time : 6 secs

llil2vec start
get_properties CPU time : 3.1567 secs
emplace set sort CPU time : 0.970294 secs
write stdout CPU time : 1.22305 secs
total CPU time : 5.35015 secs
total wall clock time : 5 secs

llil2vec start
get_properties CPU time : 3.32019 secs
emplace set sort CPU time : 1.08277 secs
write stdout CPU time : 1.22461 secs
total CPU time : 5.62766 secs
total wall clock time : 5 secs

Ave CPU time: 5.5 secs (Memory use (Windows Private Bytes): 1,225,580K)

llil2vec (fixed string length=6) start
get_properties CPU time : 2.09353 secs
emplace set sort CPU time : 0.795144 secs
write stdout CPU time : 1.20994 secs
total CPU time : 4.09871 secs
total wall clock time : 4 secs

llil2vec (fixed string length=6) start
get_properties CPU time : 2.2078 secs
emplace set sort CPU time : 0.707252 secs
write stdout CPU time : 1.14867 secs
total CPU time : 4.06383 secs
total wall clock time : 4 secs

llil2vec (fixed string length=6) start
get_properties CPU time : 2.39225 secs
emplace set sort CPU time : 1.0033 secs
write stdout CPU time : 1.22765 secs
total CPU time : 4.62331 secs
total wall clock time : 4 secs

Ave CPU time: 4.3 secs (Memory use (Windows Private Bytes): 814,940K)

The first one still uses std::string so there are no limitations on word length. The second one is limited to a word length of six characters, and so works with big1.txt, while crashing spectacularly with long1.txt.

Funny, I'd originally forgotten that way back in 2014, when desperately trying to make a search program run a hundred times faster, I'd stumbled upon this classic quote:

That's not a minor disadvantage, we're talking about things being 50 to 100 times slower with a linked list. Compactness matters. Vectors are more compact than lists. And predictable usage patterns matter enormously. With a vector, you have to shove a lot of elements over, but caches are really really good at that.

When you traverse lists you keep doing random access ... so you are actually random accessing your memory and you are maximizing your cache misses, which is exactly the opposite of what you want.

-- Bjarne Stroustrup: Why you should avoid Linked Lists (3:30-5:00) (update: see also)

Back then, because each memory access into my (huge) 4GB lookup tables was essentially random, most memory accesses missed the L1 cache, missed the L2 cache, missed the L3 cache, then waited for the cache line to be loaded into all three caches from main memory, while often incurring a TLB cache miss, just to rub salt into the wounds.

Thankfully, there are no 4 GB lookup tables in this problem. But hashes (and linked lists) still tend to be mind-bogglingly slower than vectors on modern hardware, as indicated by Stroustrup's quote above. So I reluctantly decided that the simple and elegant hash_ret[word] -= count had to go.
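To see just how much contiguity matters, here's a tiny stand-alone micro-benchmark (illustrative only, not part of any llil program) that sums the same ten million integers stored in a std::vector and in a std::list; on typical hardware the vector traversal wins by a large factor because its elements are contiguous and prefetch-friendly, while the list is one cache miss per node.

// cache_demo.cpp - illustrative only: contiguous vs node-based traversal
// g++ -O2 -std=c++11 -o cache_demo cache_demo.cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

template <typename Container>
static double sum_seconds(const Container& c, int64_t& sum)
{
    auto t0 = std::chrono::steady_clock::now();
    sum = std::accumulate(c.begin(), c.end(), int64_t(0));
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    const std::size_t n = 10000000;
    std::vector<int64_t> v(n, 1);
    std::list<int64_t>   l(v.begin(), v.end());   // same values, one heap node per element

    int64_t s1 = 0, s2 = 0;
    double tv = sum_seconds(v, s1);   // contiguous: the hardware prefetcher loves this
    double tl = sum_seconds(l, s2);   // pointer-chasing: cache misses dominate

    std::cout << "vector sum " << s1 << " in " << tv << " s\n"
              << "list   sum " << s2 << " in " << tl << " s\n";
    return 0;
}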

Though I really hate this new std::vector-based solution, it does run faster than my earlier std::map-based one. This is just a first attempt; further improvements are possible. The llil2vec.cpp source code is shown below. Though there are two sets of timings, there is only one program, compiled with or without MAX_STR_LEN_L defined. Suggestions for improvements are welcome.
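Here, in a nutshell, is the idea (a stripped-down illustration, not the full program below): push (negated count, word) pairs into one big vector, sort by word so duplicates become adjacent, then sum adjacent counts into an already-ordered std::set.

// Distilled sketch of the llil2vec idea (illustrative; file I/O, error
// handling and the fixed-length string variant are omitted).
#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

int main()
{
    using pair_t = std::pair<long long, std::string>;   // (negated count, word)
    std::vector<pair_t> v = { {-3, "foo"}, {-1, "bar"}, {-2, "foo"} };   // stand-in for file input

    // Sort by word so equal words become adjacent.
    std::sort(v.begin(), v.end(),
              [](const pair_t& a, const pair_t& b) { return a.second < b.second; });

    // Sum adjacent counts; the std::set orders by (negated count, word) for free.
    std::set<pair_t> out;
    auto it = v.begin();
    pair_t last = *it;
    long long count = it->first;
    for (++it; it != v.end(); ++it) {
        if (it->second == last.second) { count += it->first; }
        else { out.emplace_hint(out.end(), count, last.second); last = *it; count = it->first; }
    }
    out.emplace_hint(out.end(), count, last.second);

    for (const auto& n : out) std::cout << n.second << '\t' << -n.first << '\n';   // foo 5, bar 1
    return 0;
}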

llil2vec.cpp

// llil2vec.cpp.
// Vector version of llil2grt.pl.
// g++ compile on Linux:
//    g++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp
// This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
// Example run:  llil2vec big1.txt big2.txt big3.txt >vec.tmp

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <cstdio>

#include <string>
#include <array>
#include <vector>
#include <set>

#include <algorithm>
#include <utility>
#include <iterator>

#include <iostream>
#include <fstream>
#include <sstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");

// -----------------------------------------------------------------------------
// Crude hack to see Windows Private Bytes in Task Manager by sleeping at
// program end (see also sleep hack at end of main)
// #include <chrono>
// #include <thread>
// -----------------------------------------------------------------------------

typedef long long llil_int_type;

// Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in length
// To use (limited length) fixed length strings uncomment the next line
// #define MAX_STR_LEN_L 6

#ifdef MAX_STR_LEN_L
using str_arr_type = std::array<char, MAX_STR_LEN_L + 1>;   // +1 for trailing '\0'
using str_int_type = std::pair<str_arr_type, llil_int_type>;
using int_str_type = std::pair<llil_int_type, str_arr_type>;
#else
using str_int_type = std::pair<std::string, llil_int_type>;
using int_str_type = std::pair<llil_int_type, std::string>;
#endif

using vec_str_int_type = std::vector<str_int_type>;
using vec_int_str_type = std::vector<int_str_type>;
using set_int_str_type = std::set<int_str_type>;

// Mimic the Perl get_properties subroutine ----------------------------
// Limit line length and use ANSI C functions to try to boost performance
#define MAX_LINE_LEN_L 255

static void get_properties(
    int               nfiles,    // in: the number of input files
    char*             fname[],   // in: the input file names
    vec_int_str_type& vec_ret)   // out: a vector of properties
{
    FILE* fh;
    char line[MAX_LINE_LEN_L+1];
    char* word;
    llil_int_type count;

    for (int i = 0; i < nfiles; ++i) {
        fh = ::fopen(fname[i], "r");
        if (fh == NULL) {
            std::cerr << "Error opening '" << fname[i] << "' : errno=" << errno << "\n";
            continue;
        }
        while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
            word  = ::strtok(line, "\t");
            count = ::atoll( ::strtok(NULL, "\n") );
#ifdef MAX_STR_LEN_L
            str_arr_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0', '\0' } };
            ::strcpy( fixword.data(), word );
            vec_ret.emplace_back( -count, fixword );
#else
            vec_ret.emplace_back( -count, word );
#endif
        }
        ::fclose(fh);
    }

    // Needs to be sorted by word for later sum of adjacent count fields to work
    std::sort(
        vec_ret.begin(), vec_ret.end(),
        [](const int_str_type& left, const int_str_type& right) { return left.second < right.second; }
    );
}

// -----------------------------------------------------------------------

int main(int argc, char* argv[])
{
    if (argc < 2) { std::cerr << "usage: llil2vec file1 file2 ... >out.txt\n"; return 1; }

#ifdef MAX_STR_LEN_L
    std::cerr << "llil2vec (fixed string length=" << MAX_STR_LEN_L << ") start\n";
#else
    std::cerr << "llil2vec start\n";
#endif
    time_t  tstart1 = ::time(NULL);
    clock_t cstart1 = ::clock();

    // Create the vector of properties
    vec_int_str_type vec_ret;
    get_properties(argc - 1, &argv[1], vec_ret);

    clock_t cend1 = ::clock();
    double ctaken1 = (double) (cend1 - cstart1) / (double)CLOCKS_PER_SEC;
    std::cerr << "get_properties CPU time : " << ctaken1 << " secs\n";

    clock_t cstart2 = ::clock();

    // To avoid calling sort(), create an inverted std::set container
    // Note: negative count gives desired ordering
    set_int_str_type myset;
    auto it = vec_ret.begin();
    int_str_type kv_last = *it;
    llil_int_type count = it->first;
    for (++it; it != vec_ret.end(); ++it) {
        if ( it->second == kv_last.second ) {
            count += it->first;
        }
        else {
            myset.emplace_hint( myset.end(), count, kv_last.second );
            kv_last = *it;
            count = it->first;
        }
    }
    myset.emplace_hint( myset.end(), count, kv_last.second );

    clock_t cend2s = ::clock();

    // Output the (already sorted) std::set - no sort() function required
    // Note: fix up negative count via -n.first
#ifdef MAX_STR_LEN_L
    for ( auto const& n : myset ) std::cout << n.second.data() << '\t' << -n.first << '\n';
#else
    for ( auto const& n : myset ) std::cout << n.second        << '\t' << -n.first << '\n';
#endif

    clock_t cend2 = ::clock();
    time_t  tend2 = ::time(NULL);
    long   ttaken   = static_cast<long>(::difftime(tend2, tstart1) + 0.5);
    double ctaken   = (double) (cend2  - cstart1) / (double)CLOCKS_PER_SEC;
    double ctaken2s = (double) (cend2s - cstart2) / (double)CLOCKS_PER_SEC;
    double ctaken2o = (double) (cend2  - cend2s)  / (double)CLOCKS_PER_SEC;

    std::cerr << "emplace set sort CPU time : " << ctaken2s << " secs\n";
    std::cerr << "write stdout CPU time     : " << ctaken2o << " secs\n";
    std::cerr << "total CPU time            : " << ctaken   << " secs\n";
    std::cerr << "total wall clock time     : " << ttaken   << " secs\n";

    // Hack to see Private Bytes in Windows Task Manager (uncomment next line so process doesn't exit too quickly)
    // std::this_thread::sleep_for(std::chrono::milliseconds(90000000));

    return 0;
}

Replies are listed 'Best First'.
Re^3: Rosetta Code: Long List is Long (faster - vec)
by marioroy (Prior) on Jan 09, 2023 at 10:31 UTC

    For the fixed string length build (I ran it several times), clang++ is ~0.3 seconds faster than g++. See also the J script result with AVX2 enabled (total time 3.48103 secs).

# gcc 12.2.1
$ g++ -o llil2vec -std=c++11 -Wall -O3 -march=native -mtune=skylake -falign-functions=32 -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 llil2vec.cpp
$ ./llil2vec big1.txt big2.txt big3.txt >out.txt
llil2vec (fixed string length=6) start
get_properties CPU time : 1.89609 secs
emplace set sort CPU time : 0.544972 secs
write stdout CPU time : 0.842451 secs
total CPU time : 3.28355 secs
total wall clock time : 4 secs

# clang version 15.0.4
$ clang++ -o llil2vec -std=c++11 -Wall -O3 -march=native -mtune=skylake -falign-functions=32 -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 llil2vec.cpp
$ ./llil2vec big1.txt big2.txt big3.txt >out.txt
llil2vec (fixed string length=6) start
get_properties CPU time : 1.67073 secs
emplace set sort CPU time : 0.474207 secs
write stdout CPU time : 0.828457 secs
total CPU time : 2.97343 secs
total wall clock time : 3 secs

      Thanks! Yes, I'm seeing clang++ slightly faster too, but only for the limited length fixed string case.

      Funny, I'd earlier tried clang++ for the non-fixed-string case, but it seemed to be very slightly slower than g++. That result, combined with Google suggesting that g++ usually produces slightly faster executables, caused me to give up on clang++.

      I also fiddled with some of the many compiler parameters but felt overwhelmed by the sheer number and complexity of them, so just stuck to the basic ones for now. The natural variations in timing of each run also make it hard to be certain.

      After googling, I was thinking of something like:

#!/bin/sh

# Clear the caches (this needs root permissions)
sync; echo 3 > /proc/sys/vm/drop_caches

# Use taskset for cpu affinity (but which cpu to choose?)
taskset 0x00000002 ./llil2vec big1.txt big2.txt big3.txt >f.tmp
sleep 5
taskset 0x00000002 ./llil2vec big1.txt big2.txt big3.txt >f.tmp
sleep 5
taskset 0x00000002 ./llil2vec big1.txt big2.txt big3.txt >f.tmp
      but was too gutless, given it needs root permissions and I didn't really know what I was doing.

        Yes, I'm seeing clang++ slightly faster too, but only for the limited length fixed string case.

        Same here.

        I also fiddled with some of the many compiler parameters but felt overwhelmed by the sheer number and complexity of them, so just stuck to the basic ones for now.

        I tried various parameters but saw no improvements. The following performs similarly, without the extra CFLAGS.

$ clang++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp
$ ./llil2vec big1.txt big2.txt big3.txt >out.txt
llil2vec (fixed string length=6) start
get_properties CPU time : 1.65488 secs
emplace set sort CPU time : 0.470765 secs
write stdout CPU time : 0.850356 secs
total CPU time : 2.97605 secs
total wall clock time : 3 secs
Re^3: Rosetta Code: Long List is Long (faster - vec - openmp)
by eyepopslikeamosquito (Archbishop) on Jan 10, 2023 at 12:12 UTC

    Thanks to the excellent feedback from marioroy I've been able to improve the code as follows:

    • Compile MAX_STR_LEN_L version with clang++, otherwise g++
    • For MAX_STR_LEN_L version, reduced size of short strings by one char by not null-terminating them (see the sketch after this list)
    • Use good old printf to write to stdout (seems a bit faster than streaming to std::cout)
    • Minor code cleanups (especially around creating the inverted std::set)
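    To illustrate the second and third bullets, here's a minimal stand-alone sketch (not the updated llil2vec.cpp itself) of fixed-length keys with no trailing '\0', compared via memcmp and printed with a length-limited printf format:

// Sketch: fixed-length keys without a terminating '\0' (illustrative only).
#include <algorithm>
#include <array>
#include <cstdio>
#include <cstring>
#include <vector>

constexpr std::size_t kMaxLen = 6;              // plays the role of MAX_STR_LEN_L
using fixed_word = std::array<char, kMaxLen>;   // no +1: no terminating '\0' stored

static fixed_word make_word(const char* s)
{
    fixed_word w{};                             // zero-fill ('\0' padding for short words)
    std::strncpy(w.data(), s, kMaxLen);         // copies at most kMaxLen chars
    return w;
}

int main()
{
    std::vector<fixed_word> words = { make_word("flag"), make_word("bear"), make_word("tiger") };

    // memcmp over the whole fixed array gives a total order; '\0' padding sorts short words first.
    std::sort(words.begin(), words.end(),
              [](const fixed_word& a, const fixed_word& b) {
                  return std::memcmp(a.data(), b.data(), kMaxLen) < 0;
              });

    // "%.*s" prints at most kMaxLen chars, so no terminator is required.
    for (const auto& w : words) std::printf("%.*s\n", static_cast<int>(kMaxLen), w.data());
    return 0;
}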

    New Timings for Long String Version

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties CPU time : 3.06708 secs
emplace set sort CPU time : 0.903617 secs
write stdout CPU time : 1.23459 secs
total CPU time : 5.20544 secs
total wall clock time : 5 secs

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties CPU time : 3.4832 secs
emplace set sort CPU time : 1.10209 secs
write stdout CPU time : 0.939805 secs
total CPU time : 5.52519 secs
total wall clock time : 6 secs

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties CPU time : 3.42605 secs
emplace set sort CPU time : 0.916222 secs
write stdout CPU time : 1.00049 secs
total CPU time : 5.343 secs
total wall clock time : 5 secs

> diff f.tmp vec.tmp

    This is an average time of 5.35 secs (compared to 5.5 secs in the original post).

    New Timings for Short String (MAX_STR_LEN_L=6) Version

> For some reason, short string version is faster when compiled with clang++
> (while long string version is not)
> clang++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 1.87217 secs
emplace set sort CPU time : 0.589238 secs
write stdout CPU time : 0.842179 secs
total CPU time : 3.30369 secs
total wall clock time : 3 secs

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 1.95909 secs
emplace set sort CPU time : 0.610479 secs
write stdout CPU time : 0.959859 secs
total CPU time : 3.52965 secs
total wall clock time : 4 secs

> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 1.86549 secs
emplace set sort CPU time : 0.608097 secs
write stdout CPU time : 0.857176 secs
total CPU time : 3.33087 secs
total wall clock time : 3 secs

> diff f.tmp vec.tmp

    This is an average time of 3.4 secs (compared to 4.3 secs in the original post).

    New version of llil2vec.cpp

    Interim OpenMP Versions

    Note: The interim versions below are just early (conservative) openmp-safe versions made thread-safe by expunging the dreaded strtok function.
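    For reference, the thread-safety problem with strtok is that it keeps hidden static state between calls. A minimal illustration (not the actual interim code) of parsing a "word\tcount" line without strtok, along the lines of the std::find parsing used later in llil4map.cpp:

// Sketch: parsing "word\tcount\n" without strtok's global state (thread-safe).
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

// Returns false if the line does not contain a tab.
static bool parse_line(const char* line, std::string& word, long long& count)
{
    const char* tab = std::strchr(line, '\t');   // pure function of its arguments, no hidden state
    if (tab == nullptr) return false;
    word.assign(line, tab - line);               // the word is everything before the tab
    count = std::atoll(tab + 1);                 // atoll stops at the trailing '\n' by itself
    return true;
}

int main()
{
    const char line[] = "tange\t9\n";            // stand-in for a line read with fgets
    std::string word;
    long long count = 0;
    if (parse_line(line, word, count))
        std::printf("%s -> %lld\n", word.c_str(), count);
    return 0;
}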

    The one with the most potential for multi-threading seems to be llil3vec.cpp, because it uses vectors, which can safely be filled lock-free via thread-local vec_loc vectors, as shown by marioroy in llil4vec.cpp.

    While replacing std::sort and std::stable_sort with __gnu_parallel versions to sort std::vectors is a no-brainer, it seems problematic to parallelise access to a global std::map (as used in llil3grt.cpp and llil3m.cpp below) because I think you'd need some sort of locking to safely allow multiple threads to update a shared global std::map. Ideas welcome.
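    One standard pattern (and roughly what marioroy's llil4map.cpp further down ends up doing) is to give each thread its own private map and only serialise the final merge. A minimal OpenMP sketch, illustrative only:

// Sketch: per-thread std::map, merged under '#pragma omp critical'.
// g++ -fopenmp -O2 -std=c++17 -o merge_demo merge_demo.cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main()
{
    std::vector<std::pair<std::string, long long>> input = {
        {"foo", 3}, {"bar", 1}, {"foo", 2}, {"baz", 7}
    };
    std::map<std::string, long long> total;

    #pragma omp parallel
    {
        std::map<std::string, long long> local;       // private to this thread: no locks needed here

        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(input.size()); ++i)
            local[input[i].first] += input[i].second;

        #pragma omp critical                          // one thread at a time merges into the shared map
        for (const auto& kv : local)
            total[kv.first] += kv.second;
    }

    for (const auto& kv : total)
        std::cout << kv.first << '\t' << kv.second << '\n';
    return 0;
}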

    Interim version of llil3vec.cpp

    Timings of llil3vec

> g++ -o llil3vec -std=c++20 -Wall -O3 llil3vec.cpp
> time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 1.47895 secs
emplace set sort CPU time : 0.592783 secs
write stdout CPU time : 0.832071 secs
total CPU time : 2.90392 secs
total wall clock time : 3 secs

real    0m3.217s
user    0m2.806s
sys     0m0.412s
> diff f.tmp vec.tmp

> g++ -o llil3vec -std=c++20 -fopenmp -Wall -O3 llil3vec.cpp
> time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 2.5809 secs
emplace set sort CPU time : 0.793355 secs
write stdout CPU time : 0.860855 secs
total CPU time : 4.23521 secs
total wall clock time : 3 secs

real    0m2.673s
user    0m4.073s
sys     0m0.465s
> diff f.tmp vec.tmp

    For the -fopenmp version note that the Unix real time is lower, but the user time is higher.

    Interim version of llil3grt.cpp

    Interim version of llil3m.cpp

    Timings

llil3grt (fixed string length=6) start - openmp version
get_properties CPU time : 2.34487 secs
emplace set sort CPU time : 0.942098 secs
write stdout CPU time : 0.855157 secs
total CPU time : 4.14223 secs
total wall clock time : 4 secs

real    0m4.752s
user    0m4.241s
sys     0m0.511s

$ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
llil3m (fixed string length=6) start - openmp version
get_properties CPU time : 2.40089 secs
vector copy CPU time : 0.544294 secs
vector stable sort CPU time : 5.84981 secs
write stdout CPU time : 0.805509 secs
total CPU time : 9.6006 secs
total wall clock : 4 secs

real    0m4.713s
user    0m9.238s
sys     0m0.679s

llil3m (fixed string length=6) start - openmp version
get_properties CPU time : 2.3031 secs
vector copy CPU time : 0.544613 secs
vector sort CPU time : 2.64047 secs
write stdout CPU time : 0.791186 secs
total CPU time : 6.27946 secs
total wall clock : 4 secs

real    0m4.182s
user    0m6.128s
sys     0m0.473s

$ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
llil3m (fixed string length=6) start
get_properties CPU time : 2.28072 secs
vector copy CPU time : 0.471632 secs
vector stable sort CPU time : 0.735042 secs
write stdout CPU time : 0.67514 secs
total CPU time : 4.16263 secs
total wall clock : 4 secs

real    0m4.470s
user    0m4.159s
sys     0m0.311s

$ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
llil3m (fixed string length=6) start
get_properties CPU time : 2.30618 secs
vector copy CPU time : 0.473185 secs
vector sort CPU time : 1.13081 secs
write stdout CPU time : 0.668702 secs
total CPU time : 4.57897 secs
total wall clock : 4 secs

real    0m4.889s
user    0m4.558s
sys     0m0.331s

$ time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 1.46864 secs
emplace set sort CPU time : 0.630914 secs
write stdout CPU time : 0.847912 secs
total CPU time : 2.9476 secs
total wall clock time : 3 secs

real    0m3.233s
user    0m2.852s
sys     0m0.381s

$ time ./llil3grt big1.txt big2.txt big3.txt >f.tmp
llil3grt (fixed string length=6) start
get_properties CPU time : 2.34418 secs
emplace set sort CPU time : 0.901415 secs
write stdout CPU time : 0.90025 secs
total CPU time : 4.14595 secs
total wall clock time : 5 secs

real    0m4.784s
user    0m4.232s
sys     0m0.551s

    References Added Later

    Updated: The source code for llil2vec.cpp was updated after the node was first created based on feedback from the replies. Added interim version llil3vec.cpp (with some marioroy improvements added). Plus interim versions of llil3grt.cpp and llil3m.cpp. Added a few missing #include <array> and reference for OpenMP Little Book (thanks marioroy).

      I'm not seeing a noticeable difference between the initial llil2vec demonstration and this one w.r.t. total time. So, I searched the web and tried fast_io with success. It requires a C++20 capable compiler.

      git clone --depth 1 https://github.com/cppfastio/fast_io

      Add the header file above other includes and comment out the cstdio include line.

#include <fast_io.h>
//#include <cstdio>

      Specify the path to the include folder, and compile as C++20.

clang++ -o llil2vec -std=c++20 -Wall -O3 llil2vec.cpp -I./fast_io/include
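      For the curious, here is a minimal stand-alone sketch of writing tab-separated output with fast_io, using only the calls that appear in llil4map.cpp further down this thread (::print and fast_io::concatln); treat it as an illustration rather than a tour of the library.

// Sketch: tab-separated output via fast_io (calls as used in llil4map.cpp below).
// clang++ -o fastio_demo -std=c++20 -O3 fastio_demo.cpp -I./fast_io/include
#include <fast_io.h>
#include <string>
#include <utility>
#include <vector>

int main()
{
    std::vector<std::pair<std::string, long long>> rows = { {"foo", 5}, {"bar", 1} };
    for (const auto& r : rows)
        ::print(fast_io::concatln(r.first, "\t", r.second));   // concatln builds the line (with newline), print writes it
    return 0;
}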

      Before (C++11) results:

llil2vec (fixed string length=6) start
get_properties CPU time : 1.62356 secs
emplace set sort CPU time : 0.451718 secs
write stdout CPU time : 0.923496 secs
total CPU time : 2.99881 secs
total wall clock time : 3 secs

      C++20 + fast_io results:

llil2vec (fixed string length=6) start
get_properties CPU time : 1.50394 secs
emplace set sort CPU time : 0.430272 secs
write stdout CPU time : 0.889239 secs
total CPU time : 2.82348 secs
total wall clock time : 3 secs

        Thanks so much for finding this github fast_io library and for posting such clear instructions on how to use it! This made it ridiculously easy for me to play around with it.

        So far I haven't found any statistically significant speed-ups from employing this library (hardly surprising given the mega man years of effort poured into g++ and clang++), but I will keep my eye on it.

      I tried the following for the interim version of llil3m.cpp.

      The overhead of running parallel with std::map is greater than I hoped for. Then I found parallel hashmap. From the link: "By using parallel SSE2 instructions, these hashmaps are able to look up items by checking 16 slots in parallel."
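      As a stand-alone taste of the library, here is a minimal word-count sketch using the simpler phmap::flat_hash_map type (illustrative only; llil4map.cc itself uses parallel_flat_hash_map with its template parameters spelled out):

// Sketch: counting words with greg7mdp's parallel-hashmap (header-only).
// g++ -o phmap_demo -std=c++17 -O3 phmap_demo.cpp -I ./parallel-hashmap
#include <parallel_hashmap/phmap.h>
#include <iostream>
#include <string>

int main()
{
    phmap::flat_hash_map<std::string, long long> counts;   // drop-in replacement for std::unordered_map

    for (const char* w : { "foo", "bar", "foo", "baz", "foo" }) {
        // emplace + fix-up mirrors the pattern used in llil4map's get_properties
        auto [it, inserted] = counts.emplace(w, 1);
        if (!inserted) ++it->second;
    }

    for (const auto& kv : counts)
        std::cout << kv.first << '\t' << kv.second << '\n';
    return 0;
}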

      Running:

$ time ./llil2grt big1.txt big2.txt big3.txt >out.txt
llil2grt start
get_properties CPU time : 3.55473 secs
emplace set sort CPU time : 0.998423 secs
write stdout CPU time : 0.854453 secs
total CPU time : 5.40764 secs
total wall clock time : 5 secs

real    0m5.973s
user    0m5.697s
sys     0m0.292s

$ NUM_THREADS=1 ./llil4map-no6-uno big1.txt big2.txt big3.txt >out.txt
llil4map start
use std::unordered_map
don't use OpenMP
use boost sort
get properties         3.368 secs
finish merging         1.202 secs
vector stable sort     0.898 secs
write stdout           0.261 secs
total time             5.730 secs

$ NUM_THREADS=1 ./llil4map-no6-std big1.txt big2.txt big3.txt >out.txt
llil4map start
use std::map
don't use OpenMP
use boost sort
get properties         3.328 secs
finish merging         0.657 secs
vector stable sort     0.816 secs
write stdout           0.268 secs
total time             5.069 secs

$ NUM_THREADS=1 ./llil4map-no6-flat big1.txt big2.txt big3.txt >out.txt
llil4map start
use phmap::parallel_flat_hash_map
don't use OpenMP
use boost sort
get properties         1.565 secs
map to vector          0.200 secs
vector stable sort     0.665 secs
write stdout           0.220 secs
total time             2.652 secs

      llil4map.cc

        Nice work Mario! As you might expect, I couldn't restrain myself from playing around with a few minor changes:

        • I gave boost::hash_range a try because there's less syntactic claptrap around the specialization of std::hash -- I've #if 0'ed it out though because it seems to be a touch slower than good old std::hash.
        • Generally, I pull a face whenever I see a function updating a non-const global variable, so I changed the get_properties interface to pass map_upd as a function parameter.
        • My usual simplification of parsing in get_properties by assuming the input file/s match the llil spec ^[a-z]+\t\d+$ (and without any long names when using MAX_STR_LEN :).

// llil4map.cpp
// https://perlmonks.com/?node_id=11149643
// OpenMP Little Book - https://nanxiao.gitbooks.io/openmp-little-book
//
// C++ version inspired by Mario's famous dualvar two-sort Perl solution.
// based on llil3m.cpp  https://perlmonks.com/?node_id=11149482
//    1. Run get_properties in parallel.
//    2. Capture time and diff via chrono.
//    3. Threads: Flush local vector periodically; speed-up for std::map.
//    4. Key words null-terminated for MAX_STR_LEN_L.
//    5. Concat strings using fast_io::concatln during output.
//    6. Support NUM_THREADS environment variable.
//    7. Added define for using boost's parallel sort.
//    8. Removed parallel get_properties, not helpful due to overhead.
//    9. Removed MAX_STR_LEN_L for simply std::map.
//    A. Replaced atoll with fast_atoll64.
//    B. Fast vector sorting - from llil5vec-tbb.
//    C. Re-enabled parallel get_properties.
//    D. Exit early if no work; fast_io tweak writing to output;
//       limit to 8 threads max for sorting.
//    E. Switch to parallel hashmap; using SSE2 instructions.
//    F. Support std::map, std::unordered_map, or parallel-hashmap via define.
//    G. Refactored OpenMP code to merge immediately.
//    H. Improved get_properties; set precision for timings.
//    I. Support fixed-length key words using basic_static_string.
//    J. Remove support for std::map and std::unordered_map.
//    K. Opt-in to use (limited length) fixed length strings.
//       Uncomment line #define MAX_STR_LEN_L and update value.
//    L. Improved the fixed length path with custom default_hash/eq functions.
//    M. Define str_type struct as std::array, overriding == and < operators.
//
// Obtain the fast_io library (required dependency):
//    git clone --depth=1 https://github.com/cppfastio/fast_io
//
// Obtain the parallel hashmap library (required dependency):
//    git clone --depth=1 https://github.com/greg7mdp/parallel-hashmap
//
// g++ compile on Linux: (boost header may need the -Wno-stringop-overflow gcc option)
//    g++ -o llil4map -std=c++20 -Wall -O3 llil4map.cpp -I ./fast_io/include -I ./parallel-hashmap
//    g++ -o llil4map-omp -std=c++20 -fopenmp -Wall -O3 llil4map.cpp -I ./fast_io/include -I ./parallel-hashmap
//
// This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
//
// clang++ compile: same args, without the -Wno-stringop-overflow option
// Seems to run slightly faster when compiled with clang++ instead of g++
//
// Obtain gen-llil.pl and gen-long-llil.pl from https://perlmonks.com/?node_id=11148681
//    perl gen-llil.pl big1.txt 200 3 1
//    perl gen-llil.pl big2.txt 200 3 1
//    perl gen-llil.pl big3.txt 200 3 1
//    perl gen-long-llil.pl long1.txt 600
//    perl gen-long-llil.pl long2.txt 600
//    perl gen-long-llil.pl long3.txt 600
//
// To make random input, obtain shuffle.pl from https://perlmonks.com/?node_id=11149800
//    perl shuffle.pl big1.txt >tmp && mv tmp big1.txt
//    perl shuffle.pl big2.txt >tmp && mv tmp big2.txt
//    perl shuffle.pl big3.txt >tmp && mv tmp big3.txt
//
// Example run:  llil4map big1.txt big2.txt big3.txt >out.txt
//               NUM_THREADS=3 llil4map-omp ...
//         Note: Lesser NUM_THREADS is preferred, else merge time increases
//               Start with MIN( CPU cores, 1 + log2( number of input files ) )
// -----------------------------------------------------------------------------

// Specify 0/1 to use boost's parallel sorting algorithm; faster than __gnu_parallel::sort.
// https://www.boost.org/doc/libs/1_81_0/libs/sort/doc/html/sort/parallel.html
// This requires the boost header files: e.g. devpkg-boost bundle on Clear Linux.
// Note: Another option is downloading and unpacking Boost locally.
// (no need to build it because the bits we use are header file only)
#define USE_BOOST_PARALLEL_SORT 1

#include <chrono>
#include <thread>

// The fast_io header must come after chrono, else build error:
//   "no member named 'concatln' in namespace 'fast_io'"
#include <fast_io.h>

#include <parallel_hashmap/phmap.h>

#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>

#include <string>
#include <array>
#include <vector>

#include <utility>
#include <iterator>
#include <execution>
#include <algorithm>

#include <boost/container_hash/hash.hpp>

#if USE_BOOST_PARALLEL_SORT > 0
#include <boost/sort/sort.hpp>
#endif

#ifdef _OPENMP
#include <omp.h>
#endif

#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");

// -----------------------------------------------------------------------------

typedef long long llil_int_type;

// All words in big1.txt, big2.txt, big3.txt are <= 6 chars in length.
// big.txt  max word length is 6
// long.txt max word length is 208
//
// Based on rough benchmarking, the short fixed string hack below is only
// worth trying for MAX_STR_LEN_L up to about 22.
// See also https://backlinko.com/google-keyword-study
//
// To use (limited length) fixed length strings uncomment the next line.
#define MAX_STR_LEN_L 6

#ifdef MAX_STR_LEN_L
struct str_type : std::array<char, MAX_STR_LEN_L> {
    bool operator==( const str_type& o ) const {
        return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) == 0;
    }
    bool operator<( const str_type& o ) const {
        return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) < 0;
    }
};
#else
struct str_type : std::basic_string<char> {
    bool operator==( const str_type& o ) const {
        return ::strcmp(this->data(), o.data()) == 0;
    }
    bool operator<( const str_type& o ) const {
        return ::strcmp(this->data(), o.data()) < 0;
    }
};
#endif

using str_int_type     = std::pair<str_type, llil_int_type>;
using vec_str_int_type = std::vector<str_int_type>;

// inject specialization of std::hash for str_type into namespace std
namespace std {
    template<> struct hash<str_type> {
        std::size_t operator()( str_type const& v ) const noexcept {
#if 0
            return boost::hash_range( v.cbegin(), v.cend() );
#else
            std::basic_string_view<char> bv {
                reinterpret_cast<const char*>(v.data()), v.size() * sizeof(char) };
            return std::hash<std::basic_string_view<char>>()(bv);
#endif
        }
    };
}

// create the parallel_flat_hash_map without internal mutexes
using map_str_int_type = phmap::parallel_flat_hash_map<
    str_type, llil_int_type,
    phmap::priv::hash_default_hash<str_type>,
    phmap::priv::hash_default_eq<str_type>,
    phmap::priv::Allocator<phmap::priv::Pair<const str_type, llil_int_type>>,
    8, phmap::NullMutex
>;

// fast_atoll64
// https://stackoverflow.com/questions/16826422/
// c-most-efficient-way-to-convert-string-to-int-faster-than-atoi
inline int64_t fast_atoll64( const char* str )
{
    int64_t val  = 0;
//  int     sign = 0;
//  if ( *str == '-' ) {
//      sign = 1, ++str;
//  }
    uint8_t digit;
    while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit;
//  return sign ? -val : val;
    return val;
}

// Mimic the Perl get_properties subroutine ----------------------------
// Limit line length and use ANSI C functions to try to boost performance
#define MAX_LINE_LEN_L 255
#define MAX_THREADS 32

static void get_properties(
    const char*        fname,     // in   : the input file name
    map_str_int_type&  map_upd    // inout: the map to be updated
)
{
    FILE* fh;
    std::array<char, MAX_LINE_LEN_L + 1> line;
    char* found;
    llil_int_type count;

    fh = ::fopen(fname, "r");
    if ( fh == NULL ) {
        std::cerr << "Error opening '" << fname << "' : " << strerror(errno) << "\n";
        return;
    }
    while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), fh ) != NULL ) {
        found = std::find( line.begin(), line.end(), '\t' );
        count = fast_atoll64( found + 1 );
        // Note: using emplace() is faster than map_upd[fixword] += count;
#ifdef MAX_STR_LEN_L
        str_type fixword {};  // {} initializes all elements of fixword to '\0'
        std::copy( line.begin(), found, fixword.begin() );
        const auto [it, success] = map_upd.emplace(fixword, count);
#else
        *found = '\0';
        const auto [it, success] = map_upd.emplace(str_type{ line.data() }, count);
#endif
        if ( !success ) it->second += count;
    }
    ::fclose(fh);
}

typedef std::chrono::high_resolution_clock high_resolution_clock;
typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::milliseconds milliseconds;

double elaspe_time( time_point cend, time_point cstart) {
    return double (
        std::chrono::duration_cast<milliseconds>(cend - cstart).count()
    ) * 1e-3;
}

// -----------------------------------------------------------------------

static map_str_int_type Map[MAX_THREADS];

int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cerr << "usage: llil4map file1 file2 ... >out.txt\n";
        return 1;
    }

    std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fixed);
#ifdef MAX_STR_LEN_L
    std::cerr << "llil4map (fixed string length=" << MAX_STR_LEN_L << ") start\n";
#else
    std::cerr << "llil4map start\n";
#endif
#ifdef _OPENMP
    std::cerr << "use OpenMP\n";
#else
    std::cerr << "don't use OpenMP\n";
#endif
#if USE_BOOST_PARALLEL_SORT == 0
    std::cerr << "don't use boost sort\n";
#else
    std::cerr << "use boost sort\n";
#endif

    time_point cstart1, cend1, cstart2, cend2, cstart3, cend3s, cend3;
    cstart1 = high_resolution_clock::now();

#ifdef _OPENMP
    // Determine the number of threads.
    const char* env_nthrs = std::getenv("NUM_THREADS");
    int nthrs = ( env_nthrs && strlen(env_nthrs) )
        ? ::atoi(env_nthrs)
        : std::thread::hardware_concurrency();
    if ( nthrs > MAX_THREADS ) nthrs = MAX_THREADS;
    omp_set_dynamic(false);
    omp_set_num_threads(nthrs);
#else
    int nthrs = 1;
#endif

    int nthrs_sort = ( std::thread::hardware_concurrency() < 12 )
        ? std::thread::hardware_concurrency()
        : 12;

    // Get the list of input files from the command line
    int    nfiles = argc - 1;
    char** fname  = &argv[1];

    int merge_id = 0;

    // Run parallel, depending on the number of threads
    if ( nthrs == 1 || nfiles == 1 ) {
        for ( int i = 0; i < nfiles; ++i ) {
            get_properties(fname[i], Map[0]);
        }
        cend1 = high_resolution_clock::now();
        double ctaken1 = elaspe_time(cend1, cstart1);
        std::cerr << "get properties      " << std::setw(8) << ctaken1 << " secs\n";
        cstart2 = high_resolution_clock::now();
    }
#ifdef _OPENMP
    else {
        merge_id = -1;
        int nfiles_processed = 0;
        int gettime_rendered = 0;

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();

            #pragma omp for nowait schedule(static, 1)
            for ( int i = 0; i < nfiles; ++i ) {
                get_properties(fname[i], Map[tid]);
                #pragma omp atomic
                nfiles_processed += 1;
            }

            #pragma omp critical
            {
                // report time for get properties
                if ( nfiles_processed == nfiles && !gettime_rendered ) {
                    cend1 = high_resolution_clock::now();
                    double ctaken1 = elaspe_time(cend1, cstart1);
                    std::cerr << "get properties      " << std::setw(8) << ctaken1 << " secs\n";
                    cstart2 = high_resolution_clock::now();
                    gettime_rendered = 1;
                }
                // merge array
                if ( Map[tid].size() > 0 ) {
                    if ( merge_id < 0 ) {
                        // container id other threads merge to
                        merge_id = tid;
                    }
                    else {
                        for ( auto const& x : Map[tid] ) {
                            // Map[merge_id][x.first] += x.second;
                            const auto [it, success] = Map[merge_id].emplace(x.first, x.second);
                            if ( !success ) it->second += x.second;
                        }
                        Map[tid].clear();
                    }
                }
            }
        }
    }
#endif

    if ( merge_id < 0 || Map[merge_id].size() == 0 ) {
        std::cerr << "No work, exiting...\n";
        return 1;
    }

    // Store the properties into a vector
    vec_str_int_type propvec( Map[merge_id].begin(), Map[merge_id].end() );
    Map[merge_id].clear();

    cend2 = high_resolution_clock::now();
    double ctaken2 = elaspe_time(cend2, cstart2);
    std::cerr << "finish merging      " << std::setw(8) << ctaken2 << " secs\n";

    cstart3 = high_resolution_clock::now();

    // Sort the vector by (count) in reverse order, (name) in lexical order
#if USE_BOOST_PARALLEL_SORT == 0
    std::sort(
        // Standard sort
        propvec.begin(), propvec.end(),
        [](const str_int_type& left, const str_int_type& right) {
            return left.second != right.second
                ? left.second > right.second
                : left.first  < right.first;
        }
    );
#else
    boost::sort::block_indirect_sort(
        // Parallel sort
        propvec.begin(), propvec.end(),
        [](const str_int_type& left, const str_int_type& right) {
            return left.second != right.second
                ? left.second > right.second
                : left.first  < right.first;
        },
        nthrs_sort
    );
#endif

    cend3s = high_resolution_clock::now();

    // Output the sorted vector
#ifdef MAX_STR_LEN_L
    for ( auto const& n : propvec )
        ::print(fast_io::concatln(fast_io::mnp::os_c_str(n.first.data(), MAX_STR_LEN_L), "\t", n.second));
#else
    for ( auto const& n : propvec )
        ::print(fast_io::concatln(n.first, "\t", n.second));
#endif

    cend3 = high_resolution_clock::now();

    double ctaken   = elaspe_time(cend3,  cstart1);
    double ctaken3s = elaspe_time(cend3s, cstart3);
    double ctaken3o = elaspe_time(cend3,  cend3s);

    std::cerr << "vector stable sort  " << std::setw(8) << ctaken3s << " secs\n";
    std::cerr << "write stdout        " << std::setw(8) << ctaken3o << " secs\n";
    std::cerr << "total time          " << std::setw(8) << ctaken   << " secs\n";

    // Hack to see Private Bytes in Windows Task Manager
    // (uncomment next line so process doesn't exit too quickly)
    // std::this_thread::sleep_for(milliseconds(9000));

    return 0;
}

        Parallel HashMap References Added Later

        • 2024 update: marioroy noted that std::mutex is slower in gcc-14 than gcc-13; he found another way to use spin locks via a small C++ class on the web.
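        The note above doesn't include the class itself; a typical small spin-lock class along those lines looks like the following (a generic illustration built on std::atomic_flag, not marioroy's actual code). Anything with lock()/unlock() satisfies BasicLockable, so std::lock_guard works with it.

// Illustrative spin lock for very short critical sections (generic sketch).
// g++ -O2 -std=c++17 -pthread -o spin_demo spin_demo.cpp
#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>

class spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() noexcept {
        while (flag_.test_and_set(std::memory_order_acquire)) { /* busy-wait */ }
    }
    void unlock() noexcept { flag_.clear(std::memory_order_release); }
};

spinlock  counter_lock;
long long counter = 0;

int main()
{
    auto work = [] {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<spinlock> guard(counter_lock);   // lock()/unlock() satisfy BasicLockable
            ++counter;
        }
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    std::cout << counter << '\n';   // 200000
    return 0;
}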

Re^3: Rosetta Code: Long List is Long (faster - vec)(faster++, and now parallel)
by Anonymous Monk on Jan 10, 2023 at 17:45 UTC
    The good news is... The bad news is...

    The good news is that I bring only good news! :) The modified J script is faster, more versatile, uses significantly less RAM, and has been tested with the 9.04 engine to parallelize the obvious low-hanging fruit for an additional speed boost.

NB. -----------------------------------------------------------
NB. --- This file is "llil4.ijs"
NB. --- Run as e.g.:
NB.
NB.       jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------

pattern =: 0 1

NB. ========> This line has a star in its right margin =======>   NB. *

args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args

NB. PAD_CHAR =: ' '

filter_CR =: #~ ~: & CR
make_more_space =: ' ' I. @ ((LF = ]) +. (TAB = ])) } ]
find_spaces =: I. @: = & ' '

read_file =: {{
    'fname pattern' =. y
    text =. make_more_space filter_CR fread fname
    selectors =. (|.!.0 , {:) >: find_spaces text
    width =. # pattern
    height =. width <. @ %~ # selectors
    append_diffs =. }: , 2& (-~/\)
    shuffle_dims =. (1 0 3 & |:) @ ((2, height, width, 1) & $)
    selectors =. append_diffs selectors
    selectors =. shuffle_dims selectors
    literal =. < @: (}:"1) @: (];. 0) & text "_1
    numeric =. < @: (0&".) @: (; @: (<;. 0)) & text "_1
    extract =. pattern & {
    using =. 1 & \
    or_maybe =. `
    ,(extract literal or_maybe numeric) using selectors
}}

read_many_files =: {{
    'fnames pattern' =. y
    ,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames      NB. *
}}

'words nums' =: read_many_files fn_in ; pattern
t1 =: (6!:1) ''   NB. time since engine start

'words nums' =: (~. words) ; words +//. nums                          NB. *
'words nums' =: (\: nums)& { &.:>"_1 words ; nums
words =: ; nums < @ /:~/. words
t2 =: (6!:1) ''   NB. time since engine start

text =: , words ,. TAB ,. (": ,. nums) ,. LF
erase 'words' ; 'nums'
text =: (#~ ~: & ' ') text
text fwrite fn_out
erase < 'text'
t3 =: (6!:1) ''   NB. time since engine start

echo 'Read and parse input:     ' , ": t1
echo 'Classify, sum, sort:      ' , ": t2 - t1
echo 'Format and write output:  ' , ": t3 - t2
echo 'Total time:               ' , ": t3
echo ''
echo 'Finished. Waiting for a key...'
stdin ''

exit 0

    The code above doesn't (yet) include any 9.04 features and runs OK with 9.03, but I found 9.04 slightly faster in general. I also found 9.04 a bit faster on Windows, which is the opposite of what I saw with 9.03 (the script ran faster on Linux); let's shrug that off as due to 9.04's beta status and/or my antique PC. Results below are for beta 9.04 on Windows 10 (RAM usage taken from Windows Task Manager):

> jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
Read and parse input:     1.501
Classify, sum, sort:      2.09
Format and write output:  1.318
Total time:               4.909

Finished. Waiting for a key...

Peak working set (memory): 376,456K

    There are 3 star-marked lines. To patch in the new 9.04 features and enable parallelization, replace them with these counterparts:

{{ for. i. 3 do. 0 T. 0 end. }} ''
,&.>/"2 (-#pattern) ]\ ,;(read_file @:(; &pattern)) t.'' "0 fnames
'words nums' =: (~.t.'' words) , words +//. t.'' nums

    As you can see, the 1st line replaces a comment, and the 2nd and 3rd lines require just minor touches. The 2nd line launches reading and parsing of the input files in parallel. The 3rd line parallelizes filtering for unique words and summing the numbers according to the word classification (kind of redundant double work, even as it was, as I see it). The 1st line starts 3 additional worker threads; my CPU doesn't have more cores anyway, and this script has no work easily dispatched to more workers. Then:

Read and parse input:     0.992
Classify, sum, sort:      1.849
Format and write output:  1.319
Total time:               4.16

    I would call my parallelization attempt, however crude it was, a success. Next is output for our second "official" dataset in this thread:

> jconsole.exe llil4.ijs long1.txt long2.txt long3.txt out.txt
Read and parse input:     1.329
Classify, sum, sort:      0.149
Format and write output:  0.009
Total time:               1.487

    ########################################################

    These are my results for the latest C++ solution (compiled using g++), to compare my efforts against:

$ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
llil2vec start
get_properties CPU time : 3.41497 secs
emplace set sort CPU time : 1.04229 secs
write stdout CPU time : 1.31578 secs
total CPU time : 5.77311 secs
total wall clock time : 5 secs

$ ./llil2vec_11149482 long1.txt long2.txt long3.txt >vec.tmp
llil2vec start
get_properties CPU time : 1.14889 secs
emplace set sort CPU time : 0.057158 secs
write stdout CPU time : 0.003307 secs
total CPU time : 1.20943 secs
total wall clock time : 2 secs

$ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 2.43187 secs
emplace set sort CPU time : 0.853877 secs
write stdout CPU time : 1.33636 secs
total CPU time : 4.62217 secs
total wall clock time : 5 secs

    I noticed that the new C++ code, supposed to be faster, is actually slower (compared to llil2grt) on the "long" dataset, one of the two "official" datasets used in this thread.

      More impressive analysis from our mysterious anonymonk.

      I noticed that new C++ code, supposed to be faster, is actually slower (compared to llil2grt) with "long" dataset from two "official" datasets used in this thread

      Good catch! I was so hyper-focused on trying to beat the lowest time for the simple "big1.txt big2.txt big3.txt" test case that most folks were using that I didn't even test the "long1.txt long2.txt long3.txt" test case. :( With a few seconds' thought it seems obvious that very long words will favour the hash-based approach over the vector-based one. I'm pleased in a way, because my hash-based solution is so much simpler and clearer than my vector-based one ... plus it encourages us to explore different approaches.

      I'm happy for folks to continue to claim the fastest time for the short string big1.txt big2.txt big3.txt test case. That's like the 100 metre sprint. We should add a marathon world record for fastest long1.txt long2.txt long3.txt long4.txt long5.txt long6.txt test case. And maybe a 1500 metre world record for big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt. :)

      Update: Timings for these three Olympic Events

      Run under Ubuntu on my modest laptop against the code here: Interim version of llil3grt.cpp and llil4vec-tbb.cpp.

      llil4vec easily won the 100 metre sprint; llil3grt easily won the marathon; llil4vec narrowly won the decider, the 1500 metres.

--------------------- 100 metre sprint -----------------------------------
$ ./llil3grt big1.txt big2.txt big3.txt >grt.tmp
llil3grt start
get_properties CPU time : 2.63733 secs
emplace set sort CPU time : 1.1302 secs
write stdout CPU time : 1.17949 secs
total CPU time : 4.94712 secs

$ NUM_THREADS=6 ./llil4vec-tbb big1.txt big2.txt big3.txt >4vec.tmp
llil4vec-tbb (fixed string length=6) start
use TBB
get properties time : 0.395106 secs
sort properties time : 0.325554 secs
emplace set sort time : 0.701055 secs
write stdout time : 0.541473 secs
total time : 1.96334 secs

--------------------------------- 1500 metre -----------------------------
$ ./llil3grt big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >big-long-grt.tmp
llil3grt start
get_properties CPU time : 3.13399 secs
emplace set sort CPU time : 1.0814 secs
write stdout CPU time : 1.30358 secs
total CPU time : 5.51907 secs

$ NUM_THREADS=6 ./llil4vec-tbb big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >big-long-4vec.tmp
llil4vec-tbb start
use TBB
get properties time : 1.05054 secs
sort properties time : 1.13865 secs
emplace set sort time : 1.16597 secs
write stdout time : 0.449603 secs
total time : 3.80495 secs

---------------------------------- marathon ------------------------------
$ ./llil3grt long1.txt long2.txt long3.txt long4.txt long5.txt long6.txt >long-long-3grt.tmp
llil3grt start
get_properties CPU time : 0.846717 secs
emplace set sort CPU time : 0.005067 secs
write stdout CPU time : 0.002561 secs
total CPU time : 0.854441 secs

$ NUM_THREADS=6 ./llil4vec-tbb long1.txt long2.txt long3.txt long4.txt long5.txt long6.txt >long-long-4vec.tmp
llil4vec-tbb start
use TBB
get properties time : 2.36002 secs
sort properties time : 1.28398 secs
emplace set sort time : 0.118658 secs
write stdout time : 0.001721 secs
total time : 3.76457 secs
---------------------------------------------

      What follows is mostly my attempt to describe and explain, i.e. a fair amount of prose, perhaps not worth a copper (insert name of local coin of lowest denomination).

      First of all, thanks for the warm words, marioroy; I'm glad you liked this cute toy. I think you'll be particularly interested in parallelization and how they do it in J. And BTW, the key to press is Ctrl-D, to "slurp" from STDIN.

      Plus, to keep this node from being completely off-topic among the Perl congregation -- some monks might also be interested to know that, if any FFI module for Perl is allowed for work or play, it's very easy to execute J phrases, calling the J DLL and reading/writing packed Perl scalars as J data. It's been a while since I had that fun, but see e.g. link [1] to get started; it is about calling J from the J REPL, but I remember those experiments were quite useful for learning. Whoever gets interested will easily find more tutorials.

      In my 1st script I actually posted code which is borderline buggy, and I have felt uneasy about it since the previous year:

text =: , freads " 0 fn_in        NB. wrong
text =: ; < @ freads " 0 fn_in    NB. fixed

      What's to the right of the comma in the 1st line reads into an N x M character table (the comma then converts it into a 1D list): number of files by longest file size, with short files padded. If the files were not the same size, the result would be erroneous. Further, if 100 files were 1 MB and the 101st file were 1 GB, then this would end with an "out of RAM" or similar error. Neither line is used in the modified code, but with this off my chest, let's move on.

      I noticed that the very slow part of my solution was (and still is) reading and parsing. Moreover, this trivial part (not the subsequent sorting!) is the main RAM consumer during the script run. Whatever I tried didn't help much. Then I decided: if that's so, let's at least make it potentially useful for the future.

      Instead of ad hoc code to consume strictly 2-column input, I wrote a function ("verb") to read from a (well-formed) TSV file with any number of columns, with a "pattern" provided to describe how to read the columns, as either text or numbers. In our case, the pattern is obviously 0 1. To keep the script interface consistent, this value is hard-coded, but it should of course be e.g. a CLI argument.

      Part of what makes a TSV file "well-formed" for this verb is that the "text" doesn't contain the ASCII-32 space character. You may have noticed I commented out the PAD_CHAR definition, because changing it won't be very useful. Some more substantial changes would be required to allow ASCII-32 (and then to pad with ASCII-0, perhaps), because the frame-fill for the literal data type is the space character, nothing else. I'm keeping the line commented out as a reminder.

      Which leads us to the essential question -- why keep textual data as a padded N x M table, instead of keeping strings of text (which may then be of any different lengths), each in their own "box"? A box in J is an atomic data type and a sort of pointer to whatever is inside. Data structures of any complexity are implemented with nested boxes.

      The answer is: because boxes are relatively expensive [2]. A few thousand (or 1e5 or so) are OK. An array of 10 million boxes (the total number of lines in our dataset) easily eats a few GB of RAM and is very slow.

      This also explains why I didn't use the DSV addon [3] from the standard library. For similar reasons (while I'm at explaining this stage of my decision making) I decided against symbols [4] (an irreducible and shared GST (global symbol table)? I don't know what to think of it).

      So, textual columns are parsed into padded 2D tables, but then I saw that slurping all files into a single string prior to parsing is not required at all, if the files are not too tiny. And indeed, the RAM requirements for the script dropped significantly. Further, parsing file by file makes the obvious choice as to where to try to parallelize.

      To finish with data input: in the modified script the CRs are filtered out unconditionally, and, also unconditionally, NOT injected on output, on both OSes. And BTW, fread and fwrite from the standard library have counterparts, freads and fwrites, to strip CRs. I don't know why, but they are slow compared to manual filtering -- the latter of the two is many seconds slower and wasn't even used in the original script. The modified script got rid of freads, also.

      Moving on to how sorting was done in the 1st script version -- actually with somewhat complicated and slow code, trying to sort word sub-ranges (per the same numeric count) without creating intermediate boxed arrays. In the modified script, this is replaced with just a short simple phrase. As long as there are fewer than a few million unique counts, this is OK w.r.t. RAM usage (but the 1st version would get very slow then anyway).

      This leads to a new verb in the 9.04 version (the Key dyad [5]). I thought using it would help to "pack" pairs of words and numbers and then do a kind of "packed sort" trick. It turned out to be a few times slower; I didn't investigate and abandoned this route.

      Moving to data output: requiring the user to provide "numeric width" or whatever it was called was quite silly on my part; computing the decimal logarithm of the maximum would have been OK :) And even so, it turned out that J formats a numeric column (as opposed to a row, i.e. a 1D array) automatically padded as required.

      My final thoughts: multithreading is a new feature in the 9.04 version, as I already said. The documentation says [6]:

      Even if your code does not create tasks, the system can take advantage of threadpool 0 to make primitives run faster

      At first I thought it meant that simply spawning workers would be enough for J to run in parallel, kind of how e.g. PDL does [7] (or tries to do). It turned out not to be so, either because the script is very simple, or because this automatic parallelization is not yet enabled in the beta. We'll see.

      1. https://code.jsoftware.com/wiki/Guides/DLLs/Calling_the_J_DLL
      2. https://www.jsoftware.com/help/jforc/performance_measurement__tip.htm#_Toc191734575
      3. https://code.jsoftware.com/wiki/Addons/tables/dsv
      4. https://code.jsoftware.com/wiki/Vocabulary/sco
      5. https://code.jsoftware.com/wiki/Vocabulary/slashdot#dyadic
      6. https://code.jsoftware.com/wiki/Vocabulary/tcapdot#dyadic
      7. https://metacpan.org/dist/PDL/view/Basic/Pod/ParallelCPU.pod
      

      E-hm... hello? Are we still playing?

      Long thread is long. I should have known better before even beginning to think of mumbling "paralle..li...", because of the massive barrage of fire that ensued immediately after :). I hope I chose the correct version to test, and my dated PC (the number of workers in particular) is a poor workbench, but:

$ time ./llil3vec_11149482 big1.txt big2.txt big3.txt >vec6.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 1.80036 secs
emplace set sort CPU time : 0.815786 secs
write stdout CPU time : 1.39233 secs
total CPU time : 4.00856 secs
total wall clock time : 4 secs

real    0m4.464s
user    0m3.921s
sys     0m0.445s

$ time ./llil3vec_11149482_omp big1.txt big2.txt big3.txt >vec6.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 2.06675 secs
emplace set sort CPU time : 0.94937 secs
write stdout CPU time : 1.40311 secs
total CPU time : 4.41929 secs
total wall clock time : 4 secs

real    0m3.861s
user    0m4.356s
sys     0m0.493s

      ----------------------------------------------

      Then I sent my workers into retirement to plant or pick flowers or something, i.e. (temporarily) reverted to single-threaded code, walked around (snow, no flowers), and made a few changes; here's a comparison of the previous and new versions:

$ time ../j903/bin/jconsole llil4.ijs big1.txt big2.txt big3.txt out_j.txt
Read and parse input:     1.6121
Classify, sum, sort:      2.23621
Format and write output:  1.36701
Total time:               5.21532

real    0m5.220s
user    0m3.934s
sys     0m1.195s

$ time ../j903/bin/jconsole llil5.ijs big1.txt big2.txt big3.txt out_j.txt
Read and parse input:     1.40811
Classify, sum, sort:      1.80736
Format and write output:  0.373946
Total time:               3.58941

real    0m3.594s
user    0m2.505s
sys     0m0.991s

$ diff vec6.tmp out_j.txt
$

      New script:

NB. -----------------------------------------------------------
NB. --- This file is "llil5.ijs"
NB. --- Run as e.g.:
NB.
NB.       jconsole.exe llil5.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------

pattern =: 0 1

args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args

filter_CR =: #~ ~: & CR

read_file =: {{
    'fname pattern' =. y
    text =. TAB, filter_CR fread fname
    text =. TAB (I. text = LF) } text
    selectors =. I. text = TAB
    width =. # pattern
    height =. width <. @ %~ # selectors
    append_diffs =. }: , 2& (-~/\)
    shuffle_dims =. (1 0 3 & |:) @ ((2, height, width, 1) & $)
    selectors =. append_diffs selectors
    selectors =. shuffle_dims selectors
    literal =. < @: (}."1) @: (];. 0) & text "_1
    numeric =. < @: (0&".) @: (; @: (<;. 0)) & text "_1
    extract =. pattern & {
    using =. 1 & \
    or_maybe =. `
    ,(extract literal or_maybe numeric) using selectors
}}

read_many_files =: {{
    'fnames pattern' =. y
    ,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames
}}

'words nums' =: read_many_files fn_in ; pattern
t1 =: (6!:1) ''   NB. time since engine start

idx =: i.~ words
nums =: idx +//. nums
idx =: nums </. ~. idx
words =: (/:~ @: { &words)&.> idx
erase < 'idx'
nums =: ~. nums
'words nums' =: (\: nums)& { &.:>"_1 words ; nums
t2 =: (6!:1) ''   NB. time since engine start

text =: ; words (, @: (,"1 _))&.(>`a:)"_1 TAB ,. (": ,. nums) ,. LF
erase 'words' ; 'nums'
text =: (#~ ~: & ' ') text
text fwrite fn_out
erase < 'text'
t3 =: (6!:1) ''   NB. time since engine start

echo 'Read and parse input:     ' , ": t1
echo 'Classify, sum, sort:      ' , ": t2 - t1
echo 'Format and write output:  ' , ": t3 - t2
echo 'Total time:               ' , ": t3

exit 0

echo ''
echo 'Finished. Waiting for a key...'
stdin ''

exit 0

      ----------------------------------------------

      I don't know the C++ "tools" chosen above ("modules" or whatever they are called) at all; is capping the length to "6" in the code just a matter of convenience? Could any longer value be hard-coded instead, like "12" or "25" (with the obvious other fixes)? I mean, no catastrophic (cubic, etc.) slow-down would happen to sorting after some threshold, forcing one to comment out the define and use the alternative set of "tools"? Perhaps input would be slower if cutting into unequally long words is expected?

      Anyway, here's output if the define is commented-out:

$ time ./llil3vec_11149482_no6 big1.txt big2.txt big3.txt >vec6.tmp
llil3vec start
get_properties CPU time : 3.19387 secs
emplace set sort CPU time : 0.996694 secs
write stdout CPU time : 1.32918 secs
total CPU time : 5.5198 secs
total wall clock time : 6 secs

real    0m6.088s
user    0m5.294s
sys     0m0.701s

$ time ./llil3vec_11149482_no6_omp big1.txt big2.txt big3.txt >vec6.tmp
llil3vec start
get_properties CPU time : 3.99891 secs
emplace set sort CPU time : 1.13424 secs
write stdout CPU time : 1.41112 secs
total CPU time : 6.54723 secs
total wall clock time : 4 secs

real    0m4.952s
user    0m6.207s
sys     0m0.842s

      Should my time be compared to these? :) (Blimey, my solution doesn't have to compete when participants are capped selectively (/grumpy_on around here).) Or I can use the powerful magic secret turbo mode:

turbo_mode_ON =: {{
  assert. 0 <: c =. 8 - {: $y
  h =. (3 (3!:4) 16be2), ,|."1 [3 (3!:4)"0 (4:,#,1:,#) y
  3!:2 h, ,y ,"1 _ c # ' '
}}

turbo_mode_OFF =: {{
  (5& }. @: (_8& (]\)) @: (2& (3!:1))) &.> y
}}

      Inject these definitions, and these couple lines immediately after t1 =: and before t2 =: respectively:

words =: turbo_mode_ON words
words =: turbo_mode_OFF words

      Aha:

$ time ../j903/bin/jconsole llil5.ijs big1.txt big2.txt big3.txt out_j.txt
Read and parse input:     1.40766
Classify, sum, sort:      1.24098
Format and write output:  0.455868
Total time:               3.1045

real    0m3.109s
user    0m1.815s
sys     0m1.210s

      (and no cutting to pieces of pre-defined equal length was used ...yet) :)
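      For C++ readers, the "turbo mode" trick has a direct analogue: pack each short word into a single 64-bit integer (most significant byte first, '\0'-padded) so that comparing and sorting words becomes plain integer comparison. A hedged sketch, assuming words of at most 8 bytes:

// Sketch: pack words of <= 8 chars into uint64_t so they sort as integers
// (big-endian packing preserves the lexicographic order of '\0'-padded words).
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

static uint64_t pack8(const std::string& w)   // assumes w.size() <= 8
{
    unsigned char buf[8] = {0};                           // '\0' padding for short words
    std::memcpy(buf, w.data(), std::min<std::size_t>(w.size(), 8));
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | buf[i];    // first char ends up in the high byte
    return v;
}

static std::string unpack8(uint64_t v)
{
    char buf[9] = {0};
    for (int i = 7; i >= 0; --i) { buf[i] = char(v & 0xff); v >>= 8; }
    return std::string(buf);                              // stops at the first '\0' pad
}

int main()
{
    std::vector<std::string> words = { "tange", "bbaoo", "arbre", "a" };
    std::vector<uint64_t> packed;
    for (const auto& w : words) packed.push_back(pack8(w));

    std::sort(packed.begin(), packed.end());              // integer sort == lexicographic word sort
    for (uint64_t v : packed) std::cout << unpack8(v) << '\n';
    return 0;
}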

      ----------------------------------------------

      I can revert to parallel reading/parsing anytime, with the effect shown in the parent node. As implemented, it was kind of passive; but files can be of unequal sizes, or there may be just one huge single file. I think a serious solution would probe inside to find newlines at approximate addresses, then pass the chunk coordinates to workers to parse in parallel.

      The puny 2-worker attempt to sort, in the parent, was just a kind of #pragma omp parallel sections... thing with 2 sections; no use sending bus-loads of workers and expecting quiet fans. There's some hope for "parallelizable primitives" in the release (not beta) 9.04 or later. Maybe it's a long time to wait. Or, if I could write code to merge 2 sorted arrays faster than the built-in primitive sorts either of the halves -- then, bingo, I'd have a multi-threaded fast merge-sort. But no success yet; the built-in sorts one large array faster, single-threaded.
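      In C++ the same two-thread experiment is easy to express: sort each half on its own thread, then join them with std::inplace_merge; whether that beats a single-threaded sort of the whole array is exactly the open question above. A minimal sketch (illustrative only):

// Sketch: sort two halves in parallel, then std::inplace_merge them.
// g++ -O2 -std=c++17 -pthread -o merge_sort_demo merge_sort_demo.cpp
#include <algorithm>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

int main()
{
    std::vector<int> v(1000000);
    std::mt19937 rng(42);
    for (int& x : v) x = int(rng());

    auto mid = v.begin() + v.size() / 2;

    // Each thread sorts a disjoint half, so no locking is needed.
    std::thread t1([&] { std::sort(v.begin(), mid); });
    std::thread t2([&] { std::sort(mid, v.end()); });
    t1.join(); t2.join();

    std::inplace_merge(v.begin(), mid, v.end());   // single-threaded merge of the two sorted runs

    std::cout << std::boolalpha << std::is_sorted(v.begin(), v.end()) << '\n';
    return 0;
}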

        What a delight for our Anonymonk friend to come back. Thanks to you, we tried parallel :).

        ... but files can be unequal sizes, or just one huge single file. I think serious solution would probe inside to find newlines at approx. addresses, then pass chunks coords to workers to parse in parallel.

        Chuma mentions 2,064 input files in the initial "Long list is long" thread. Processing a list of files in parallel suits this use case because there are so many files. Back in 2014, I wrote utilities that support both chunking and list modes: mce_grep and egrep.pl, via --chunk-level={auto|file|list}.

        llil5p.ijs

        I took llil5.ijs and created a parallel version named llil5p.ijs, based on code-bits from your prior post. The number of threads can be specified via the NUM_THREADS environment variable.

$ diff -u llil5.ijs llil5p.ijs
--- llil5.ijs   2023-01-18 09:25:14.041515970 -0600
+++ llil5p.ijs  2023-01-18 09:25:58.889669110 -0600
@@ -9,6 +9,12 @@
 
 pattern =: 0 1
 
+nthrs =: 2!:5 'NUM_THREADS'   NB. get_env NUM_THREADS
+{{
+  if. nthrs do. nthrs =: ".nthrs end.   NB. string to integer conversion
+  for. i. nthrs do. 0 T. 0 end.         NB. spin nthrs
+}} ''
+
 args =: 2 }. ARGV
 fn_out =: {: args
 fn_in =: }: args
@@ -44,7 +50,7 @@
 
 read_many_files =: {{
     'fnames pattern' =. y
-    ,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames
+    ,&.>/"2 (-#pattern) ]\ ,;(read_file @:(; &pattern)) t.'' "0 fnames
 }}
 
 'words nums' =: read_many_files fn_in ; pattern

        llil5tp.ijs

        Next, I applied the turbo update to the parallel version and named it llil5tp.ijs.

$ diff -u llil5p.ijs llil5tp.ijs
--- llil5p.ijs  2023-01-18 09:25:58.889669110 -0600
+++ llil5tp.ijs 2023-01-18 09:26:01.553736512 -0600
@@ -21,6 +21,16 @@
 
 filter_CR =: #~ ~: & CR
 
+turbo_mode_ON =: {{
+  assert. 0 <: c =. 8 - {: $y
+  h =. (3 (3!:4) 16be2), ,|."1 [3 (3!:4)"0 (4:,#,1:,#) y
+  3!:2 h, ,y ,"1 _ c # ' '
+}}
+
+turbo_mode_OFF =: {{
+  (5& }. @: (_8& (]\)) @: (2& (3!:1))) &.> y
+}}
+
 read_file =: {{
 
     'fname pattern' =. y
@@ -56,6 +66,7 @@
 
 'words nums' =: read_many_files fn_in ; pattern
 t1 =: (6!:1) ''   NB. time since engine start
 
+words =: turbo_mode_ON words
 idx =: i.~ words
 nums =: idx +//. nums
 
@@ -65,6 +76,7 @@
 
 nums =: ~. nums
 'words nums' =: (\: nums)& { &.:>"_1 words ; nums
+words =: turbo_mode_OFF words
 t2 =: (6!:1) ''   NB. time since engine start
 
 text =: ; words (, @: (,"1 _))&.(>`a:)"_1 TAB ,. (": ,. nums) ,. LF
