in reply to Re: Rosetta Code: Long List is Long (faster) in thread Rosetta Code: Long List is Long
J script runs in ~5.7 sec using ~650 MB to produce exactly same output.
Which makes it fastest of all solutions so far.
The good news is that I've now found two different C++ versions that are faster.
The bad news is that they're much uglier than
my original llil2grt.cpp (which ran in about 6.2 seconds on Ubuntu).
After getting nowhere trying to speed up llil2grt.cpp, I reluctantly decided that the simple and elegant
hash_ret[word] -= count
had to go.
First the timings on Ubuntu. Because there's quite a bit
of natural variation between runs, I did three runs of each:
llil2vec start
get_properties CPU time : 3.06313 secs
emplace set sort CPU time : 0.923435 secs
write stdout CPU time : 1.392 secs
total CPU time : 5.37868 secs
total wall clock time : 6 secs
llil2vec start
get_properties CPU time : 3.1567 secs
emplace set sort CPU time : 0.970294 secs
write stdout CPU time : 1.22305 secs
total CPU time : 5.35015 secs
total wall clock time : 5 secs
llil2vec start
get_properties CPU time : 3.32019 secs
emplace set sort CPU time : 1.08277 secs
write stdout CPU time : 1.22461 secs
total CPU time : 5.62766 secs
total wall clock time : 5 secs
Ave CPU time: 5.5 secs
(Memory use (Windows Private Bytes): 1,225,580K)
llil2vec (fixed string length=6) start
get_properties CPU time : 2.09353 secs
emplace set sort CPU time : 0.795144 secs
write stdout CPU time : 1.20994 secs
total CPU time : 4.09871 secs
total wall clock time : 4 secs
llil2vec (fixed string length=6) start
get_properties CPU time : 2.2078 secs
emplace set sort CPU time : 0.707252 secs
write stdout CPU time : 1.14867 secs
total CPU time : 4.06383 secs
total wall clock time : 4 secs
llil2vec (fixed string length=6) start
get_properties CPU time : 2.39225 secs
emplace set sort CPU time : 1.0033 secs
write stdout CPU time : 1.22765 secs
total CPU time : 4.62331 secs
total wall clock time : 4 secs
Ave CPU time: 4.3 secs
(Memory use (Windows Private Bytes): 814,940K)
The first one still uses std::string so there are no limitations on word length.
The second one is limited to a word length of six characters, and so works with
big1.txt, while crashing spectacularly with long1.txt.
Funny, I'd originally forgotten that
way back in 2014, when desperately trying to make a search program run a hundred times faster,
I'd stumbled upon this classic quote:
That's not a minor disadvantage, we're talking about things
being 50 to 100 times slower with a linked list.
Compactness matters. Vectors are more compact than lists.
And predictable usage patterns matter enormously.
With a vector, you have to shove a lot of elements over,
but caches are really really good at that.
When you traverse lists you keep doing random access ...
so you are actually random accessing your memory and you
are maximizing your cache misses, which is exactly the
opposite of what you want.
-- Bjarne Stroustrup: Why you should avoid Linked Lists (3:30-5:00) (update: see also)
Back then, because each memory access into my (huge) 4GB lookup tables was essentially random,
most memory accesses missed the L1 cache, missed the L2 cache, missed the L3 cache,
then waited for the cache line to be loaded into all three caches from main memory,
while often incurring a TLB cache miss, just to rub salt into the wounds.
Thankfully, there are no 4 GB lookup tables in this problem.
But hashes (and linked lists) still tend to be mind-bogglingly slower than vectors on modern hardware,
as indicated by Stroustrup's quote above.
So I reluctantly decided that the simple and elegant hash_ret[word] -= count had to go.
Though I really hate this new std::vector based solution, it does run faster than my earlier std::map based one.
This is just a first attempt, further improvements are possible.
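To make the change of approach concrete, here is a minimal standalone sketch of the sort-then-sum-adjacent idiom that replaces the hash accumulation (toy data and illustrative names, not the actual llil2vec code):

// sort-then-sum-adjacent sketch (toy data); compile e.g. g++ -std=c++17 -O3
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // (word, count) pairs as read from the input files; duplicate words allowed
    std::vector<std::pair<std::string, long long>> v = {
        {"foo", 3}, {"bar", 1}, {"foo", 2}, {"bar", 4}
    };
    // Sort by word so duplicates become adjacent ...
    std::sort(v.begin(), v.end(),
        [](const auto& a, const auto& b) { return a.first < b.first; });
    // ... then sum each run of equal words in one linear pass (no hash needed)
    std::vector<std::pair<std::string, long long>> totals;
    for (const auto& p : v) {
        if (!totals.empty() && totals.back().first == p.first)
            totals.back().second += p.second;
        else
            totals.push_back(p);
    }
    for (const auto& p : totals)
        std::cout << p.first << '\t' << p.second << '\n';   // bar 5, foo 5
    return 0;
}

In the real program the count is stored negated, so that the later emplace into a std::set yields the desired descending order without a second sort.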
The llil2vec.cpp source code is shown below. Though there are two sets of timings, there is only one program,
compiled with or without MAX_STR_LEN_L defined.
Suggestions for improvements welcome.
llil2vec.cpp
// llil2vec.cpp.
// Vector version of llil2grt.pl.
// g++ compile on Linux:
// g++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp
// This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
// Example run: llil2vec big1.txt big2.txt big3.txt >vec.tmp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <cstdio>
#include <string>
#include <array>
#include <vector>
#include <set>
#include <algorithm>
#include <utility>
#include <iterator>
#include <iostream>
#include <fstream>
#include <sstream>
static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");
// ----------------------------------------------------------------------------
// Crude hack to see Windows Private Bytes in Task Manager by sleeping at
// program end (see also sleep hack at end of main)
// #include <chrono>
// #include <thread>
// ----------------------------------------------------------------------------
typedef long long llil_int_type;
// Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in length
// To use (limited length) fixed length strings uncomment the next line
// #define MAX_STR_LEN_L 6
#ifdef MAX_STR_LEN_L
using str_arr_type = std::array<char, MAX_STR_LEN_L + 1>; // +1 for trailing '\0'
using str_int_type = std::pair<str_arr_type, llil_int_type>;
using int_str_type = std::pair<llil_int_type, str_arr_type>;
#else
using str_int_type = std::pair<std::string, llil_int_type>;
using int_str_type = std::pair<llil_int_type, std::string>;
#endif
using vec_str_int_type = std::vector<str_int_type>;
using vec_int_str_type = std::vector<int_str_type>;
using set_int_str_type = std::set<int_str_type>;
// Mimic the Perl get_properties subroutine ----------------------------
// Limit line length and use ANSI C functions to try to boost performance
#define MAX_LINE_LEN_L 255
static void get_properties(
int nfiles, // in: the number of input files
char* fname[], // in: the input file names
vec_int_str_type& vec_ret) // out: a vector of properties
{
FILE* fh;
char line[MAX_LINE_LEN_L+1];
char* word;
llil_int_type count;
for (int i = 0; i < nfiles; ++i) {
fh = ::fopen(fname[i], "r");
if (fh == NULL) {
std::cerr << "Error opening '" << fname[i] << "' : errno=" << errno << "\n";
continue;
}
while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
word = ::strtok(line, "\t");
count = ::atoll( ::strtok(NULL, "\n") );
#ifdef MAX_STR_LEN_L
str_arr_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0', '\0' } };
::strcpy( fixword.data(), word );
vec_ret.emplace_back( -count, fixword );
#else
vec_ret.emplace_back( -count, word );
#endif
}
::fclose(fh);
}
// Needs to be sorted by word for later sum of adjacent count fields to work
std::sort( vec_ret.begin(), vec_ret.end(),
[](const int_str_type& left, const int_str_type& right) { return left.second < right.second; }
);
}
// ---------------------------------------------------------------------
int main(int argc, char* argv[])
{
if (argc < 2) {
std::cerr << "usage: llil2vec file1 file2 ... >out.txt\n";
return 1;
}
#ifdef MAX_STR_LEN_L
std::cerr << "llil2vec (fixed string length=" << MAX_STR_LEN_L << ") start\n";
#else
std::cerr << "llil2vec start\n";
#endif
time_t tstart1 = ::time(NULL);
clock_t cstart1 = ::clock();
// Create the vector of properties
vec_int_str_type vec_ret;
get_properties(argc - 1, &argv[1], vec_ret);
clock_t cend1 = ::clock();
double ctaken1 = (double) (cend1 - cstart1) / (double)CLOCKS_PER_SEC;
std::cerr << "get_properties CPU time : " << ctaken1 << " secs\n";
clock_t cstart2 = ::clock();
// To avoid calling sort(), create an inverted std::set container
// Note: negative count gives desired ordering
set_int_str_type myset;
auto it = vec_ret.begin();
int_str_type kv_last = *it;
llil_int_type count = it->first;
for (++it; it != vec_ret.end(); ++it) {
if ( it->second == kv_last.second ) {
count += it->first;
}
else {
myset.emplace_hint( myset.end(), count, kv_last.second );
kv_last = *it;
count = it->first;
}
}
myset.emplace_hint( myset.end(), count, kv_last.second );
clock_t cend2s = ::clock();
// Output the (already sorted) std::set - no sort() function required
// Note: fix up negative count via -n.first
#ifdef MAX_STR_LEN_L
for ( auto const& n : myset ) std::cout << n.second.data() << '\t' << -n.first << '\n';
#else
for ( auto const& n : myset ) std::cout << n.second << '\t' << -n.first << '\n';
#endif
clock_t cend2 = ::clock();
time_t tend2 = ::time(NULL);
long ttaken = static_cast<long>(::difftime(tend2, tstart1) + 0.5);
double ctaken = (double) (cend2 - cstart1) / (double)CLOCKS_PER_SEC;
double ctaken2s = (double) (cend2s - cstart2) / (double)CLOCKS_PER_SEC;
double ctaken2o = (double) (cend2 - cend2s) / (double)CLOCKS_PER_SEC;
std::cerr << "emplace set sort CPU time : " << ctaken2s << " secs\n";
std::cerr << "write stdout CPU time : " << ctaken2o << " secs\n";
std::cerr << "total CPU time : " << ctaken << " secs\n";
std::cerr << "total wall clock time : " << ttaken << " secs\n";
// Hack to see Private Bytes in Windows Task Manager (uncomment next line so process doesn't exit too quickly)
// std::this_thread::sleep_for(std::chrono::milliseconds(90000000));
return 0;
}
Re^3: Rosetta Code: Long List is Long (faster - vec)
by marioroy (Prior) on Jan 09, 2023 at 10:31 UTC
For the fixed string length build (I ran it several times), clang++ comes in ~0.3 seconds faster than g++. See also the J script result with AVX2 enabled (total time 3.48103 secs).
# gcc 12.2.1
$ g++ -o llil2vec -std=c++11 -Wall -O3 -march=native -mtune=skylake -falign-functions=32 -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 llil2vec.cpp
$ ./llil2vec big1.txt big2.txt big3.txt >out.txt
llil2vec (fixed string length=6) start
get_properties CPU time : 1.89609 secs
emplace set sort CPU time : 0.544972 secs
write stdout CPU time : 0.842451 secs
total CPU time : 3.28355 secs
total wall clock time : 4 secs
# clang version 15.0.4
$ clang++ -o llil2vec -std=c++11 -Wall -O3 -march=native -mtune=skylake -falign-functions=32 -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 llil2vec.cpp
$ ./llil2vec big1.txt big2.txt big3.txt >out.txt
llil2vec (fixed string length=6) start
get_properties CPU time : 1.67073 secs
emplace set sort CPU time : 0.474207 secs
write stdout CPU time : 0.828457 secs
total CPU time : 2.97343 secs
total wall clock time : 3 secs
Thanks! Yes, I'm seeing clang++ slightly faster too, but only for the limited length fixed string case.
Funny, I'd earlier tried clang++ for the non-fixed-string case, but it seemed to be very slightly slower than g++.
That result, combined with Google suggesting that g++ usually produces slightly faster executables, caused me to give up on clang++.
I also fiddled with some of the many compiler parameters but felt overwhelmed by the sheer number and complexity of them,
so just stuck to the basic ones for now.
The natural variations in timing of each run also make it hard to be certain.
After googling, I was thinking of something like:
#!/bin/sh
# Clear the caches (this needs root permissions)
sync; echo 3 > /proc/sys/vm/drop_caches
# Use taskset for cpu affinity (but which cpu to choose?)
taskset 0x00000002 ./llil2vec big1.txt big2.txt big3.txt >f.tmp
sleep 5
taskset 0x00000002 ./llil2vec big1.txt big2.txt big3.txt >f.tmp
sleep 5
taskset 0x00000002 ./llil2vec big1.txt big2.txt big3.txt >f.tmp
but was too gutless, given it needs root permissions and I didn't really know what I was doing.
$ clang++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp
$ ./llil2vec big1.txt big2.txt big3.txt >out.txt
llil2vec (fixed string length=6) start
get_properties CPU time : 1.65488 secs
emplace set sort CPU time : 0.470765 secs
write stdout CPU time : 0.850356 secs
total CPU time : 2.97605 secs
total wall clock time : 3 secs
Re^3: Rosetta Code: Long List is Long (faster - vec - openmp)
by eyepopslikeamosquito (Archbishop) on Jan 10, 2023 at 12:12 UTC
Thanks to the excellent feedback from marioroy I've been able to improve the code as follows:
- Compile MAX_STR_LEN_L version with clang++, otherwise g++
- For MAX_STR_LEN_L version, reduced size of short strings by one char by not null-terminating them
- Use good old printf to write to stdout (seems a bit faster than streaming to std::cout)
- Minor code cleanups (especially around creating the inverted std::set)
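Regarding the second item, the idea is roughly the following; a minimal sketch assuming a fixed 6-byte key compared with memcmp, with illustrative names rather than the exact llil2vec types:

// fixed-length key without a trailing '\0'; compile e.g. g++ -std=c++17 -O3
#include <array>
#include <cstring>
#include <iostream>

constexpr std::size_t kMaxLen = 6;   // assumed maximum word length

// 6 raw bytes, no space reserved for '\0'; compare the bytes, not a C string
struct FixedWord : std::array<char, kMaxLen> {
    bool operator==(const FixedWord& o) const { return std::memcmp(data(), o.data(), kMaxLen) == 0; }
    bool operator< (const FixedWord& o) const { return std::memcmp(data(), o.data(), kMaxLen) <  0; }
};

static FixedWord make_word(const char* s) {
    FixedWord w {};                        // zero-fills, so shorter words are padded with '\0'
    std::strncpy(w.data(), s, kMaxLen);    // copies at most 6 bytes; no terminator required
    return w;
}

int main() {
    FixedWord a = make_word("tail"), b = make_word("tails");
    std::cout << (a < b) << ' ' << (a == b) << '\n';   // prints "1 0"
    return 0;
}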
New Timings for Long String Version
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties CPU time : 3.06708 secs
emplace set sort CPU time : 0.903617 secs
write stdout CPU time : 1.23459 secs
total CPU time : 5.20544 secs
total wall clock time : 5 secs
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties CPU time : 3.4832 secs
emplace set sort CPU time : 1.10209 secs
write stdout CPU time : 0.939805 secs
total CPU time : 5.52519 secs
total wall clock time : 6 secs
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties CPU time : 3.42605 secs
emplace set sort CPU time : 0.916222 secs
write stdout CPU time : 1.00049 secs
total CPU time : 5.343 secs
total wall clock time : 5 secs
> diff f.tmp vec.tmp
This is an average time of 5.35 secs (compared to 5.5 secs in the original post).
New Timings for Short String (MAX_STR_LEN_L=6) Version
> For some reason, short string version is faster when compiled with clang++
> (while long string version is not)
> clang++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 1.87217 secs
emplace set sort CPU time : 0.589238 secs
write stdout CPU time : 0.842179 secs
total CPU time : 3.30369 secs
total wall clock time : 3 secs
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 1.95909 secs
emplace set sort CPU time : 0.610479 secs
write stdout CPU time : 0.959859 secs
total CPU time : 3.52965 secs
total wall clock time : 4 secs
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 1.86549 secs
emplace set sort CPU time : 0.608097 secs
write stdout CPU time : 0.857176 secs
total CPU time : 3.33087 secs
total wall clock time : 3 secs
> diff f.tmp vec.tmp
This is an average time of 3.4 secs (compared to 4.3 secs in the original post).
New version of llil2vec.cpp
Interim OpenMP Versions
Note: The interim versions below are just early (conservative) openmp-safe versions made thread-safe by
expunging the dreaded strtok function.
The one with the most potential for multi-threading seems to be llil3vec.cpp because
it uses vectors which can be safely used lock-free in thread-local vec_loc vectors, as shown by
marioroy in llil4vec.cpp.
While replacing std::sort and std::stable_sort with the __gnu_parallel versions to sort std::vectors
is a no-brainer, it seems problematic to parallelise access to a global std::map (as used in llil3grt.cpp and llil3m.cpp below)
because I think you'd need some sort of locking to safely allow multiple threads to update a shared global std::map.
Ideas welcome.
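For what it's worth, the __gnu_parallel drop-in mentioned above looks roughly like this; a sketch assuming libstdc++ parallel mode and -fopenmp, not the actual llil3vec code:

// parallel sort sketch; compile e.g. g++ -std=c++17 -fopenmp -O3
#include <parallel/algorithm>   // libstdc++ parallel mode: __gnu_parallel::sort
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<std::string, long long>> v = {
        {"foo", 3}, {"bar", 7}, {"baz", 7}, {"qux", 1}
    };
    // Same comparator as the serial std::sort: count descending, then word ascending
    __gnu_parallel::sort(v.begin(), v.end(),
        [](const auto& a, const auto& b) {
            return a.second != b.second ? a.second > b.second : a.first < b.first;
        });
    return 0;
}

boost::sort::block_indirect_sort (used later in llil4map.cpp) takes the same comparator plus a thread count.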
Interim version of llil3vec.cpp
Timings of llil3vec
> g++ -o llil3vec -std=c++20 -Wall -O3 llil3vec.cpp
> time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 1.47895 secs
emplace set sort CPU time : 0.592783 secs
write stdout CPU time : 0.832071 secs
total CPU time : 2.90392 secs
total wall clock time : 3 secs
real 0m3.217s
user 0m2.806s
sys 0m0.412s
> diff f.tmp vec.tmp
> g++ -o llil3vec -std=c++20 -fopenmp -Wall -O3 llil3vec.cpp
> time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 2.5809 secs
emplace set sort CPU time : 0.793355 secs
write stdout CPU time : 0.860855 secs
total CPU time : 4.23521 secs
total wall clock time : 3 secs
real 0m2.673s
user 0m4.073s
sys 0m0.465s
> diff f.tmp vec.tmp
For the -fopenmp version note that the Unix real time is lower, but the user time is higher.
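The lock-free pattern referred to above (each thread fills its own vector, then the per-thread results are appended inside one short critical section) boils down to something like this; a sketch with toy data, not the actual llil3vec/llil4vec code:

// thread-local vectors merged once per thread; compile e.g. g++ -std=c++17 -fopenmp -O3
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<std::string, long long>> all;   // shared result
    #pragma omp parallel
    {
        // each thread parses into its own vector, so no locking on the hot path
        std::vector<std::pair<std::string, long long>> local;
        #pragma omp for nowait
        for (int i = 0; i < 1000; ++i)
            local.emplace_back("w" + std::to_string(i % 7), i);
        // one short critical section per thread to append the local results
        #pragma omp critical
        all.insert(all.end(), local.begin(), local.end());
    }
    return 0;
}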
Interim version of llil3grt.cpp
Interim version of llil3m.cpp
Timings
llil3grt (fixed string length=6) start
- openmp version
get_properties CPU time : 2.34487 secs
emplace set sort CPU time : 0.942098 secs
write stdout CPU time : 0.855157 secs
total CPU time : 4.14223 secs
total wall clock time : 4 secs
real 0m4.752s
user 0m4.241s
sys 0m0.511s
$ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
llil3m (fixed string length=6) start
- openmp version
get_properties CPU time : 2.40089 secs
vector copy CPU time : 0.544294 secs
vector stable sort CPU time : 5.84981 secs
write stdout CPU time : 0.805509 secs
total CPU time : 9.6006 secs
total wall clock : 4 secs
real 0m4.713s
user 0m9.238s
sys 0m0.679s
llil3m (fixed string length=6) start
- openmp version
get_properties CPU time : 2.3031 secs
vector copy CPU time : 0.544613 secs
vector sort CPU time : 2.64047 secs
write stdout CPU time : 0.791186 secs
total CPU time : 6.27946 secs
total wall clock : 4 secs
real 0m4.182s
user 0m6.128s
sys 0m0.473s
$ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
llil3m (fixed string length=6) start
get_properties CPU time : 2.28072 secs
vector copy CPU time : 0.471632 secs
vector stable sort CPU time : 0.735042 secs
write stdout CPU time : 0.67514 secs
total CPU time : 4.16263 secs
total wall clock : 4 secs
real 0m4.470s
user 0m4.159s
sys 0m0.311s
$ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
llil3m (fixed string length=6) start
get_properties CPU time : 2.30618 secs
vector copy CPU time : 0.473185 secs
vector sort CPU time : 1.13081 secs
write stdout CPU time : 0.668702 secs
total CPU time : 4.57897 secs
total wall clock : 4 secs
real 0m4.889s
user 0m4.558s
sys 0m0.331s
$ time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 1.46864 secs
emplace set sort CPU time : 0.630914 secs
write stdout CPU time : 0.847912 secs
total CPU time : 2.9476 secs
total wall clock time : 3 secs
real 0m3.233s
user 0m2.852s
sys 0m0.381s
$ time ./llil3grt big1.txt big2.txt big3.txt >f.tmp
llil3grt (fixed string length=6) start
get_properties CPU time : 2.34418 secs
emplace set sort CPU time : 0.901415 secs
write stdout CPU time : 0.90025 secs
total CPU time : 4.14595 secs
total wall clock time : 5 secs
real 0m4.784s
user 0m4.232s
sys 0m0.551s
References Added Later
Updated: The source code for llil2vec.cpp was updated after the node was first created
based on feedback from the replies. Added interim version llil3vec.cpp (with some marioroy improvements added).
Plus interim versions of llil3grt.cpp and llil3m.cpp.
Added a few missing #include <array> and reference for OpenMP Little Book (thanks marioroy).
git clone --depth 1 https://github.com/cppfastio/fast_io
Add the header file above other includes and comment out the cstdio include line.
#include <fast_io.h>
//#include <cstdio>
Specify the path to the include folder. Also C++20.
clang++ -o llil2vec -std=c++20 -Wall -O3 llil2vec.cpp -I./fast_io/include
Before (C++11) results:
llil2vec (fixed string length=6) start
get_properties CPU time : 1.62356 secs
emplace set sort CPU time : 0.451718 secs
write stdout CPU time : 0.923496 secs
total CPU time : 2.99881 secs
total wall clock time : 3 secs
C++20 + fast_io results:
llil2vec (fixed string length=6) start
get_properties CPU time : 1.50394 secs
emplace set sort CPU time : 0.430272 secs
write stdout CPU time : 0.889239 secs
total CPU time : 2.82348 secs
total wall clock time : 3 secs
Thanks so much for finding this GitHub fast_io library and for posting such clear instructions on how to use it!
This made it ridiculously easy for me to play around with it.
So far I haven't found any statistically significant speed-ups from employing this library (hardly surprising given the mega man-years
of effort poured into g++ and clang++), but I will keep my eye on it.
I tried the following for the interim version of llil3m.cpp.
The overhead of running in parallel with std::map is greater than I hoped for. Then I found the parallel hashmap. From the link: "By using parallel SSE2 instructions, these hashmaps are able to look up items by checking 16 slots in parallel."
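As a point of reference, the parallel hashmap is close to a drop-in replacement for std::unordered_map; a minimal sketch, assuming the greg7mdp/parallel-hashmap headers are on the include path:

// minimal phmap usage sketch; compile e.g. g++ -std=c++17 -O3 -I ./parallel-hashmap
#include <parallel_hashmap/phmap.h>
#include <iostream>
#include <string>

int main() {
    // same basic interface as std::unordered_map
    phmap::flat_hash_map<std::string, long long> counts;
    counts["foo"] += 3;
    counts["bar"] += 7;
    counts["foo"] += 2;
    for (const auto& kv : counts)
        std::cout << kv.first << '\t' << kv.second << '\n';
    return 0;
}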
Running:
$ time ./llil2grt big1.txt big2.txt big3.txt >out.txt
llil2grt start
get_properties CPU time : 3.55473 secs
emplace set sort CPU time : 0.998423 secs
write stdout CPU time : 0.854453 secs
total CPU time : 5.40764 secs
total wall clock time : 5 secs
real 0m5.973s
user 0m5.697s
sys 0m0.292s
$ NUM_THREADS=1 ./llil4map-no6-uno big1.txt big2.txt big3.txt >out.txt
llil4map start
use std::unordered_map
don't use OpenMP
use boost sort
get properties 3.368 secs
finish merging 1.202 secs
vector stable sort 0.898 secs
write stdout 0.261 secs
total time 5.730 secs
$ NUM_THREADS=1 ./llil4map-no6-std big1.txt big2.txt big3.txt >out.txt
llil4map start
use std::map
don't use OpenMP
use boost sort
get properties 3.328 secs
finish merging 0.657 secs
vector stable sort 0.816 secs
write stdout 0.268 secs
total time 5.069 secs
$ NUM_THREADS=1 ./llil4map-no6-flat big1.txt big2.txt big3.txt >out.txt
llil4map start
use phmap::parallel_flat_hash_map
don't use OpenMP
use boost sort
get properties 1.565 secs
map to vector 0.200 secs
vector stable sort 0.665 secs
write stdout 0.220 secs
total time 2.652 secs
llil4map.cc
// llil4map.cpp
// https://perlmonks.com/?node_id=11149643
// OpenMP Little Book - https://nanxiao.gitbooks.io/openmp-little-book
//
// C++ version inspired by Mario's famous dualvar two-sort Perl solution.
// based on llil3m.cpp https://perlmonks.com/?node_id=11149482
// 1. Run get_properties in parallel.
// 2. Capture time and diff via chrono.
// 3. Threads: Flush local vector periodically; speed-up for std::map.
// 4. Key words null-terminated for MAX_STR_LEN_L.
// 5. Concat strings using fast_io::concatln during output.
// 6. Support NUM_THREADS environment variable.
// 7. Added define for using boost's parallel sort.
// 8. Removed parallel get_properties, not helpful due to overhead.
// 9. Removed MAX_STR_LEN_L for simply std::map.
// A. Replaced atoll with fast_atoll64.
// B. Fast vector sorting - from llil5vec-tbb.
// C. Re-enabled parallel get_properties.
// D. Exit early if no work; fast_io tweak writing to output;
// limit to 8 threads max for sorting.
// E. Switch to parallel hashmap; using SSE2 instructions.
// F. Support std::map, std::unordered_map, or parallel-hashmap via define.
// G. Refactored OpenMP code to merge immediately.
// H. Improved get_properties; set precision for timings.
// I. Support fixed-length key words using basic_static_string.
// J. Remove support for std::map and std::unordered_map.
// K. Opt-in to use (limited length) fixed length strings.
// Uncomment line #define MAX_STR_LEN_L and update value.
// L. Improved the fixed length path with custom default_hash/eq functions.
// M. Define str_type struct as std::array, overriding == and < operators.
//
// Obtain the fast_io library (required dependency):
// git clone --depth=1 https://github.com/cppfastio/fast_io
//
// Obtain the parallel hashmap library (required dependency):
// git clone --depth=1 https://github.com/greg7mdp/parallel-hashmap
//
// g++ compile on Linux: (boost header may need the -Wno-stringop-overflow gcc option)
// g++ -o llil4map -std=c++20 -Wall -O3 llil4map.cpp -I ./fast_io/include -I ./parallel-hashmap
// g++ -o llil4map-omp -std=c++20 -fopenmp -Wall -O3 llil4map.cpp -I ./fast_io/include -I ./parallel-hashmap
//
// This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
//
// clang++ compile: same args, without the -Wno-stringop-overflow option
// Seems to run slightly faster when compiled with clang++ instead of g++
//
// Obtain gen-llil.pl and gen-long-llil.pl from https://perlmonks.com/?node_id=11148681
// perl gen-llil.pl big1.txt 200 3 1
// perl gen-llil.pl big2.txt 200 3 1
// perl gen-llil.pl big3.txt 200 3 1
// perl gen-long-llil.pl long1.txt 600
// perl gen-long-llil.pl long2.txt 600
// perl gen-long-llil.pl long3.txt 600
//
// To make random input, obtain shuffle.pl from https://perlmonks.com/?node_id=11149800
// perl shuffle.pl big1.txt >tmp && mv tmp big1.txt
// perl shuffle.pl big2.txt >tmp && mv tmp big2.txt
// perl shuffle.pl big3.txt >tmp && mv tmp big3.txt
//
// Example run: llil4map big1.txt big2.txt big3.txt >out.txt
// NUM_THREADS=3 llil4map-omp ...
// Note: Lesser NUM_THREADS is preferred, else merge time increases
// Start with MIN( CPU cores, 1 + log2( number of input files ) )
// ----------------------------------------------------------------------------
// Specify 0/1 to use boost's parallel sorting algorithm; faster than __gnu_parallel::sort.
// https://www.boost.org/doc/libs/1_81_0/libs/sort/doc/html/sort/parallel.html
// This requires the boost header files: e.g. devpkg-boost bundle on Clear Linux.
// Note: Another option is downloading and unpacking Boost locally.
// (no need to build it because the bits we use are header file only)
#define USE_BOOST_PARALLEL_SORT 1
#include <chrono>
#include <thread>
// The fast_io header must come after chrono, else build error:
// "no member named 'concatln' in namespace 'fast_io'"
#include <fast_io.h>
#include <parallel_hashmap/phmap.h>
#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <string>
#include <array>
#include <vector>
#include <utility>
#include <iterator>
#include <execution>
#include <algorithm>
#include <boost/container_hash/hash.hpp>
#if USE_BOOST_PARALLEL_SORT > 0
#include <boost/sort/sort.hpp>
#endif
#ifdef _OPENMP
#include <omp.h>
#endif
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");
// ----------------------------------------------------------------------------
typedef long long llil_int_type;
// All words in big1.txt, big2.txt, big3.txt are <= 6 chars in length.
// big.txt max word length is 6
// long.txt max word length is 208
//
// Based on rough benchmarking, the short fixed string hack below is only
// worth trying for MAX_STR_LEN_L up to about 22.
// See also https://backlinko.com/google-keyword-study
//
// To use (limited length) fixed length strings uncomment the next line.
#define MAX_STR_LEN_L 6
#ifdef MAX_STR_LEN_L
struct str_type : std::array<char, MAX_STR_LEN_L> {
bool operator==( const str_type& o ) const {
return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) == 0;
}
bool operator<( const str_type& o ) const {
return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) < 0;
}
};
#else
struct str_type : std::basic_string<char> {
bool operator==( const str_type& o ) const {
return ::strcmp(this->data(), o.data()) == 0;
}
bool operator<( const str_type& o ) const {
return ::strcmp(this->data(), o.data()) < 0;
}
};
#endif
using str_int_type = std::pair<str_type, llil_int_type>;
using vec_str_int_type = std::vector<str_int_type>;
// inject specialization of std::hash for str_type into namespace std
namespace std {
template<> struct hash<str_type> {
std::size_t operator()( str_type const& v ) const noexcept {
#if 0
return boost::hash_range( v.cbegin(), v.cend() );
#else
std::basic_string_view<char> bv {
reinterpret_cast<const char*>(v.data()), v.size() * sizeof(char) };
return std::hash<std::basic_string_view<char>>()(bv);
#endif
}
};
}
// create the parallel_flat_hash_map without internal mutexes
using map_str_int_type = phmap::parallel_flat_hash_map<
str_type, llil_int_type,
phmap::priv::hash_default_hash<str_type>,
phmap::priv::hash_default_eq<str_type>,
phmap::priv::Allocator<phmap::priv::Pair<const str_type, llil_int_type>>,
8, phmap::NullMutex
>;
// fast_atoll64
// https://stackoverflow.com/questions/16826422/
// c-most-efficient-way-to-convert-string-to-int-faster-than-atoi
inline int64_t fast_atoll64( const char* str )
{
int64_t val = 0;
// int sign = 0;
// if ( *str == '-' ) {
// sign = 1, ++str;
// }
uint8_t digit;
while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit;
// return sign ? -val : val;
return val;
}
// Mimic the Perl get_properties subroutine ----------------------------
// Limit line length and use ANSI C functions to try to boost performance
#define MAX_LINE_LEN_L 255
#define MAX_THREADS 32
static void get_properties(
const char* fname, // in : the input file name
map_str_int_type& map_upd // inout: the map to be updated
)
{
FILE* fh;
std::array<char, MAX_LINE_LEN_L + 1> line;
char* found;
llil_int_type count;
fh = ::fopen(fname, "r");
if ( fh == NULL ) {
std::cerr << "Error opening '" << fname << "' : " << strerror(errno) << "\n";
return;
}
while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), fh ) != NULL ) {
found = std::find( line.begin(), line.end(), '\t' );
count = fast_atoll64( found + 1 );
// Note: using emplace() is faster than map_upd[fixword] += count;
#ifdef MAX_STR_LEN_L
str_type fixword {}; // {} initializes all elements of fixword to '\0'
std::copy( line.begin(), found, fixword.begin() );
const auto [it, success] = map_upd.emplace(fixword, count);
#else
*found = '\0';
const auto [it, success] = map_upd.emplace(str_type{ line.data() }, count);
#endif
if ( !success ) it->second += count;
}
::fclose(fh);
}
typedef std::chrono::high_resolution_clock high_resolution_clock;
typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::milliseconds milliseconds;
double elaspe_time(
time_point cend,
time_point cstart)
{
return double (
std::chrono::duration_cast<milliseconds>(cend - cstart).count()
) * 1e-3;
}
// -------------------------------------------------------------------
+--
static map_str_int_type Map[MAX_THREADS];
int main(int argc, char* argv[])
{
if (argc < 2) {
std::cerr << "usage: llil4map file1 file2 ... >out.txt\n";
return 1;
}
std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fixed);
#ifdef MAX_STR_LEN_L
std::cerr << "llil4map (fixed string length=" << MAX_STR_LEN_L << ") start\n";
#else
std::cerr << "llil4map start\n";
#endif
#ifdef _OPENMP
std::cerr << "use OpenMP\n";
#else
std::cerr << "don't use OpenMP\n";
#endif
#if USE_BOOST_PARALLEL_SORT == 0
std::cerr << "don't use boost sort\n";
#else
std::cerr << "use boost sort\n";
#endif
time_point cstart1, cend1, cstart2, cend2, cstart3, cend3s, cend3;
cstart1 = high_resolution_clock::now();
#ifdef _OPENMP
// Determine the number of threads.
const char* env_nthrs = std::getenv("NUM_THREADS");
int nthrs = ( env_nthrs && strlen(env_nthrs) )
? ::atoi(env_nthrs)
: std::thread::hardware_concurrency();
if ( nthrs > MAX_THREADS ) nthrs = MAX_THREADS;
omp_set_dynamic(false);
omp_set_num_threads(nthrs);
#else
int nthrs = 1;
#endif
int nthrs_sort = ( std::thread::hardware_concurrency() < 12 )
? std::thread::hardware_concurrency()
: 12;
// Get the list of input files from the command line
int nfiles = argc - 1;
char** fname = &argv[1];
int merge_id = 0;
// Run parallel, depending on the number of threads
if ( nthrs == 1 || nfiles == 1 ) {
for ( int i = 0; i < nfiles; ++i ) {
get_properties(fname[i], Map[0]);
}
cend1 = high_resolution_clock::now();
double ctaken1 = elaspe_time(cend1, cstart1);
std::cerr << "get properties " << std::setw(8) << ctaken1 << " secs\n";
cstart2 = high_resolution_clock::now();
}
#ifdef _OPENMP
else {
merge_id = -1;
int nfiles_processed = 0;
int gettime_rendered = 0;
#pragma omp parallel
{
int tid = omp_get_thread_num();
#pragma omp for nowait schedule(static, 1)
for ( int i = 0; i < nfiles; ++i ) {
get_properties(fname[i], Map[tid]);
#pragma omp atomic
nfiles_processed += 1;
}
#pragma omp critical
{
// report time for get properties
if ( nfiles_processed == nfiles && !gettime_rendered ) {
cend1 = high_resolution_clock::now();
double ctaken1 = elaspe_time(cend1, cstart1);
std::cerr << "get properties " << std::setw(8) << ctaken1 << " secs\n";
cstart2 = high_resolution_clock::now();
gettime_rendered = 1;
}
// merge array
if ( Map[tid].size() > 0 ) {
if ( merge_id < 0 ) {
// container id other threads merge to
merge_id = tid;
}
else {
for ( auto const& x : Map[tid] ) {
// Map[merge_id][x.first] += x.second;
const auto [it, success] = Map[merge_id].emplace(x.first, x.second);
if ( !success ) it->second += x.second;
}
Map[tid].clear();
}
}
}
}
}
#endif
if ( merge_id < 0 || Map[merge_id].size() == 0 ) {
std::cerr << "No work, exiting...\n";
return 1;
}
// Store the properties into a vector
vec_str_int_type propvec( Map[merge_id].begin(), Map[merge_id].end() );
Map[merge_id].clear();
cend2 = high_resolution_clock::now();
double ctaken2 = elaspe_time(cend2, cstart2);
std::cerr << "finish merging " << std::setw(8) << ctaken2 << " secs\n";
cstart3 = high_resolution_clock::now();
// Sort the vector by (count) in reverse order, (name) in lexical order
#if USE_BOOST_PARALLEL_SORT == 0
std::sort(
// Standard sort
propvec.begin(), propvec.end(),
[](const str_int_type& left, const str_int_type& right) {
return left.second != right.second
? left.second > right.second
: left.first < right.first;
}
);
#else
boost::sort::block_indirect_sort(
// Parallel sort
propvec.begin(), propvec.end(),
[](const str_int_type& left, const str_int_type& right) {
return left.second != right.second
? left.second > right.second
: left.first < right.first;
},
nthrs_sort
);
#endif
cend3s = high_resolution_clock::now();
// Output the sorted vector
#ifdef MAX_STR_LEN_L
for ( auto const& n : propvec )
::print(fast_io::concatln(fast_io::mnp::os_c_str(n.first.data(), MAX_STR_LEN_L), "\t", n.second));
#else
for ( auto const& n : propvec )
::print(fast_io::concatln(n.first, "\t", n.second));
#endif
cend3 = high_resolution_clock::now();
double ctaken = elaspe_time(cend3, cstart1);
double ctaken3s = elaspe_time(cend3s, cstart3);
double ctaken3o = elaspe_time(cend3, cend3s);
std::cerr << "vector stable sort " << std::setw(8) << ctaken3s << " secs\n";
std::cerr << "write stdout " << std::setw(8) << ctaken3o << " secs\n";
std::cerr << "total time " << std::setw(8) << ctaken << " secs\n";
// Hack to see Private Bytes in Windows Task Manager
// (uncomment next line so process doesn't exit too quickly)
// std::this_thread::sleep_for(milliseconds(9000));
return 0;
}
Parallel HashMap References Added Later
- 2024 update: marioroy noted that std::mutex is slower in gcc-14 than gcc-13; he found another way to use spin locks via a small C++ class on the web.
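For context, such a spin lock is typically just a few lines around std::atomic_flag; the sketch below shows the general idea and is not necessarily the exact class that was found:

// minimal test-and-set spin lock sketch; compile e.g. g++ -std=c++17 -O3 -pthread
#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};

int main() {
    SpinLock s;
    s.lock();
    // ... critical section ...
    s.unlock();
    return 0;
}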
Re^3: Rosetta Code: Long List is Long (faster - vec)(faster++, and now parallel)
by Anonymous Monk on Jan 10, 2023 at 17:45 UTC
The good news is... The bad news is...
The good news is that I bring only good news! :) The modified J script is faster, more versatile, uses significantly less RAM, and has been tested with the 9.04 engine to parallelize the obvious low-hanging fruit for an additional speed boost.
NB. -----------------------------------------------------------
NB. --- This file is "llil4.ijs"
NB. --- Run as e.g.:
NB.
NB. jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------
pattern =: 0 1
NB. ========> This line has a star in its right margin =======> NB. *
args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args
NB. PAD_CHAR =: ' '
filter_CR =: #~ ~: & CR
make_more_space =: ' ' I. @ ((LF = ]) +. (TAB = ])) } ]
find_spaces =: I. @: = & ' '
read_file =: {{
'fname pattern' =. y
text =. make_more_space filter_CR fread fname
selectors =. (|.!.0 , {:) >: find_spaces text
width =. # pattern
height =. width <. @ %~ # selectors
append_diffs =. }: , 2& (-~/\)
shuffle_dims =. (1 0 3 & |:) @ ((2, height, width, 1) & $)
selectors =. append_diffs selectors
selectors =. shuffle_dims selectors
literal =. < @: (}:"1) @: (];. 0) & text "_1
numeric =. < @: (0&".) @: (; @: (<;. 0)) & text "_1
extract =. pattern & {
using =. 1 & \
or_maybe =. `
,(extract literal or_maybe numeric) using selectors
}}
read_many_files =: {{
'fnames pattern' =. y
,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames NB. *
}}
'words nums' =: read_many_files fn_in ; pattern
t1 =: (6!:1) '' NB. time since engine start
'words nums' =: (~. words) ; words +//. nums NB. *
'words nums' =: (\: nums)& { &.:>"_1 words ; nums
words =: ; nums < @ /:~/. words
t2 =: (6!:1) '' NB. time since engine start
text =: , words ,. TAB ,. (": ,. nums) ,. LF
erase 'words' ; 'nums'
text =: (#~ ~: & ' ') text
text fwrite fn_out
erase < 'text'
t3 =: (6!:1) '' NB. time since engine start
echo 'Read and parse input: ' , ": t1
echo 'Classify, sum, sort: ' , ": t2 - t1
echo 'Format and write output: ' , ": t3 - t2
echo 'Total time: ' , ": t3
echo ''
echo 'Finished. Waiting for a key...'
stdin ''
exit 0
The code above doesn't (yet) include any 9.04 features and runs OK with 9.03, but I found 9.04 slightly faster in general. I also found 9.04 a bit faster on Windows, which is the opposite of what I saw for 9.03 (the script ran faster on Linux); let's shrug that off because of 9.04's beta status and/or my antique PC. Results below are for beta 9.04 on Windows 10 (RAM usage taken from Windows Task Manager):
> jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
Read and parse input: 1.501
Classify, sum, sort: 2.09
Format and write output: 1.318
Total time: 4.909
Finished. Waiting for a key...
Peak working set (memory): 376,456K
There are 3 star-marked lines. To patch in the new 9.04 features and enable parallelization, replace them with these counterparts:
{{ for. i. 3 do. 0 T. 0 end. }} ''
,&.>/"2 (-#pattern) ]\ ,;(read_file @:(; &pattern)) t.'' "0 fnames
'words nums' =: (~.t.'' words) , words +//. t.'' nums
As you see, the 1st line replaces a comment, while the 2nd and 3rd lines require just minor touches. The 2nd line launches reading and parsing of the input files in parallel. The 3rd line parallelizes filtering for unique words and summing numbers according to the word classification -- kind of redundant double work, even as it was, as I see it. The 1st line starts 3 additional worker threads; I don't have more cores in my CPU anyway, and this script has no work easily dispatched to more workers. Then:
Read and parse input: 0.992
Classify, sum, sort: 1.849
Format and write output: 1.319
Total time: 4.16
I would call my parallelization attempt, however crude it was, a success. Next is output for our second "official" dataset in this thread:
> jconsole.exe llil4.ijs long1.txt long2.txt long3.txt out.txt
Read and parse input: 1.329
Classify, sum, sort: 0.149
Format and write output: 0.009
Total time: 1.487
########################################################
These are my results for the latest C++ solution (compiled using g++), to compare my efforts against:
$ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
llil2vec start
get_properties CPU time : 3.41497 secs
emplace set sort CPU time : 1.04229 secs
write stdout CPU time : 1.31578 secs
total CPU time : 5.77311 secs
total wall clock time : 5 secs
$ ./llil2vec_11149482 long1.txt long2.txt long3.txt >vec.tmp
llil2vec start
get_properties CPU time : 1.14889 secs
emplace set sort CPU time : 0.057158 secs
write stdout CPU time : 0.003307 secs
total CPU time : 1.20943 secs
total wall clock time : 2 secs
$ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 2.43187 secs
emplace set sort CPU time : 0.853877 secs
write stdout CPU time : 1.33636 secs
total CPU time : 4.62217 secs
total wall clock time : 5 secs
I noticed that the new C++ code, supposed to be faster, is actually slower (compared to llil2grt) with the "long" dataset, one of the two "official" datasets used in this thread.
More impressive analysis from our mysterious anonymonk.
I noticed that new C++ code, supposed to be faster, is actually slower (compared to llil2grt) with "long" dataset from two "official" datasets used in this thread
Good catch! I was so hyper-focused on trying to beat the lowest time for the simple
"big1.txt big2.txt big3.txt" test case that most folks were using,
I didn't even test the "long1.txt long2.txt long3.txt" test case. :(
With a few seconds thought it seems obvious that very long words will favour the hash-based approach over the vector-based one.
I'm pleased in a way because my hash-based solution is so much simpler and clearer than my vector-based one ... plus it
encourages us to explore different approaches.
I'm happy for folks to continue to claim the fastest time for the short string big1.txt big2.txt big3.txt test case.
That's like the 100 metre sprint.
We should add a marathon world record for fastest long1.txt long2.txt long3.txt long4.txt long5.txt long6.txt test case.
And maybe a 1500 metre world record for big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt. :)
Update: Timings for these three Olympic Events
Run under Ubuntu on my modest laptop against the code here: Interim version of llil3grt.cpp and llil4vec-tbb.cpp.
llil4vec easily won the 100 metre sprint; llil3grt easily won the marathon; llil4vec narrowly won the decider, the 1500 metres.
--------------------- 100 metre sprint -----------------------------------
$ ./llil3grt big1.txt big2.txt big3.txt >grt.tmp
llil3grt start
get_properties CPU time : 2.63733 secs
emplace set sort CPU time : 1.1302 secs
write stdout CPU time : 1.17949 secs
total CPU time : 4.94712 secs
$ NUM_THREADS=6 ./llil4vec-tbb big1.txt big2.txt big3.txt >4vec.tmp
llil4vec-tbb (fixed string length=6) start
use TBB
get properties time : 0.395106 secs
sort properties time : 0.325554 secs
emplace set sort time : 0.701055 secs
write stdout time : 0.541473 secs
total time : 1.96334 secs
--------------------------------- 1500 metre ------------------------------
$ ./llil3grt big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >big-long-grt.tmp
llil3grt start
get_properties CPU time : 3.13399 secs
emplace set sort CPU time : 1.0814 secs
write stdout CPU time : 1.30358 secs
total CPU time : 5.51907 secs
$ NUM_THREADS=6 ./llil4vec-tbb big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >big-long-4vec.tmp
llil4vec-tbb start
use TBB
get properties time : 1.05054 secs
sort properties time : 1.13865 secs
emplace set sort time : 1.16597 secs
write stdout time : 0.449603 secs
total time : 3.80495 secs
---------------------------------- marathon -------------------------------
$ ./llil3grt long1.txt long2.txt long3.txt long4.txt long5.txt long6.txt >long-long-3grt.tmp
llil3grt start
get_properties CPU time : 0.846717 secs
emplace set sort CPU time : 0.005067 secs
write stdout CPU time : 0.002561 secs
total CPU time : 0.854441 secs
$ NUM_THREADS=6 ./llil4vec-tbb long1.txt long2.txt long3.txt long4.txt long5.txt long6.txt >long-long-4vec.tmp
llil4vec-tbb start
use TBB
get properties time : 2.36002 secs
sort properties time : 1.28398 secs
emplace set sort time : 0.118658 secs
write stdout time : 0.001721 secs
total time : 3.76457 secs
---------------------------------------------
What follows is mostly my attempt to describe and explain, i.e. a fair amount of prose, perhaps not worth a copper (insert name of local coin of lowest denomination).
First of all, thanks for warm words marioroy, I'm glad you liked this cute toy. I think you'll be particularly interested in parallelization, how they do it in J. And BTW the key to press is Ctrl-D, to "slurp" from STDIN.
Plus, to keep this node from being completely off-topic among the Perl congregation -- some monks might also be interested to know that, if any FFI module for Perl is allowed to be used for work or play, it's very easy to execute J phrases, calling the DLL and reading/writing packed Perl scalars as J data. It's been a while since I had that fun, but see e.g. link [1] to get started; it is about calling J from the J REPL, but I remember those experiments were quite useful for learning. Whoever gets interested will easily find more tutorials.
In my 1st script I actually posted code which is borderline buggy, and I have felt uneasy about it since the previous year:
text =: , freads " 0 fn_in NB. wrong
text =: ; < @ freads " 0 fn_in NB. fixed
What's to the right of the comma in the 1st line reads into an N x M character table (the comma then converts it into a 1D list) -- number of files by longest file size, with short files padded. If the files were not the same size, the result would be erroneous. Further, if 100 files were 1 MB and the 101st file were 1 GB, then this would end with an "out of RAM" or similar error. Neither line is used in the modified code, but with this off my chest, let's move on.
I noticed that the very slow part of my solution was (and still is) reading and parsing. Moreover, this trivial part (not the subsequent sorting!) is the main RAM consumer during the script run. Whatever I tried didn't help much. Then I decided: if that's so -- let's make it at least potentially useful for the future.
Instead of ad hoc code to consume strictly 2-column input, I wrote a function ("verb") to read from (well-formed) TSV file with any number of columns, with a "pattern" provided to describe how to read columns, as either text or numbers. In our case, pattern is obviously 0 1. To keep script interface consistent, this value is hard-coded, but should be e.g. CLI argument, of course.
Part of what is a "well-formed" TSV file for this verb, is that "text" doesn't contain ASCII-32 space character. You maybe noticed I commented-out PAD_CHAR definition, because changing it won't be very useful. Some more substantial changes would be required to allow ASCII-32 (and then to pad with ASCII-0, perhaps), because frame-fill for literal data type is space character, nothing else. I'm keeping the line commented-out as a reminder.
Which leads us to the essential question -- why keep textual data as padded N x M table, instead of keeping strings of text (they may be any different lengths then) each in their own "box"? Box in J is atomic data type and sort of pointer to whatever is inside. Data structures of any complexity are implemented with nested boxes.
The answer is -- because boxes are relatively expensive [2]. A few thousands (or 1e5 or so) are OK. An array of 10 million boxes (total number of lines in our dataset) easily eats a few GB of RAM and is very slow.
This also explains why I didn't use DSV addon [3] from standard library. For similar reasons (while I'm at explaining this stage of my route of decision making) I decided against symbols [4] (irreducible and shared GST (global symbol table)? Don't know what to think of it).
So, textual columns are parsed into padded 2D tables, but then I saw that slurping all files into single string prior to parsing is not required at all, if files are not too tiny. And indeed, RAM requirements for the script dropped significantly then. Further, parsing file by file makes obvious choice as to where to try to parallelize.
To finish with data input: in the modified script the CRs are filtered out unconditionally, and, also unconditionally, NOT injected on output, on both OSes. And BTW, for fread and fwrite from the standard library there exist counterparts freads and fwrites to strip CRs. I don't know why, but they are slow compared to manual filtering; the latter is many seconds slower and wasn't even used in the original script. The modified script got rid of freads, also.
Moving on to how sorting was done in 1st script version -- actually with somewhat complicated and slow code, trying to sort word sub-ranges (per the same numeric count) without creating intermediate boxed arrays. In modified script, this is replaced with just a short simple phrase. As long as there are less than a few millions of unique counts, this would be OK w.r.t. RAM usage (but 1st version would get very slow then anyway).
This leads to new verb in 9.04 version (Key dyad [5]). I thought using it would help to "pack" pairs of words and numbers; then do kind of "packed sort" trick. It turned out to be a few times slower, I didn't investigate and I abandoned this route.
Moving to data output, requiring user to provide "numeric width" or whatever it was called was quite silly on my part; computing decimal logarithm of maximum would be OK :) And even so, it turned out that J formats numeric column (as opposed to row i.e. 1D array) automatically padded as required.
My final thoughts would be that multithreading is new feature in 9.04 version, as I already said. Documentation says [6]:
Even if your code does not create tasks, the system can take advantage of threadpool 0 to make primitives run faster
At first I thought it meant that simply spawning workers would be enough for J to run in parallel -- kind of how e.g. PDL does [7] (or tries to do). It turned out not to be so, either because the script is very simple, or because this automatic parallelization is not yet enabled in the beta. We'll see.
1. https://code.jsoftware.com/wiki/Guides/DLLs/Calling_the_J_DLL
2. https://www.jsoftware.com/help/jforc/performance_measurement__tip.htm#_Toc191734575
3. https://code.jsoftware.com/wiki/Addons/tables/dsv
4. https://code.jsoftware.com/wiki/Vocabulary/sco
5. https://code.jsoftware.com/wiki/Vocabulary/slashdot#dyadic
6. https://code.jsoftware.com/wiki/Vocabulary/tcapdot#dyadic
7. https://metacpan.org/dist/PDL/view/Basic/Pod/ParallelCPU.pod
E-hm... hello? Are we still playing?
Long thread is long. I should have known better before even beginning to think about mumbling "paralle..li...", given the massive barrage of fire that ensued immediately after :). I hope I chose the correct version to test, and my dated PC (the number of workers in particular) is a poor workbench, but:
$ time ./llil3vec_11149482 big1.txt big2.txt big3.txt >vec6.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 1.80036 secs
emplace set sort CPU time : 0.815786 secs
write stdout CPU time : 1.39233 secs
total CPU time : 4.00856 secs
total wall clock time : 4 secs
real 0m4.464s
user 0m3.921s
sys 0m0.445s
$ time ./llil3vec_11149482_omp big1.txt big2.txt big3.txt >vec6.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 2.06675 secs
emplace set sort CPU time : 0.94937 secs
write stdout CPU time : 1.40311 secs
total CPU time : 4.41929 secs
total wall clock time : 4 secs
real 0m3.861s
user 0m4.356s
sys 0m0.493s
----------------------------------------------
Then I sent my workers into retirement to plant or pick flowers or something, i.e. (temporarily) reverted to single-threaded code, walked around (snow, no flowers), and made a few changes; here's a comparison of the previous and new versions:
$ time ../j903/bin/jconsole llil4.ijs big1.txt big2.txt big3.txt out_j.txt
Read and parse input: 1.6121
Classify, sum, sort: 2.23621
Format and write output: 1.36701
Total time: 5.21532
real 0m5.220s
user 0m3.934s
sys 0m1.195s
$ time ../j903/bin/jconsole llil5.ijs big1.txt big2.txt big3.txt out_j.txt
Read and parse input: 1.40811
Classify, sum, sort: 1.80736
Format and write output: 0.373946
Total time: 3.58941
real 0m3.594s
user 0m2.505s
sys 0m0.991s
$ diff vec6.tmp out_j.txt
$
New script:
NB. -----------------------------------------------------------
NB. --- This file is "llil5.ijs"
NB. --- Run as e.g.:
NB.
NB. jconsole.exe llil5.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------
pattern =: 0 1
args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args
filter_CR =: #~ ~: & CR
read_file =: {{
'fname pattern' =. y
text =. TAB, filter_CR fread fname
text =. TAB (I. text = LF) } text
selectors =. I. text = TAB
width =. # pattern
height =. width <. @ %~ # selectors
append_diffs =. }: , 2& (-~/\)
shuffle_dims =. (1 0 3 & |:) @ ((2, height, width, 1) & $)
selectors =. append_diffs selectors
selectors =. shuffle_dims selectors
literal =. < @: (}."1) @: (];. 0) & text "_1
numeric =. < @: (0&".) @: (; @: (<;. 0)) & text "_1
extract =. pattern & {
using =. 1 & \
or_maybe =. `
,(extract literal or_maybe numeric) using selectors
}}
read_many_files =: {{
'fnames pattern' =. y
,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames
}}
'words nums' =: read_many_files fn_in ; pattern
t1 =: (6!:1) '' NB. time since engine start
idx =: i.~ words
nums =: idx +//. nums
idx =: nums </. ~. idx
words =: (/:~ @: { &words)&.> idx
erase < 'idx'
nums =: ~. nums
'words nums' =: (\: nums)& { &.:>"_1 words ; nums
t2 =: (6!:1) '' NB. time since engine start
text =: ; words (, @: (,"1 _))&.(>`a:)"_1 TAB ,. (": ,. nums) ,. LF
erase 'words' ; 'nums'
text =: (#~ ~: & ' ') text
text fwrite fn_out
erase < 'text'
t3 =: (6!:1) '' NB. time since engine start
echo 'Read and parse input: ' , ": t1
echo 'Classify, sum, sort: ' , ": t2 - t1
echo 'Format and write output: ' , ": t3 - t2
echo 'Total time: ' , ": t3
exit 0
echo ''
echo 'Finished. Waiting for a key...'
stdin ''
exit 0
----------------------------------------------
I don't know the C++ "tools" chosen above ("modules" or whatever they are called) at all; is capping the length to "6" in the code just a matter of convenience, i.e. could any longer value be hard-coded instead, like "12" or "25" (with the obvious other fixes)? I mean, would no catastrophic (cubic, etc.) slow-down happen to sorting after some threshold, forcing one to comment out the define and use the alternative set of "tools"? Perhaps input would be slower if cutting into unequally long words is expected?
Anyway, here's output if the define is commented-out:
$ time ./llil3vec_11149482_no6 big1.txt big2.txt big3.txt >vec6.tmp
llil3vec start
get_properties CPU time : 3.19387 secs
emplace set sort CPU time : 0.996694 secs
write stdout CPU time : 1.32918 secs
total CPU time : 5.5198 secs
total wall clock time : 6 secs
real 0m6.088s
user 0m5.294s
sys 0m0.701s
$ time ./llil3vec_11149482_no6_omp big1.txt big2.txt big3.txt >vec6.tmp
llil3vec start
get_properties CPU time : 3.99891 secs
emplace set sort CPU time : 1.13424 secs
write stdout CPU time : 1.41112 secs
total CPU time : 6.54723 secs
total wall clock time : 4 secs
real 0m4.952s
user 0m6.207s
sys 0m0.842s
Should my time be compared to them? :) (Blimey, my solution doesn't have to compete when participants are capped selectively (/grumpy_on around here)). Or I can use powerful magic secret turbo mode:
turbo_mode_ON =: {{
assert. 0 <: c =. 8 - {: $y
h =. (3 (3!:4) 16be2), ,|."1 [3 (3!:4)"0 (4:,#,1:,#) y
3!:2 h, ,y ,"1 _ c # ' '
}}
turbo_mode_OFF =: {{
(5& }. @: (_8& (]\)) @: (2& (3!:1))) &.> y
}}
Inject these definitions, and these couple lines immediately after t1 =: and before t2 =: respectively:
words =: turbo_mode_ON words
words =: turbo_mode_OFF words
Aha:
$ time ../j903/bin/jconsole llil5.ijs big1.txt big2.txt big3.txt out_j.txt
Read and parse input: 1.40766
Classify, sum, sort: 1.24098
Format and write output: 0.455868
Total time: 3.1045
real 0m3.109s
user 0m1.815s
sys 0m1.210s
(and no cutting to pieces of pre-defined equal length was used ...yet) :)
----------------------------------------------
I can revert to parallel reading/parsing anytime, with the effect shown in the parent node. As implemented, it was kind of passive; but files can be of unequal sizes, or there may be just one huge single file. I think a serious solution would probe inside to find newlines at approximate addresses, then pass chunk coordinates to workers to parse in parallel.
The puny 2-worker attempt to sort, in the parent, was just a kind of #pragma omp parallel sections... thing with 2 sections; no use sending bus-loads of workers and expecting quiet fans. There's some hope for "parallelizable primitives" in release (not beta) 9.04 or later. Maybe that's a long time to wait. Or, if I could write code to merge 2 sorted arrays faster than the built-in primitive sorts either half -- then, bingo, I'd have a multi-threaded fast merge-sort. But no success yet; the built-in sorts one large array faster, in a single thread.
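In C++ terms, the two-way idea described above (sort each half concurrently, then do one linear merge) looks roughly like this; a toy sketch, and whether the merge step can actually beat a single built-in sort is exactly the open question:

// sort two halves in parallel, then merge; compile e.g. g++ -std=c++17 -O3 -pthread
#include <algorithm>
#include <thread>
#include <vector>

int main() {
    std::vector<int> v = {5, 3, 9, 1, 7, 2, 8, 6};
    auto mid = v.begin() + v.size() / 2;
    // the two halves are disjoint ranges, so sorting them in parallel is safe
    std::thread t1([&] { std::sort(v.begin(), mid); });
    std::thread t2([&] { std::sort(mid, v.end()); });
    t1.join();
    t2.join();
    // a single O(n) pass merges the sorted halves into the final result
    std::vector<int> out(v.size());
    std::merge(v.begin(), mid, mid, v.end(), out.begin());
    return 0;
}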
What a delight for our Anonymonk friend to come back. Thanks to you, we tried parallel :).
... but files can be unequal sizes, or just one huge single file. I think serious solution would probe inside to find newlines at approx. addresses, then pass chunks coords to workers to parse in parallel.
Chuma mentions 2,064 input files in the initial "Long list is long" thread. Processing a list of files in parallel suits this use case because there are many files. Back in 2014, I wrote utilities that support both chunking and list modes: mce_grep and egrep.pl, via --chunk-level={auto|file|list}.
llil5p.ijs
I took llil5.ijs and created a parallel version named llil5p.ijs, based on code-bits from your prior post. The number of threads can be specified via the NUM_THREADS environment variable.
$ diff -u llil5.ijs llil5p.ijs
--- llil5.ijs 2023-01-18 09:25:14.041515970 -0600
+++ llil5p.ijs 2023-01-18 09:25:58.889669110 -0600
@@ -9,6 +9,12 @@
pattern =: 0 1
+nthrs =: 2!:5 'NUM_THREADS' NB. get_env NUM_THREADS
+{{
+ if. nthrs do. nthrs =: ".nthrs end. NB. string to integer conversion
+ for. i. nthrs do. 0 T. 0 end. NB. spin nthrs
+}} ''
+
args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args
@@ -44,7 +50,7 @@
read_many_files =: {{
'fnames pattern' =. y
- ,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames
+ ,&.>/"2 (-#pattern) ]\ ,;(read_file @:(; &pattern)) t.'' "0 fnames
}}
'words nums' =: read_many_files fn_in ; pattern
llil5tp.ijs
Next, I applied the turbo update to the parallel version and named it llil5tp.ijs.
$ diff -u llil5p.ijs llil5tp.ijs
--- llil5p.ijs 2023-01-18 09:25:58.889669110 -0600
+++ llil5tp.ijs 2023-01-18 09:26:01.553736512 -0600
@@ -21,6 +21,16 @@
filter_CR =: #~ ~: & CR
+turbo_mode_ON =: {{
+ assert. 0 <: c =. 8 - {: $y
+ h =. (3 (3!:4) 16be2), ,|."1 [3 (3!:4)"0 (4:,#,1:,#) y
+ 3!:2 h, ,y ,"1 _ c # ' '
+}}
+
+turbo_mode_OFF =: {{
+ (5& }. @: (_8& (]\)) @: (2& (3!:1))) &.> y
+}}
+
read_file =: {{
'fname pattern' =. y
@@ -56,6 +66,7 @@
'words nums' =: read_many_files fn_in ; pattern
t1 =: (6!:1) '' NB. time since engine start
+words =: turbo_mode_ON words
idx =: i.~ words
nums =: idx +//. nums
@@ -65,6 +76,7 @@
nums =: ~. nums
'words nums' =: (\: nums)& { &.:>"_1 words ; nums
+words =: turbo_mode_OFF words
t2 =: (6!:1) '' NB. time since engine start
text =: ; words (, @: (,"1 _))&.(>`a:)"_1 TAB ,. (": ,. nums) ,. LF