PerlMonks |
Re: Long list is long
by marioroy (Prior) on Feb 10, 2023 at 03:46 UTC ( [id://11150293] )
Greetings, Chuma. Eyepopslikeamosquito started a thread here. The thread is quite long and contains many solutions, most of which compute and store the map table in memory. You mentioned having 2,064 list files, up to a couple hundred MB each. If your data set contains billions of unique key-value pairs, most of those solutions may fail due to insufficient memory capacity.

We created a backup plan (well, there are two). One solution consumes less memory and can handle hundreds of list files, possibly thousands: parsort from GNU Parallel plus tally-count. The instructions for building tally-count.cpp are in its source. Do not run the default sort utility on the box; instead run parsort (as shown) for minimum memory consumption. At the time of writing, it is best to have parsort read STDIN when processing hundreds of files, because files specified as arguments cause parsort to create two processes per file.
The output of the first sort command contains duplicate key names. Next, the tally-count command sums the count fields of adjacent lines that share a name, so its output contains unique names. Finally, parsort is called again to sort by sum in descending order and by key name in ascending order.

TMPDIR available space: processing two thousand list files via parsort may fill up $TMPDIR (/tmp). You can set the TMPDIR environment variable to a location with ample storage, or specify the -T option. Ensure 1.2x to 2x of the combined input size is available as temp storage.
I made a parsort variant named mcesort.
For extra performance, the --tally option integrates the tally command with the merge step. The effect is less work for subsequent merges, which is beneficial when there are many duplicate keys. The -A option sets LC_ALL=C. The -j24 option is short for --parallel=24.
Noteworthy C++ solutions. You can run llil4map, llil4map2 (memory efficient), llil4hmap, or llil4emh. Temporary storage isn't used, and there's no limit to the number of list files. Are the keys of limited length, say 8 to 20 characters? Then uncomment the line "#define MAX_STR_LEN_L" and adjust the value accordingly. That will minimize memory consumption.
If you're processing a hundred or fewer files (~200 MB each; again, depending on memory capacity, ~50 GB total), the llil4vec solution favors time over space; it is based on llil3vec. Be sure to uncomment the line "#define MAX_STR_LEN_L" for limited-length keys, e.g. 20 characters or less. See also llil4vec2.
July 2023: alternative Perl/C++ solutions. For results, refer to C++ vs Perl.

llilsql.pl - Tie::Hash::DBD SQLite database
llildbf.pl - DB_File B-tree database
llilmma.pl - IPC::MMA shared memory hash
lliltch.pl - Tokyo Cabinet hash database
llilkch.pl - Kyoto Cabinet hash database
llil4tkh.cc - tkrzw::ShardDBM demonstration
llil4tkh2.cc - tkrzw::HashDBM demonstration

March 2024 updates.

1. Reported issues to NVIDIA regarding nvc++. Greg fixed warnings in his phmap library. No more warnings using nvc++.
2. Use a spinlock_mutex found on the web versus std::mutex. This improves llil4map, llil4hmap, and llil4emh. Tested with g++, clang++, and nvc++. https://github.com/clearlinux/distribution/issues/3036
3. Greg, the author of the phmap C++ library, provided a suggestion for ensuring minimum memory consumption during "map to vector" in llil4map. https://github.com/greg7mdp/parallel-hashmap/issues/238
4. The llil4map demonstration lives at the phmap C++ repository: examples/llil4map.cc and the associated llil_utils folder. https://github.com/greg7mdp/parallel-hashmap
5. Applied Greg's minor simplification of the parallel sort to the C++ demonstrations, and updated "map to vector" in Gist llil4map.cc to the simplified version. https://github.com/greg7mdp/parallel-hashmap/commit/38dd6b52940937cae06191b7fb39ad0d6aeea4d2

April 2024 updates.

1. Added llil4map2, a memory-efficient version. https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2036284574
2. A more memory-efficient version, by Gregory Popovitch, author of the C++ parallel-hashmap library. https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2039591745

See also, The Long List is Long resurrected.
In Section: Seekers of Perl Wisdom