Re: Long list is long

by marioroy (Prior)
on Feb 10, 2023 at 03:46 UTC


in reply to Long list is long

Greetings, Chuma

Eyepopslikeamosquito started a thread here. The thread is quite long and contains many solutions. Most of them compute and store the map table in memory. You mentioned having 2,064 list files, up to a couple hundred MB each. If your data set contains billions of unique key-value pairs, most of the solutions may fail due to insufficient memory.

We created a backup plan (well, there are two). One solution requires less memory and can handle hundreds of list files, possibly thousands: parsort from GNU Parallel combined with tally-count. The instructions for building tally-count.cpp are in the source.

Do not run the box's default sort utility; instead, run parsort (as shown) for minimal memory consumption. At the time of writing, it is best to have parsort read STDIN when processing hundreds of files. Specifying the files as arguments causes parsort to create two processes per file.

cat list1.txt list2.txt ... listN.txt |\
LC_ALL=C parsort --parallel=24 -k1 | /path/to/tally-count |\
LC_ALL=C parsort --parallel=24 -k2nr > out.txt

The output from the first sort command contains duplicate key names. Next, the tally-count command sums the adjacent count fields of duplicate names, so its output contains unique names. Finally, parsort is called again to sort by sum in descending order and key name in ascending order.
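
For illustration, below is a minimal, single-threaded sketch of what the summing stage does. It is not the actual tally-count source (build instructions are in the source, as noted above), and it assumes input lines of the form "name<TAB>count", already sorted by name.

// tally_sketch.cpp -- a sketch of the summing stage, NOT the real
// tally-count program. Reads name\tcount lines sorted by name from
// STDIN; writes one line per unique name with the counts summed.
#include <cstdlib>
#include <iostream>
#include <string>

int main() {
    std::string line, prev_name;
    long long sum = 0;
    bool have_prev = false;

    while (std::getline(std::cin, line)) {
        auto tab = line.rfind('\t');
        if (tab == std::string::npos) continue;        // skip malformed lines
        std::string name = line.substr(0, tab);
        long long count = std::strtoll(line.c_str() + tab + 1, nullptr, 10);

        if (have_prev && name == prev_name) {
            sum += count;                              // adjacent duplicate: accumulate
        } else {
            if (have_prev)
                std::cout << prev_name << '\t' << sum << '\n';
            prev_name = std::move(name);
            sum = count;
            have_prev = true;
        }
    }
    if (have_prev)
        std::cout << prev_name << '\t' << sum << '\n'; // flush the final key
    return 0;
}

Because the input is sorted, a single pass with one saved key suffices; memory stays constant regardless of how many unique keys flow through.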

TMPDIR available space.

Processing two thousand list files via parsort may cause $TMPDIR (/tmp) to fill up. You can set the TMPDIR environment variable to a location with ample storage or specify the -T option. Ensure the temporary location has 1.2x to 2x the combined input size in available space.

cat list1.txt list2.txt ... listN.txt |\
LC_ALL=C parsort --parallel=24 -T /path/to/tmpdir -k1 | /path/to/tally-count |\
LC_ALL=C parsort --parallel=24 -T /path/to/tmpdir -k2nr > final.txt

I made a parsort variant named mcesort.

cat list1.txt list2.txt ... listN.txt |\
LC_ALL=C mcesort --parallel=24 -T /path/to/tmpdir -k1 | /path/to/tally-count |\
LC_ALL=C mcesort --parallel=24 -T /path/to/tmpdir -k2nr > final.txt

For extra performance, the --tally option integrates the tally command with the merge step. The effect is less work for the subsequent merge, which is beneficial when there are many duplicate keys. The -A option sets LC_ALL=C. The -j24 option is short for --parallel=24.

cat list1.txt list2.txt ... listN.txt |\
mcesort -A -j24 -T /path/to/tmpdir --tally=/path/to/tally-count -k1 |\
mcesort -A -j24 -T /path/to/tmpdir -k2nr > final.txt
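
To see why tallying during the merge helps, here is a minimal sketch of the idea; it is not mcesort's actual code. Two name-sorted runs of (name, count) pairs are merged while duplicate keys are collapsed on the spot, so any later merge pass sees fewer records.

// merge_tally_sketch.cpp -- a sketch only, not mcesort's implementation.
// Merges two runs sorted by name, summing counts of equal names.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using Rec = std::pair<std::string, long long>;   // (name, count)

std::vector<Rec> merge_tally(const std::vector<Rec>& a, const std::vector<Rec>& b) {
    std::vector<Rec> out;
    std::size_t i = 0, j = 0;
    auto push = [&out](const Rec& r) {
        if (!out.empty() && out.back().first == r.first)
            out.back().second += r.second;       // collapse duplicate key now
        else
            out.push_back(r);
    };
    while (i < a.size() && j < b.size())
        push(a[i].first <= b[j].first ? a[i++] : b[j++]);
    while (i < a.size()) push(a[i++]);
    while (j < b.size()) push(b[j++]);
    return out;
}

int main() {
    std::vector<Rec> run1{{"apple", 2}, {"cherry", 1}};
    std::vector<Rec> run2{{"apple", 3}, {"banana", 4}, {"cherry", 5}};
    for (const auto& [name, count] : merge_tally(run1, run2))
        std::cout << name << '\t' << count << '\n';   // apple 5, banana 4, cherry 6
}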

Noteworthy C++ solutions.

You can run llil4map, llil4map2 (memory efficient), llil4hmap, or llil4emh. Temporary storage isn't used, and there is no limit to the number of list files. Are the keys of limited length, say 8 to 20 characters? Then uncomment the line "#define MAX_STR_LEN_L" and adjust the value accordingly. That will minimize memory consumption; see the sketch after the usage lines below.

NUM_THREADS=24 llil4map list1.txt list2.txt ... listN.txt > final.txt
NUM_THREADS=24 llil4hmap list1.txt list2.txt ... listN.txt > final.txt
NUM_THREADS=24 llil4emh list1.txt list2.txt ... listN.txt > final.txt
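
The following sketch illustrates why a fixed maximum key length reduces memory. The struct and names here are hypothetical, not the actual definitions in the llil programs: a fixed-size char array key is stored inline, avoiding the per-key heap allocation and bookkeeping that std::string incurs.

// fixed_key_sketch.cpp -- hypothetical illustration of the MAX_STR_LEN_L idea.
#include <algorithm>
#include <cstring>
#include <iostream>
#include <string>

#define MAX_STR_LEN_L 20              // adjust to your longest key

struct FixedKey {
    char buf[MAX_STR_LEN_L];          // inline storage, no heap allocation

    FixedKey(const std::string& s) {
        std::memset(buf, 0, sizeof(buf));
        std::memcpy(buf, s.data(), std::min(s.size(), sizeof(buf)));
    }
    bool operator==(const FixedKey& o) const {
        return std::memcmp(buf, o.buf, sizeof(buf)) == 0;
    }
};

int main() {
    std::cout << "sizeof(FixedKey)    = " << sizeof(FixedKey) << '\n';    // 20 bytes, self-contained
    std::cout << "sizeof(std::string) = " << sizeof(std::string) << '\n'; // typically 24-32, plus heap
}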

If you're processing a hundred or fewer files (~200 MB each; again, depending on memory capacity, ~50 GB), the llil4vec solution favors time over space. It is based on llil3vec. Be sure to uncomment the line "#define MAX_STR_LEN_L" for limited-length keys, e.g. 20 characters or less. See also llil4vec2. A simplified sketch of the vector approach follows the usage line below.

NUM_THREADS=24 llil4vec list1.txt list2.txt ... listN.txt > final.txt
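
As a rough illustration of the vector approach (not the actual llil4vec, which is multithreaded and far more elaborate), the steps are: append every (name, count) pair, sort by name, collapse adjacent duplicates, then sort by count descending and name ascending. A minimal single-threaded sketch, assuming tab-separated input:

// vec_sketch.cpp -- the vector-oriented idea, greatly simplified.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using Rec = std::pair<std::string, long long>;

int main(int argc, char* argv[]) {
    std::vector<Rec> recs;
    for (int i = 1; i < argc; ++i) {                    // read "name\tcount" lines
        std::ifstream in(argv[i]);
        std::string line;
        while (std::getline(in, line)) {
            auto tab = line.rfind('\t');
            if (tab == std::string::npos) continue;
            recs.emplace_back(line.substr(0, tab),
                              std::stoll(line.substr(tab + 1)));
        }
    }
    std::sort(recs.begin(), recs.end());                // group duplicates together
    std::size_t out = 0;
    for (std::size_t i = 1; i < recs.size(); ++i) {     // collapse adjacent duplicates
        if (recs[i].first == recs[out].first)
            recs[out].second += recs[i].second;
        else
            recs[++out] = std::move(recs[i]);
    }
    if (!recs.empty()) recs.resize(out + 1);
    std::sort(recs.begin(), recs.end(), [](const Rec& a, const Rec& b) {
        return a.second != b.second ? a.second > b.second   // count descending
                                    : a.first < b.first;    // then name ascending
    });
    for (const auto& [name, count] : recs)
        std::cout << name << '\t' << count << '\n';
}

The space cost is one vector entry per input pair (before collapsing), which is why this approach wants roughly 50 GB for a hundred ~200 MB files, trading memory for speed.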

July 2023: alternative Perl/C++ solutions. For results, refer to C++ vs Perl.

llilsql.pl   - Tie::Hash::DBD SQLite database
llildbf.pl   - DB_File B-tree database
llilmma.pl   - IPC::MMA shared memory hash
lliltch.pl   - Tokyo Cabinet hash database
llilkch.pl   - Kyoto Cabinet hash database
llil4tkh.cc  - tkrzw::ShardDBM demonstration
llil4tkh2.cc - tkrzw::HashDBM demonstration

March 2024 updates.

1. Reported issues to NVIDIA regarding nvc++. Greg fixed warnings
   in his phmap library. No more warnings when using nvc++.

2. Use a spinlock_mutex found on the web instead of std::mutex.
   This improves llil4map, llil4hmap, and llil4emh.
   Tested with g++, clang++, and nvc++.
   https://github.com/clearlinux/distribution/issues/3036

3. Greg, the author of the phmap C++ library, provided a suggestion for
   ensuring minimum memory consumption during "map to vector" in llil4map.
   https://github.com/greg7mdp/parallel-hashmap/issues/238

4. The llil4map demonstration lives in the phmap C++ repository:
   examples/llil4map.cc and the associated llil_utils folder.
   https://github.com/greg7mdp/parallel-hashmap

5. Applied Greg's minor simplification of the parallel sort to the C++ demonstrations.
   Updated "map to vector" in the Gist llil4map.cc to the simplified version.
   https://github.com/greg7mdp/parallel-hashmap/commit/38dd6b52940937cae06191b7fb39ad0d6aeea4d2

April 2024 updates.

1. Added llil4map2, a memory-efficient version.
   https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2036284574

2. A more memory-efficient version, by Gregory Popovitch, author of the C++ parallel-hashmap library.
   https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2039591745

See also: The Long List is Long resurrected.
