I've long found it fun to implement the same algorithm in different languages,
especially Perl and C++ ...
and then sit back and reflect on the lessons learned ... so
when Long list is long appeared recently, I felt it was
short and interesting enough to make an excellent Rosetta code node.
Solutions to this problem must read a number of input LLiL-format files (given as command line arguments)
and write a single merged LLiL-format file to stdout.
The LLiL-format is described in the comments at the top of llil.pl below.
In the interests of keeping the code as short and fast as possible,
you may assume the input LLiL files are well-formed.
For example, you don't need to check for and remove leading and trailing whitespace on each line.
The sample solutions given below in Perl and C++ should clarify program requirements.
Please feel free to respond away with solutions to this problem in your favourite programming language
and to offer suggested improvements to my sample Perl and C++ solutions below.
Perl Solution
Here's my Perl solution, heavily influenced by responses to Long list is long,
especially kcott's concise and clear solution:
# llil.pl
# Example run: perl llil.pl tt1.txt tt2.txt >oo1.tmp
use strict;
use warnings;
# ----------------------------------------------------------------------
# LLiL specification
# ------------------
# A LLiL-format file is a text file.
# Each line consists of a lowercase name, a TAB character, and a non-negative integer count.
# That is, each line must match : ^[a-z]+\t\d+$
# For example, reading the LLiL-format files, tt1.txt containing:
# camel\t42
# pearl\t94
# dromedary\t69
# and tt2.txt containing:
# camel\t8
# hello\t12345
# dromedary\t1
# returns this hashref:
# $hash_ret{"camel"} = 50
# $hash_ret{"dromedary"} = 70
# $hash_ret{"hello"} = 12345
# $hash_ret{"pearl"} = 94
# That is, values are added for items with the same key.
#
# To get the required LLiL text, you must sort the returned hashref
# descending by value and insert a TAB separator:
# hello\t12345
# pearl\t94
# dromedary\t70
# camel\t50
# To make testing via diff easier, we further sort ascending by name
# for lines with the same value.
# ----------------------------------------------------------------------
# Function get_properties
# Read a list of LLiL-format files
# Return a reference to a hash of properties
sub get_properties
{
my $files = shift;    # in: reference to a list of LLiL-format files
my %hash_ret; # out: reference to a hash of properties
for my $fname ( @{$files} ) {
open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
while (<$fh>) {
chomp;
my ($word, $count) = split /\t/;
$hash_ret{$word} += $count;
}
close($fh) or die "error: close '$fname': $!";
}
return \%hash_ret;
}
# ----------------- mainline -------------------------------------------
@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;
warn "llil start\n";
my $tstart1 = time;
my $href = get_properties( \@llil_files );
my $tend1 = time;
my $taken1 = $tend1 - $tstart1;
warn "get_properties : $taken1 secs\n";
my $tstart2 = time;
for my $key ( sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href} ) {
print "$key\t$href->{$key}\n";
}
my $tend2 = time;
my $taken2 = $tend2 - $tstart2;
my $taken = $tend2 - $tstart1;
warn "sort + output : $taken2 secs\n";
warn "total : $taken secs\n";
What makes this problem interesting to me is the requirement
to sort the hash in descending order by value:
sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href}
because the performance of such a sort may suffer when
dealing with huge files (after all, performance was the reason for the OP's
question in the first place).
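To make the required behaviour concrete in one more language, here is a minimal Python sketch of the same merge-and-sort pipeline (the helper name `merge_llil` and the in-memory line lists are inventions for this illustration, not part of the benchmark programs):

```python
# Sketch of the LLiL merge and sort (illustrative only; merge_llil is a
# made-up helper name, and input is given as in-memory lists of lines).
def merge_llil(files_lines):
    """files_lines: iterable of iterables of 'name\\tcount' lines."""
    counts = {}
    for lines in files_lines:
        for line in lines:
            word, count = line.split("\t")
            counts[word] = counts.get(word, 0) + int(count)
    # Sort descending by count, then ascending by name for equal counts.
    return [f"{k}\t{v}" for k, v in
            sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

tt1 = ["camel\t42", "pearl\t94", "dromedary\t69"]
tt2 = ["camel\t8", "hello\t12345", "dromedary\t1"]
print("\n".join(merge_llil([tt1, tt2])))
```

Run on the tt1/tt2 sample data from the spec, this prints the four lines in the required order (hello, pearl, dromedary, camel).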
I'm hoping solving this problem in multiple languages will
be fun and instructive -- and perhaps give us insight into how
performance changes as the number of items increases.
C++ Solution
Here's my C++ solution:
// llil.cpp. C++ 11 version of Perl llil.pl.
// g++ compile on Linux:
// g++ -o llil -std=c++11 -Wall -O3 llil.cpp
// This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
// Example run: llil tt1.txt tt2.txt >out.txt
// Uncomment next line to sort by creating a multimap (instead of via the sort function)
// #define LLIL_SORT_VIA_MULTIMAP_L 1
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <cstdio>
#include <string>
#include <vector>
#include <set>
#include <map>
#include <unordered_map>
#include <algorithm>
#include <utility>
#include <iostream>
#include <fstream>
#include <sstream>
// ------------------------------------------------------------
// Performance hacks to speed up IO.
// See https://www.reddit.com/r/rust/comments/9xedap/how_to_achieve_fast_stdinstdout_io_suitable_for/
// Avoid flush by using "\n" instead of std::endl (this made a big difference in this program!)
// This one made almost no difference in this program:
// const auto io_speed_up =[](){
// std::ios::sync_with_stdio(false);
// std::cin.tie(nullptr);
// return nullptr;
// }();
// ------------------------------------------------------------
using str_int_type = std::pair<std::string, int>;
using map_str_int_type = std::unordered_map<std::string, int>;
using vec_str_int_type = std::vector<str_int_type>;
// Mimic the Perl get_properties subroutine
static void get_properties(
int nfiles, // in: the number of input files
char* fname[], // in: the input file names
map_str_int_type& hash_ret) // out: a hash of properties
{
for (int i = 0; i < nfiles; ++i) {
std::ifstream llfile(fname[i]);
if (!llfile) {
std::cerr << "Error opening '" << fname[i] << "'\n";
return;
}
for (std::string line; std::getline(llfile, line); ) {
std::string word, count;
std::stringstream ssline(line);
std::getline(ssline, word, '\t');
std::getline(ssline, count);
hash_ret[word] += std::stoi(count);
}
}
}
#ifdef LLIL_SORT_VIA_MULTIMAP_L
// Convert a std::unordered_map<key, value> to a std::multimap<value, key>
static std::multimap<int, std::string> invert_map(const std::unordered_map<std::string, int>& m)
{
std::multimap<int, std::string> mm;
for (std::unordered_map<std::string, int>::const_iterator it = m.begin(); it != m.end(); ++it) {
mm.insert( std::make_pair(it->second, it->first) );
}
return mm;
}
#endif
int main(int argc, char* argv[])
{
if (argc < 2) {
std::cerr << "usage: llil file1 file2 ... >out.txt\n";
return 1;
}
#ifdef LLIL_SORT_VIA_MULTIMAP_L
std::cerr << "llil start (multimap version)\n";
#else
std::cerr << "llil start (sort version)\n";
#endif
time_t tstart1 = ::time(NULL);
// Create the hash of properties
map_str_int_type hash_ret;
get_properties(argc - 1, &argv[1], hash_ret);
time_t tend1 = ::time(NULL);
long taken1 = static_cast<long>(::difftime(tend1, tstart1) + 0.5);
std::cerr << "get_properties : " << taken1 << " secs\n";
// Sort descending by value, i.e. mimic this Perl code in C++:
//   sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href}
time_t tstart2 = ::time(NULL);
#ifdef LLIL_SORT_VIA_MULTIMAP_L
std::multimap<int, std::string> newmap = invert_map(hash_ret);
for (std::multimap<int, std::string>::reverse_iterator it = newmap.rbegin(); it != newmap.rend(); ++it) {
std::cout << it->second << '\t' << it->first << "\n";
}
#else
vec_str_int_type v( hash_ret.begin(), hash_ret.end() );
std::sort( v.begin(), v.end(),
[](const str_int_type& left, const str_int_type& right) { return right.second != left.second ? right.second < left.second : left.first < right.first; }
);
for ( const std::pair<const std::string, int>& n : v ) {
std::cout << n.first << '\t' << n.second << "\n";
}
#endif
time_t tend2 = ::time(NULL);
long taken2 = static_cast<long>(::difftime(tend2, tstart2) + 0.5);
long taken = static_cast<long>(::difftime(tend2, tstart1) + 0.5);
std::cerr << "sort + output : " << taken2 << " secs\n";
std::cerr << "total : " << taken << " secs\n";
return 0;
}
This is a straightforward translation of my Perl solution into C++ 11.
I chose a C++ std::unordered_map
because it seems the closest container to a Perl hash.
Unlike Perl, you can't sort a std::unordered_map on the fly in a one-liner.
So I tried two different ways to do it:
- Copy the std::unordered_map to a std::vector and sort it via the std::sort algorithm
- Copy the std::unordered_map<key, value> to a std::multimap<value, key> (see std::multimap)
Alternative suggestions welcome.
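For what it's worth, the trade-off between the two approaches can be sketched in Python with made-up data; the value negation below stands in for the multimap's value ordering:

```python
# Two ways to order a hash descending by value, mimicking the C++
# vector+sort and multimap approaches (data is made up for illustration).
counts = {"camel": 50, "pearl": 94, "hello": 12345, "dromedary": 70, "zebra": 94}

# 1. Sort the (key, value) items directly with a composite key,
#    like copying the unordered_map into a vector and calling std::sort.
via_sort = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# 2. Invert to (value, key) pairs first, like building a multimap keyed
#    by value.  Negating the value keeps the name tie-break ascending;
#    simply walking a value-ascending multimap in reverse would visit
#    equal-valued names in an unspecified order.
via_invert = [(k, -nv) for nv, k in sorted((-v, k) for k, v in counts.items())]

print(via_sort)
```

Note the tie-break detail: with equal counts (pearl and zebra above), the composite key keeps names ascending, which is exactly what the spec's diff-based testing relies on.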
Testing Correctness and Measuring Performance
To get a feel for performance and correctness, I used this simple and crude Perl program gen-llil.pl:
# gen-llil.pl
# Crude program to generate a big LLiL test file to use in benchmarks
# perl gen-llil.pl big2.txt 200 3 - produces a test file with size = 35,152,000 bytes
use strict;
use warnings;
use autodie;
{
my $ordmin = ord('a');
my $ordmax = ord('z') + 1;
# Generate a random word
sub gen_random_word {
my $word = shift; # word prefix
my $nchar = shift; # the number of random chars to append
for my $i (1 .. $nchar) {
$word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
}
return $word;
}
}
sub create_test_file {
my $fname = shift;
my $count = shift;
my $wordlen = shift;
open( my $fh_out, '>', $fname );
for my $c ( 'aaa' .. 'zzz' ) {
for my $i (1 .. $count) {
print {$fh_out} gen_random_word( $c, $wordlen ) . "\t" . 1 . "\n";
}
}
}
my $outfile = shift;
my $count = shift;
my $wordlen = shift;
$outfile or die "usage: $0 outfile count wordlen\n";
$count or die "usage: $0 outfile count wordlen\n";
print "generating test file '$outfile' with count '$count'\n";
create_test_file($outfile, $count, $wordlen);
print "file size=", -s $outfile, "\n";
I generated three biggish test files (size 35,152,000 bytes) with:
perl gen-llil.pl big1.txt 200 3
perl gen-llil.pl big2.txt 200 3
perl gen-llil.pl big3.txt 200 3
then ran the Perl version:
perl llil.pl big1.txt big2.txt big3.txt >perl.tmp
and the C++ version:
llil big1.txt big2.txt big3.txt >cpp.tmp
and verified they produced the same result:
diff perl.tmp cpp.tmp
Performance Results
To get the ball rolling, here are my initial performance results for the Perl and C++ solutions given above
running Strawberry Perl on my Windows Laptop PC with 8 GB RAM.
Perl (max Windows Private Bytes for the Perl process was about 2,236,648K)
> perl llil.pl big1.txt big2.txt big3.txt >perl.tmp
llil start
get_properties : 11 secs
sort + output : 74 secs
total : 85 secs
C++ (max Windows Private Bytes for the llil process was about 1,176,048K)
> llil big1.txt big2.txt big3.txt >cpp.tmp
llil start (sort version)
get_properties : 9 secs
sort + output : 7 secs
total : 16 secs
Verify correctness:
> diff cpp.tmp perl.tmp
I'd also find it interesting and educational to compare solutions to this little problem in a variety of languages.
So please feel free to respond away in your favourite language!
Some Other Rosetta Code Nodes
References Added Later
Updated: One minute after posting, fixed a line in llil.pl that had been left commented out for debugging. :)
Mar 2023: added link to follow-up Rosetta Test: Long List is Long.
Re: Rosetta Code: Long List is Long
by choroba (Cardinal) on Nov 30, 2022 at 23:25 UTC
As usual, if speed is your concern, you can trade it for space.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
warn "start\n";
my $tstart1 = time;
my (%by_count, %by_word);
while (<>) {
chomp;
my ($k, $v) = split /\t/, $_, 2;
if (exists $by_word{$k}) {
$v += $by_word{$k};
delete $by_count{ $by_word{$k} }{$k};
}
undef $by_count{$v}{$k};
$by_word{$k} = $v;
}
my $tend1 = time;
warn "get properties: ", $tend1 - $tstart1, " secs\n";
my $tstart2 = time;
for my $count (sort { $b <=> $a } keys %by_count) {
say "$_\t$count" for sort keys %{ $by_count{$count} };
}
my $tend2 = time;
warn "sort + output: ", $tend2 - $tstart2, " secs\n";
warn "total: ", $tend2 - $tstart1, " secs\n";
Comparison?
llil start
get_properties : 13 secs
sort + output : 85 secs
total : 98 secs
start
get properties: 21 secs
sort + output: 25 secs
total: 46 secs
The output is identical.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
get properties: 17 secs
sort + output: 18 secs
total: 35 secs
more than twice as fast as my original Perl version of 85 secs ...
with Windows Private Bytes increasing from 2,236,648K to 3,516,476K.
The following is my fun spin using dualvar, based on choroba.pl. I tried minimizing memory consumption by using one hash and one array. The memory consumption is similar to llil.pl and performance slightly faster than choroba.pl.
See also, parallel solution update.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use Scalar::Util qw{ dualvar };
warn "start\n";
my $tstart1 = time;
our (%by_word, @data);
while (<>) {
chomp;
my ($k, $v) = split /\t/, $_;
$by_word{$k} += $v;
}
my $tend1 = time;
warn "get properties: ", $tend1 - $tstart1, " secs\n";
my $tstart2 = time;
while (my ($k, $v) = each %by_word) {
push @data, dualvar($v, $k);
}
# output array of dualvars; sorted by number desc, string asc
say "$_\t".(0+$_) for sort { $b <=> $a } sort { $a cmp $b } @data;
my $tend2 = time;
warn "sort + output: ", $tend2 - $tstart2, " secs\n";
warn "total: ", $tend2 - $tstart1, " secs\n";
sort { $b <=> $a || $a cmp $b } @data;
Before:
sort + output: 19 secs
After:
sort + output: 32 secs
my (%by_count, %by_word); # slow cleanup/exiting
$ time perl choroba.pl big1.txt big2.txt big3.txt >cpp.tmp
start
get properties: 21 secs
sort + output: 25 secs
total: 46 secs
real 1m2.072s
user 1m0.182s
sys 0m1.885s
our (%by_count, %by_word); # fast cleanup/exiting
$ time perl choroba.pl big1.txt big2.txt big3.txt >cpp.tmp
start
get properties: 21 secs
sort + output: 25 secs
total: 46 secs
real 0m47.062s
user 0m45.505s
sys 0m1.549s
Which Perl version do you run? The final garbage collection takes about 6 secs on my Linux machine on both 5.26.1 and blead.
Re: Rosetta Code: Long List is Long -- parallel
by marioroy (Prior) on Dec 01, 2022 at 06:37 UTC
Greetings, eyepopslikeamosquito
> Please feel free to respond away with solutions...
I made a parallel demonstration. The results are taken from a 32-core Linux box. On Windows, run Cygwin's Perl for best performance.
Results
# Modification: our (%by_count, %by_word);
$ time perl choroba_mod.pl big1.txt big2.txt big3.txt >oo1.txt
start
get properties: 9 secs
sort + output: 14 secs
total: 23 secs
real 0m 23.083s
user 0m 22.568s
sys 0m 0.491s
# Run parallel on physical cores: max_workers => 4
$ time taskset -c 0-31 perl mce.pl big1.txt big2.txt big3.txt >oo2.txt
start
get properties + pre-sort: 14 secs
final sort + output: 2 secs
total: 16 secs
real 0m 15.434s
user 0m 52.223s
sys 0m 0.824s
# Verify correctness:
$ diff cpp.tmp oo1.txt # choroba
$ diff cpp.tmp oo2.txt # MCE 4 workers
Parallel code
Basically, workers process and gather the letters "a" through "z" in order. I was hoping to gather an array of dualvars, but had forgotten that serialization removes the numeric part.
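The orderly-gather idea can be sketched in Python — each worker tallies one slice of the alphabet and the results are concatenated in letter order (an illustration with toy data and a thread pool, not the actual MCE code):

```python
# Sketch of per-letter partitioning with an ordered gather: each task
# tallies one initial letter, and map() yields results in submission
# order, so the concatenation comes out sorted by name.
from concurrent.futures import ThreadPoolExecutor

lines = ["camel\t42", "pearl\t94", "camel\t8", "apple\t1", "apple\t2"]

def tally(first_letter):
    counts = {}
    for line in lines:
        word, count = line.split("\t")
        if word[0] == first_letter:
            counts[word] = counts.get(word, 0) + int(count)
    # Pre-sort this slice by name so the final merge is cheap.
    return sorted(counts.items())

with ThreadPoolExecutor(max_workers=4) as pool:
    merged = [kv for part in pool.map(tally, "abcdefghijklmnopqrstuvwxyz")
              for kv in part]

print(merged)
```

The remaining work after the gather is then only the sort by count, which is what the "final sort + output: 2 secs" timing above reflects.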
Re: Rosetta Code: Long List is Long
by Tux (Canon) on Dec 01, 2022 at 09:18 UTC
Not a real contestant, but a different approach using TSV/XS (just because I could not resist):
$ cat -te tt1.txt
camel^I42$
pearl^I94$
dromedary^I69$
$ cat -te tt2.txt
camel^I8$
hello^I12345$
dromedary^I1$
$ perl -MText::CSV_XS=csv -wE'csv(in=>*ARGV,sep=>"\t",out=>undef,on_in=>sub{$x{$_[1][0]}+=$_[1][1]});printf"%-20s %6d\n",$_,$x{$_}for sort keys%x' tt?.t
camel 50
dromedary 70
hello 12345
pearl 94
Enjoy, Have FUN! H.Merijn
Re: Rosetta Code: Long List is Long -- oneliner
by Discipulus (Canon) on Dec 01, 2022 at 08:00 UTC
Hello eyepopslikeamosquito,
> So please feel free to respond away in your favourite language!
If oneliner is a language..
cat lill01.txt
camel 42
pearl 94
dromedary 69
cat lill02.txt
camel 8
hello 12345
dromedary 1
perl 94
perl -lane "$r{$F[0]}+=$F[1]}{print qq($_\t$r{$_})for sort{$r{$b}<=>$r{$a}||$a cmp $b}keys %r" lill01.txt lill02.txt
# or shortened by one char because perl -a implies -n
perl -lae "$r{$F[0]}+=$F[1]}{print qq($_\t$r{$_})for sort{$r{$b}<=>$r{$a}||$a cmp $b}keys %r" lill01.txt lill02.txt
hello 12345
pearl 94
perl 94
dromedary 70
camel 50
Deparsed for newcomers..
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Rosetta Code: Long List is Long - GNU Parallel
by marioroy (Prior) on Feb 09, 2023 at 09:05 UTC
The SQLite demonstrations were interesting. This is about GNU Parallel, which includes parsort. Something like this is possible:
Linux box:
Note: See this thread for generating the big input files.
LC_ALL=C sort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C sort -k2nr >out.txt
real time: 12.671 seconds
LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C parsort -k2nr >out.txt
real time: 6.561 seconds
LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr >out.txt
real time: 3.615 seconds
Bigger box:
LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C parsort -k2nr >out.txt
real time: 6.520 seconds
LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr >out.txt
real time: 2.398 seconds
The big files are two column key-value pairs delimited by a tab. The output from the first sort command contains duplicate key names. Next, the tally-count command sums adjacent count fields of duplicate names. Its output contains unique names. Finally, parsort is called again to sort by sum in descending order and keyname in ascending order.
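Since tally-count itself is not reproduced here, a Python stand-in makes the middle step concrete: given name-sorted "name\tcount" lines, sum the counts of adjacent lines that share a name and emit each name once (the real filter is the C++ tally-count mentioned below):

```python
# Stand-in for tally-count: sum adjacent counts of duplicate names in
# name-sorted "name\tcount" lines.  Because the input is sorted,
# duplicates are always adjacent, so one name and one running total
# suffice - no hash of all keys is needed.
def tally_adjacent(lines):
    out, prev, total = [], None, 0
    for line in lines:
        name, count = line.split("\t")
        if name == prev:
            total += int(count)
        else:
            if prev is not None:
                out.append(f"{prev}\t{total}")
            prev, total = name, int(count)
    if prev is not None:
        out.append(f"{prev}\t{total}")
    return out

print(tally_adjacent(["apple\t1", "apple\t2", "camel\t42", "camel\t8"]))
```

The constant-memory streaming is what makes the pipeline safe for hundreds of input files.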
q - Text as Data:
I came across q - Run SQL directly on delimited files and multi-file sqlite databases on GitHub. Specify -tT for tab-delimited input and output. The dash indicates to read from standard input. The q tool is interesting.
cat big{1,2,3}.txt | ./linux-q -tT "select c1, sum(c2) from - group by c1 order by sum(c2) desc, c1" >out.txt
real time: 84.027 seconds
./linux-q -tT "select c1, sum(c2) from big[1-3].txt group by c1 order by sum(c2) desc, c1" >out.txt
real time: 89.539 seconds
tally-count.awk:
The Awk script becomes the bottleneck for the parallel sort. Notice above that there is no improvement on a bigger machine.
tally-count.cpp:
The C++ solution is noticeably faster. The great thing about the parsort/tally-count combination is being able to process hundreds of list files and not worry about exhausting memory capacity.
See also Grouped aggregate utility (like SQL GROUP BY)? on Stack Exchange.
GNU Parallel Citation
Tange, O. (2023, January 22). GNU Parallel 20230122 ('Bolsonaristas').
Zenodo. https://doi.org/10.5281/zenodo.7558957
At the time of testing, I was hoping to specify the number of threads using parsort but was unable to in a consistent fashion. So, I created a wrapper script named "parallel" that is placed in a path (/usr/local/bin) before (/usr/bin) where "parallel" itself resides.
#!/usr/bin/env bash
# Wrapper script for parallel.
# Whoa!!! GNU Parallel assumes you want to consume all CPU cores.
# Unfortunately, one cannot specify the number of threads for parsort.
CMD="/usr/bin/parallel"
if [[ -z "$PARALLEL_NUM_THREADS" ]]; then
exec "$CMD" "$@"
elif [[ "$#" -eq 1 && "$1" == "--number-of-threads" ]]; then
echo $PARALLEL_NUM_THREADS; exit 0
elif [[ "$1" == "-j" ]]; then
shift; shift; exec "$CMD" -j $PARALLEL_NUM_THREADS "$@"
else
exec "$CMD" -j $PARALLEL_NUM_THREADS "$@"
fi
The environment variable prevents parsort (/usr/bin/parallel, behind the scene) from going semi-wild on a big machine with many CPU cores.
export PARALLEL_NUM_THREADS=12
LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr >out.txt
The GNU Parallel "parsort" / "tally-count" combination may be useful for Chuma in the originating Long list is long thread. Chuma wrote, "the files are pretty large (up to a couple hundred MB), there are quite a few of them (2064)."
Re: Rosetta Code: Long List is Long - SQLite code
by marioroy (Prior) on Feb 08, 2023 at 02:17 UTC
I tried a key-value store implementation using SQLite. There are two scripts; one for importing files and another to output the results.
Note: See this thread for generating the big input files.
$ wc -l big1.txt big2.txt big3.txt
3515200 big1.txt
3515200 big2.txt
3515200 big3.txt
10545600 total
$ time bash llil_sql_imp1 hash1.db big1.txt big2.txt big3.txt
Importing big1.txt
Importing big2.txt
Importing big3.txt
real 0m12.913s
user 0m12.701s
sys 0m0.197s
$ time bash llil_sql_qry1 hash1.db >out.txt
real 0m6.297s
user 0m6.157s
sys 0m0.158s
$ ls -lh hash1.db
-rw-r--r-- 1 mario mario 330M Feb 7 22:55 hash1.db
Updated on 2023-02-08
llil_sql_imp1:
The optional -reset argument to the importer script removes the DB file. The .output to /dev/null is to silence SQLite when setting PRAGMA statements. The .output without an argument restores it back to standard output.
llil_sql_qry1:
This is the complementary script to output the data with proper formatting and sort order.
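Since the import and query scripts are collapsed above, here is a rough stdlib-Python sqlite3 sketch of the idea; the kv_store table name and schema are assumptions for illustration, not the actual scripts:

```python
# Rough sketch of the SQLite key-value approach using Python's stdlib
# sqlite3 (table name kv_store and schema are assumptions, not the
# actual llil_sql_imp1/llil_sql_qry1 scripts).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE kv_store (name TEXT, value INTEGER)")
con.executemany("INSERT INTO kv_store VALUES (?, ?)",
                [("camel", 42), ("pearl", 94), ("camel", 8), ("hello", 12345)])

# Duplicate names are allowed at import time; GROUP BY folds them at
# query time, ordered by summed value descending, then name ascending.
rows = con.execute("""SELECT name, SUM(value) FROM kv_store
                      GROUP BY name
                      ORDER BY SUM(value) DESC, name""").fetchall()
print(rows)
```

Letting the database do the GROUP BY keeps the import step a plain bulk insert, which is where the speed of the second variant below comes from.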
If the requirement calls for a one-time report, I tried another implementation using SQLite, without a view, trigger, or primary key. A GROUP BY clause is used for the SELECT statement, now that the DB has duplicate keys. The total time drops by 5 seconds, and the DB file is smaller. Note: Variant two is not compatible with variant one.
running:
$ wc -l big1.txt big2.txt big3.txt
3515200 big1.txt
3515200 big2.txt
3515200 big3.txt
10545600 total
$ time bash llil_sql_imp2 hash2.db big1.txt big2.txt big3.txt
Importing big1.txt
Importing big2.txt
Importing big3.txt
real 0m4.350s
user 0m4.298s
sys 0m0.049s
$ time bash llil_sql_qry2 hash2.db >out.txt
real 0m9.668s
user 0m9.537s
sys 0m0.122s
$ ls -lh hash2.db
-rw-r--r-- 1 mario mario 160M Feb 7 22:57 hash2.db
q - Text as Data:
I came across q - Run SQL directly on delimited files and multi-file sqlite databases on GitHub. Specify -T for tab-delimited output. You may like this tool. Check it out.
$ time ./linux-q -T "select name, sum(value) from hash.db:::kv_store group by name order by sum(value) desc, name" >out.txt
real 0m49.136s
user 0m48.667s
sys 0m0.423s
llil_sql_imp2:
Like variant one, the optional -reset argument to the importer script removes the DB file. The .output to /dev/null is to silence SQLite when setting PRAGMA statements. The .output without an argument restores it back to standard output.
llil_sql_qry2:
This produces output with proper formatting and sort order.
ATTACH DATABASE 'hash2.db' AS ha_2;
ATTACH DATABASE 'hash3.db' AS ha_3;
SELECT name, SUM(value) FROM (
SELECT * FROM main.kv_store UNION ALL
SELECT * FROM ha_2.kv_store UNION ALL
SELECT * FROM ha_3.kv_store
) GROUP BY name ORDER BY SUM(value) DESC, name;
DETACH DATABASE ha_2;
DETACH DATABASE ha_3;
running:
Each import takes ~ 1.5 seconds on my machine. The query takes the same amount of time as before.
$ bash llil_sql_imp2 hash1.db big1.txt
$ bash llil_sql_imp2 hash2.db big2.txt
$ bash llil_sql_imp2 hash3.db big3.txt
$ time bash llil_sql_qryn hash1.db hash2.db hash3.db >out.txt
real 0m9.650s
user 0m9.534s
sys 0m0.106s
llil_sql_qryn:
Re: Rosetta Code: Long List is Long :awk(1)+sort(1)
by parv (Parson) on Dec 12, 2022 at 07:59 UTC
#!/bin/sh
# Source: https://perlmonks.org/index.pl?node_id=11148773
#
# This is one shell implementation based on the problem specification ...
#
# Rosetta Code: Long List is Long, 20221130,
# by eyepopslikeamosquito
# https://perlmonks.org/?node_id=11148465
case $# in
0 )
printf "Give a list of files to sort.\n" >&2
exit 1
;;
esac
start=$( date '+%s' )
# Takes ~135 s.
awk ' \
{ cat_count[ $1 ] += $2 } \
END \
{ for ( cat in cat_count ) \
{ printf "%s\t%s\n", cat, cat_count[ cat ] } \
} \
' $@ \
| sort -k2,2rn -k1,1
end=$( date '+%s' )
printf "total time: %d s\n" $(( end - start )) >&2
54 seconds LANG=en_US.UTF-8
33 seconds LANG=C sort -k2,2rn -k1,1
23 seconds LANG=C sort -k1,1 | sort -k2,2rn
Testing:
#!/bin/sh
# https://www.perlmonks.org/?node_id=11148773
if [ $# -eq 0 ]; then
printf "Give a list of files to sort.\n" >&2
exit 1
fi
LANG=C
awk '
{ cat_count[ $1 ] += $2 }
END {
for ( cat in cat_count )
printf "%s\t%s\n", cat, cat_count[ cat ]
}
' $@ \
| sort -k1,1 | sort -k2,2rn
printf "total time: %d s\n" $SECONDS >&2
Good point about using LANG=C (it makes for a fairer comparison, since in my Python version I had set ascii encoding to parse the input, but not during sorting 🤔).
With that change, it takes ~99 s; with that change and the two sorts, ~60 s.
Read more... awk+sort, now with double the sorts (1168 Bytes)
Re: Rosetta Code: Long List is Long -- Python
by parv (Parson) on Dec 02, 2022 at 10:26 UTC
Environment: FreeBSD stable/13 (amd64) as guest in VirtualBox 6.1.30 on Windows 10 host; 2 CPUs & ~8 GB of RAM are allocated to the VM on ThinkPad X260 (Intel i5-6500U CPU, 16 GB RAM). perl is 5.32.1; python is 3.10.8.
I may attempt a Rust version, a language I am currently learning.
Your Perl version takes ~7-8 minutes (the time to collect data is about the same as in my Python version). Mind that the time values are from just one run each.
llil start
get_properties : 44 secs
sort + output : 429 secs
total : 473 secs
From choroba's attempt ...
start
get properties: 60 secs
sort + output: 63 secs
total: 123 secs
From my Python attempt (generally takes ~3-4 minutes) ...
# Manually rounded off after posting.
collect time : 37.6 s
sort_via_cmp_to_key time : 115.1 s
sort+format time : 115.1 s
total processing time : 152.7 s
output time : 22.0 s
total time : 174.7 s
... most of the time savings (of ~10-20 s) over my earlier Python attempts came from ...
- printing the output as one string instead of printing each line in a loop;
- sorting the dict via keys instead of sorting an iterator of tuples created from the dict;
- and using subtraction to compare the numbers instead of explicitly returning one of -1, 0, 1 based on a [><=] comparison.
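The subtraction point can be illustrated with functools.cmp_to_key: the comparator returns a negative, zero, or positive number directly instead of branching three ways (toy data, not the actual program):

```python
# Subtraction comparator via cmp_to_key (toy data illustrating the
# optimization, not the actual benchmark program).
from functools import cmp_to_key

counts = {"camel": 50, "pearl": 94, "hello": 12345, "dromedary": 70}

def cmp_counts(a, b):
    # Descending by count: subtraction gives the sign directly,
    # avoiding explicit -1/0/1 branches.
    return counts[b] - counts[a]

# Sorting the pre-name-sorted keys keeps the name-ascending tie-break,
# because Python's sort is stable.
ordered = sorted(sorted(counts), key=cmp_to_key(cmp_counts))
print(ordered)
```

(cmp_to_key still calls back into Python per comparison, which is why the native key-function sort discussed further down is so much faster.)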
Later I realized there was one extraneous loop; removed now, but the patch keeps the history.
... said patch (this filler text is to better separate from preceding code portion) ...
Read more... patch removes one extra loop, aligns the timing text (3 kB)
$ python3 llil.py big1.txt big2.txt big3.txt > out1.txt
collect time : 5.2 s
sort_via_cmp_to_key time: 17.7 s
sort+format time : 17.7 s
total processing time : 22.9 s
output time : 3.5 s
total time : 26.4 s
Python with Pyston-lite loaded automatically:
# Install pyston_lite_autoload to ~/.local/lib/python3.10/site-packages/.
$ python3 -m pip install --user pyston_lite_autoload
$ python3 llil.py big1.txt big2.txt big3.txt > out2.txt
collect time : 4.4 s
sort_via_cmp_to_key time: 15.8 s
sort+format time : 15.8 s
total processing time : 20.2 s
output time : 3.2 s
total time : 23.4 s
Pyston 2.3.5:
I also tried Pyston 2.3.5. But first, I needed to modify the function declarations to resolve "TypeError: 'type' object is not subscriptable".
$ diff llil.py llil2.py
53c53
< def collect( data_list :list ) ->dict[ str, int ]:
---
> def collect( data_list :list ) ->dict:
87c87
< def process( cat_count :dict[ str, int ] ) ->Generator[ str, None, None ]:
---
> def process( cat_count :dict ) ->Generator:
110c110
< def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->list[ str ]:
---
> def sort_via_cmp_to_key( cat_count :dict ) ->list:
$ ./pyston_2.3.5/bin/pyston3 llil2.py big1.txt big2.txt big3.txt > out3.txt
collect time : 3.7 s
sort_via_cmp_to_key time: 12.2 s
sort+format time : 12.2 s
total processing time : 15.9 s
output time : 3.0 s
total time : 18.8 s
Look at Pyston go; taking Python performance to a new level :) How does Python/Pyston compare to Perl? I ran the dualvar demonstration for comparison.
$ perl dualvar.pl big1.txt big2.txt big3.txt >out4.txt
start
get properties: 6 secs
sort + output: 16 secs
total: 22 secs
The Perl code consumes around 2340 MB versus 1790 MB for Python.
See also:
Python 3.12 Goals
Python is about to get faster
The Pyston Blog
Numba
def sort_native( cat_count ):
    once = sorted( cat_count.keys() )
    return sorted( once, key = lambda k: cat_count[ k ], reverse = True )
...?
At greater length: before eyepopslikeamosquito posted ...
sort { $href->{$b} <=> $href->{$a} } sort keys %{$href}
... I had failed to notice the sorting order. I decided to do the same for the Python (native) version instead of using the functools.cmp_to_key function. I also realized that I had been doing the sorting the other way around (sorting keys by value, followed by sorting by key), so was not getting the expected output (which is what had made me use cmp_to_key instead).
Replacing sort_via_cmp_to_key with ...
def sort_native( cat_count :dict[ str, int ] ) ->list[ str ]:
    """
    Returns:
        A `list` of sorted keys by decreasing order of values & increasing order of
        keys.
    Args:
        cat_count: A `dict` with string key & integer value.
    """
    once = sorted( cat_count.keys() )
    return sorted( once, key = lambda k: cat_count[ k ], reverse = True )
... reduces the sort time by ~10 times (~11-16 s; also produces the expected output) in my run environment.
time passes, so slowly🎶 ...
Putting the complete program here (~28-35 s) ...
Read more... Python version, updated with native sort (6 kB)
Re: Rosetta Code: Long List is Long (Updated Solutions)
by eyepopslikeamosquito (Archbishop) on Dec 05, 2022 at 09:32 UTC
As you might expect, I've created improved versions of my original Perl and C++ solutions posted in the root node.
Since I'm all out of fresh ideas to try, I'll post my improved solutions here.
Further performance improvement suggestions are welcome.
Improved Perl Version llil2.pl
# llil2.pl
# Example run: perl llil2.pl tt1.txt tt2.txt tt3.txt >out.txt
use strict;
use warnings;
# ----------------------------------------------------------------------
# LLiL specification
# ------------------
# A LLiL-format file is a text file.
# Each line consists of a lowercase name, a TAB character, and a non-negative integer count.
# That is, each line must match : ^[a-z]+\t\d+$
# For example, reading the LLiL-format files, tt1.txt containing:
# camel\t42
# pearl\t94
# dromedary\t69
# and tt2.txt containing:
# camel\t8
# hello\t12345
# dromedary\t1
# returns this hashref:
# $hash_ret{"camel"} = 50
# $hash_ret{"dromedary"} = 70
# $hash_ret{"hello"} = 12345
# $hash_ret{"pearl"} = 94
# That is, values are added for items with the same key.
#
# To get the required LLiL text, you must sort the returned hashref
# descending by value and insert a TAB separator:
# hello\t12345
# pearl\t94
# dromedary\t70
# camel\t50
# To make testing via diff easier, we further sort ascending by name
# for lines with the same value.
# ----------------------------------------------------------------------
# Function get_properties
# Read a list of LLiL-format files
# Return a reference to a hash of properties
sub get_properties
{
my $files = shift;    # in: reference to a list of LLiL-format files
my %hash_ret; # out: reference to a hash of properties
for my $fname ( @{$files} ) {
open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
while (<$fh>) {
chomp;
my ($word, $count) = split /\t/;
$hash_ret{$word} += $count;
}
close($fh) or die "error: close '$fname': $!";
}
return \%hash_ret;
}
# ----------------- mainline -------------------------------------------
@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;
warn "llil2 start\n";
my $tstart1 = time;
my $href = get_properties( \@llil_files );
my $tend1 = time;
my $taken1 = $tend1 - $tstart1;
warn "get_properties : $taken1 secs\n";
my $tstart2 = time;
# Using two sorts is waaay faster than one in Perl for some reason! (see [id://11148545])
for my $key ( sort { $href->{$b} <=> $href->{$a} } sort keys %{$href} ) {
print "$key\t$href->{$key}\n";
}
my $tend2 = time;
my $taken2 = $tend2 - $tstart2;
my $taken = $tend2 - $tstart1;
warn "sort + output : $taken2 secs\n";
warn "total : $taken secs\n";
Improved C++ Version llil2.cpp
// llil2.cpp. C++ 11 version of Perl llil.pl.
// llil2.cpp is faster than llil.cpp while also clarifying limits:
// - all keys should be less than 200 or so characters in length
// - numbers are 64 bit integers (max: 9,223,372,036,854,775,807)
// g++ compile on Linux:
// g++ -o llil2 -std=c++11 -Wall -O3 llil2.cpp
// This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
// Example run: llil2 tt1.txt tt2.txt tt3.txt >out.txt
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <cstdio>
#include <string>
#include <vector>
#include <set>
#include <map>
#include <algorithm>
#include <utility>
#include <iostream>
#include <fstream>
#include <sstream>
static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");
// ----------------------------------------------------------------------------
// Crude hack to see Windows Private Bytes in Task Manager by sleeping at
// program end (see also sleep hack at end of main)
// #include <chrono>
// #include <thread>
// For some performance hacks to speed up C++ I/O see:
// https://www.reddit.com/r/rust/comments/9xedap/how_to_achieve_fast_stdinstdout_io_suitable_for/
// The only one we use here is to prefer "\n" to std::endl to reduce stdout flushing
// ----------------------------------------------------------------------------
typedef long long llil_int_type;
using str_int_type = std::pair<std::string, llil_int_type>;
using map_str_int_type = std::map<std::string, llil_int_type>;
using vec_str_int_type = std::vector<str_int_type>;
// Mimic the Perl get_properties subroutine ----------------------------
// Limit line length and use lower level ANSI C functions to try to boost I/O performance
// TODO (maybe):
// - reading: Try ::setvbuf(fh, NULL, _IOFBF, 65536) or some such on input files
// - writing: Try ::setvbuf(stdout, stdout_buf, _IOFBF, sizeof(stdout_buf)) on stdout
//            ... or instead of writing to stdout, take an output file as a program argument
#define MAX_LINE_LEN_L 255
static void get_properties(
int nfiles, // in: the number of input files
char* fname[], // in: the input file names
map_str_int_type& hash_ret) // out: a hash of properties
{
FILE* fh;
char line[MAX_LINE_LEN_L+1];
char* word; char* count;
for (int i = 0; i < nfiles; ++i) {
fh = ::fopen(fname[i], "r");
if (fh == NULL) {
std::cerr << "Error opening '" << fname[i] << "'\n";
return;
}
while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
word = ::strtok(line, "\t");
count = ::strtok(NULL, "\n");
hash_ret[word] += ::atoll(count);
}
::fclose(fh);
}
}
// ---------------------------------------------------------------------
int main(int argc, char* argv[])
{
if (argc < 2) {
std::cerr << "usage: llil2 file1 file2 ... >out.txt\n";
return 1;
}
std::cerr << "llil2 start\n";
time_t tstart1 = ::time(NULL);
// Create the hash of properties
map_str_int_type hash_ret;
get_properties(argc - 1, &argv[1], hash_ret);
time_t tend1 = ::time(NULL);
long taken1 = static_cast<long>(::difftime(tend1, tstart1) + 0.5);
std::cerr << "get_properties : " << taken1 << " secs\n";
// Sort descending by value, i.e. mimic this Perl code in C++:
//   sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href}
time_t tstart2 = ::time(NULL);
vec_str_int_type v( hash_ret.begin(), hash_ret.end() );
std::sort( v.begin(), v.end(),
   [](const str_int_type& left, const str_int_type& right) {
      return right.second != left.second ? right.second < left.second
                                         : left.first < right.first;
   }
);
// Output the merged properties
for ( auto const& n : v ) { std::cout << n.first << '\t' << n.second << '\n'; }
time_t tend2 = ::time(NULL);
long taken2 = static_cast<long>(::difftime(tend2, tstart2) + 0.5);
long taken = static_cast<long>(::difftime(tend2, tstart1) + 0.5);
std::cerr << "sort + output : " << taken2 << " secs\n";
std::cerr << "total : " << taken << " secs\n";
// Hack to see Private Bytes in Windows Task Manager (uncomment next line so process doesn't exit too quickly)
// std::this_thread::sleep_for(std::chrono::milliseconds(90000000));
return 0;
}
Performance Analysis
Performance of my second Perl version improved from:
get_properties : 11 secs
sort + output : 74 secs
total : 85 secs
to:
get_properties : 11 secs
sort + output : 25 secs
total : 36 secs
update (much later in llil2grt.pl) to:
get_properties : 10 secs
sort + output : 20 secs
total : 30 secs
Performance of my second C++ version improved from:
get_properties : 9 secs
sort + output : 7 secs
total : 16 secs
to:
get_properties : 6 secs
sort + output : 6 secs
total : 12 secs
Memory use (Windows Private Bytes) was 2,896,104K for the Perl version (update: 2,657,968K), 1,176,048K for the C++ std::unordered_map version (update: 1,218,720K for std::map).
Update: Surprisingly, making a one line change in llil2.cpp above from:
using map_str_int_type = std::unordered_map<std::string, llil_int_type>;
// to (llil2a.cpp):
using map_str_int_type = std::map<std::string, llil_int_type>;
resulted in a significant speed improvement in llil2a (with similar Windows Private Bytes):
get_properties : 4 secs
sort + output : 5 secs
total : 9 secs
My second Perl version improved from 85 seconds to 36 seconds (update: much later to 30 seconds in llil2grt).
Incredibly, this spectacular performance improvement is entirely due to a very surprising one line change from:
sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href}
to:
sort { $href->{$b} <=> $href->{$a} } sort keys %{$href}
When more is known of the reason for this incredible difference, I'll update this node.
In contrast, I had to work a lot harder to improve the performance of my C++ version,
switching in desperation to some ugly lower level ANSI C functions in its get_properties function.
For cheap thrills, I also switched from a 32-bit to a 64-bit integer counter llil_int_type.
I suspect future minor performance tweaks may come from further improving I/O
(e.g. by fiddling with file buffering and buffer sizes).
As described here, this is by far Perl's best performance so far in my three serious Rosetta nodes:
- For the simple GoL algorithm, C++ was 12.5 times faster, memory use was 2.8 times less.
- For the complex GoL algorithm, C++ was 212.5 times faster; memory use was 10.1 times less.
- For the Long List is Long algorithm (this node), C++ was 3 times faster; memory use 2.2 times less.
I suspect this is because Long List is Long seems to be mostly I/O bound, while the GoL algorithms were mostly CPU bound.
Updated: Changed std::unordered_map to std::map in llil2.cpp above.
Some wiser monk might correct me, but I think this difference is because "simple" blocks can be internally optimized
(edited: typos corrected after reading the replies)
my @sorted = sort @unsorted;
my @sorted = sort { $a <=> $b } @unsorted;
my @sorted = sort { $a cmp $b } @unsorted;
are all internally optimized to something simple and efficient. Once the block gets more complicated, like having multiple statements and/or multiple expressions, every block has to be wrapped in scopes *for each call* to that sub. This is one of the reasons why the Guttman-Rosler Transform (GRT) is so efficient.
my @sorted = sort { $a->{id} <=> $b->{id} || $a->{foo} cmp $b->{foo} } @unsorted; # SLOW
my @sorted = map { $_->[1] } sort { $a->[0] cmp $b->[0] } map { pack "l>A*", $_->{id}, $_->{foo} } @unsorted; # FAST
Enjoy, Have FUN! H.Merijn
use Scalar::Util 'dualvar';
our @data;
while (my ($k, $v) = each %{$href}) {
push @data, dualvar($v, $k);
}
# output array of dualvars; sorted by number, string
print "$_\t".(0+$_)."\n" for sort { $b <=> $a } sort @data;
# llil2.pl sort + output : 19 secs
# llil3.pl sort + output : 16 secs -- dualvar
I noticed it, but (wrongly) assumed applying its remarkable two sort trick to my original
solution would have the same effect.
For completeness, I created llil2d.pl (shown below) by applying your dualvar array trick
to my original llil2.pl two-sort solution, with minimal changes.
I can confirm that it is indeed about 3 seconds faster and with slightly lower memory use.
Despite using Perl for 20 years, I'd never heard of dualvar before
(update: oops, turns out I had :).
Huge kudos to marioroy for unearthing this!
llil2d start
get_properties : 11 secs
sort + output : 22 secs
total : 33 secs
Memory use (Windows Private Bytes): 2,824,184K
(slightly lower than 2,896,104K for llil2.pl)
For completeness, here is my adjusted llil2d.pl:
# llil2d.pl. Remarkable dualvar version based on [marioroy]'s concoction.
# Example run: perl llil2d.pl tt1.txt tt2.txt tt3.txt >out.txt
use strict;
use warnings;
use feature qw{say};
use Scalar::Util qw{dualvar};
# ----------------------------------------------------------------------
# LLiL specification
# ------------------
# A LLiL-format file is a text file.
# Each line consists of a lowercase name, a TAB character and a non-negative integer count.
# That is, each line must match : ^[a-z]+\t\d+$
# For example, reading the LLiL-format files, tt1.txt containing:
# camel\t42
# pearl\t94
# dromedary\t69
# and tt2.txt containing:
# camel\t8
# hello\t12345
# dromedary\t1
# returns this hashref:
# $hash_ret{"camel"} = 50
# $hash_ret{"dromedary"} = 70
# $hash_ret{"hello"} = 12345
# $hash_ret{"pearl"} = 94
# That is, values are added for items with the same key.
#
# To get the required LLiL text, you must sort the returned hashref
# descending by value and insert a TAB separator:
# hello\t12345
# pearl\t94
# dromedary\t70
# camel\t50
# To make testing via diff easier, we further sort ascending by name
# for lines with the same value.
# ----------------------------------------------------------------------
# Function get_properties
# Read a list of LLiL-format files
# Return a reference to a hash of properties
sub get_properties
{
my $files = shift;   # in: reference to a list of LLiL-format files
my %hash_ret; # out: reference to a hash of properties
for my $fname ( @{$files} ) {
open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
while (<$fh>) {
chomp;
my ($word, $count) = split /\t/;
$hash_ret{$word} += $count;
}
close($fh) or die "error: close '$fname': $!";
}
return \%hash_ret;
}
# ----------------- mainline -------------------------------------------
@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;
warn "llil2d start\n";
my $tstart1 = time;
my $href = get_properties( \@llil_files );
my $tend1 = time;
my $taken1 = $tend1 - $tstart1;
warn "get_properties : $taken1 secs\n";
my $tstart2 = time;
my @data;
while ( my ($k, $v) = each %{$href} ) { push @data, dualvar($v, $k) }
# Using two sorts is waaay faster than one! (see [id://11148545])
for my $key ( sort { $b <=> $a } sort @data ) {
say "$key\t" . (0 + $key);
}
my $tend2 = time;
my $taken2 = $tend2 - $tstart2;
my $taken = $tend2 - $tstart1;
warn "sort + output : $taken2 secs\n";
warn "total : $taken secs\n";
Update: llil2grt.pl is about three seconds faster than llil2d.pl above, while using slightly less memory.
> llil2m big1.txt big2.txt big3.txt >llil2m.tmp
llil2m start
get_properties CPU time : 4.187 secs
vector copy CPU time : 0.612 secs
vector sort CPU time : 2.89 secs
write stdout CPU time : 1.383 secs
total CPU time : 9.074 secs
total wall clock : 9 secs
So far so good. This is just the fastest C++ solution found previously.
Beside myself with excitement, I defined MARIOROY_TWO_SORT_TRICK_L ... but sobbed when I saw:
> llil2m big1.txt big2.txt big3.txt >llil2m.tmp
llil2m start
get_properties CPU time : 4.286 secs
vector copy CPU time : 0.613 secs
vector stable sort CPU time : 2.977 secs
write stdout CPU time : 1.363 secs
total CPU time : 9.244 secs
total wall clock : 9 secs
Memory use (Windows Private Bytes): 1,218,716K
No improvement. At all. Aaaargh!
When I changed stable_sort to sort, the sort speed improved dramatically from 2.977 secs to just 0.750 secs ... but of course the output was wrong.
It seems there is a high performance price to pay for a stable sort in this case.
Oh well, at least we now have an alternative nine second solution in C++ using a stable sort.
Sorting Algorithms
BTW, from sort - perl pragma to control sort() behaviour
The sort pragma is now a no-op, and its use is discouraged.
...
The default sort has been stable since v5.8.0, and given this consistent behaviour for almost two decades, everyone has come to assume stability.
Given the above performance difference between stable and non-stable sorts, this looks like an unfortunate decision to me.
After all, there may be times when you really want the performance boost of a non-stable sort in Perl.
Update: Stroustrup, in his C++ Programming Language book,
notes that stable_sort improves towards N*log(N) when the system has sufficient memory, but degrades
to N*log(N)*log(N) otherwise ... so I guess the use case for needing a non-stable sort in Perl
is when you are sorting huge arrays.
See also Sorting algorithm (wikipedia).
llil2grt.cpp
Here is my C++ solution inspired by this llil2grt.pl Perl GRT solution.
Desperate to avoid the sort function after being burnt during my llil2m.cpp adventure,
and remembering jwkrahn's cute GRT trick of achieving a descending sort via
an ascending sort of negative numbers, I had to try a similar stunt in C++.
In llil2grt below, I avoided sort simply by
creating an inverted set_int_str_type container from the existing map_str_int_type container.
// llil2grt.cpp.
// Inspired by llil2grt.pl: improve sort performance via a negative count.
// g++ compile on Linux:
// g++ -o llil2grt -std=c++11 -Wall -O3 llil2grt.cpp
// This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
// Example run: llil2grt tt1.txt tt2.txt tt3.txt >out.txt
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <cstdio>
#include <cerrno>
#include <string>
#include <vector>
#include <set>
#include <map>
#include <algorithm>
#include <utility>
#include <iostream>
#include <fstream>
#include <sstream>
static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");
// ----------------------------------------------------------------------------
// Crude hack to see Windows Private Bytes in Task Manager by sleeping at
// program end (see also sleep hack at end of main)
// #include <chrono>
// #include <thread>
// ----------------------------------------------------------------------------
typedef long long llil_int_type;
using str_int_type = std::pair<std::string, llil_int_type>;
using int_str_type = std::pair<llil_int_type, std::string>;
using map_str_int_type = std::map<std::string, llil_int_type>;
using vec_str_int_type = std::vector<str_int_type>;
using set_int_str_type = std::set<int_str_type>;
// Mimic the Perl get_properties subroutine ----------------------------
// Limit line length and use ANSI C functions to try to boost performance
#define MAX_LINE_LEN_L 255
static void get_properties(
int nfiles, // in: the number of input files
char* fname[], // in: the input file names
map_str_int_type& hash_ret) // out: a hash of properties
{
FILE* fh;
char line[MAX_LINE_LEN_L+1];
char* word;
llil_int_type count;
for (int i = 0; i < nfiles; ++i) {
fh = ::fopen(fname[i], "r");
if (fh == NULL) {
std::cerr << "Error opening '" << fname[i] << "' : errno=" << errno << "\n";
continue;
}
while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
word = ::strtok(line, "\t");
count = ::atoll( ::strtok(NULL, "\n") );
hash_ret[word] -= count;
}
::fclose(fh);
}
}
// ---------------------------------------------------------------------
int main(int argc, char* argv[])
{
if (argc < 2) {
std::cerr << "usage: llil2grt file1 file2 ... >out.txt\n";
return 1;
}
std::cerr << "llil2grt start\n";
time_t tstart1 = ::time(NULL);
clock_t cstart1 = ::clock();
// Create the hash of properties
map_str_int_type hash_ret;
get_properties(argc - 1, &argv[1], hash_ret);
clock_t cend1 = ::clock();
double ctaken1 = (double) (cend1 - cstart1) / (double)CLOCKS_PER_SEC;
std::cerr << "get_properties CPU time : " << ctaken1 << " secs\n";
clock_t cstart2 = ::clock();
// To avoid calling sort(), try creating an inverted std::set container
// Note: negative count gives desired ordering (just like Perl GRT sort :)
set_int_str_type s;
for ( auto const& kv : hash_ret ) {
// Note: emplace_hint() was a lot faster than emplace()
s.emplace_hint( s.end(), kv.second, kv.first );
}
clock_t cend2s = ::clock();
// Output the (already sorted) std::set - no sort() function required!
// Note: fix up negative count via -n.first
for ( auto const& n : s ) { std::cout << n.second << '\t' << -n.first << '\n'; }
clock_t cend2 = ::clock();
time_t tend2 = ::time(NULL);
long ttaken = static_cast<long>(::difftime(tend2, tstart1) + 0.5);
double ctaken = (double) (cend2 - cstart1) / (double)CLOCKS_PER_SEC;
double ctaken2s = (double) (cend2s - cstart2) / (double)CLOCKS_PER_SEC;
double ctaken2o = (double) (cend2 - cend2s) / (double)CLOCKS_PER_SEC;
std::cerr << "emplace set sort CPU time : " << ctaken2s << " secs\n";
std::cerr << "write stdout CPU time : " << ctaken2o << " secs\n";
std::cerr << "total CPU time : " << ctaken << " secs\n";
std::cerr << "total wall clock time : " << ttaken << " secs\n";
// Hack to see Private Bytes in Windows Task Manager (uncomment next line so process doesn't exit too quickly)
// std::this_thread::sleep_for(std::chrono::milliseconds(90000000));
return 0;
}
I was pleasantly surprised to see this improved my fastest C++ solution from 9 seconds down to 7:
> llil2grt big1.txt big2.txt big3.txt >llil2grt.tmp
llil2grt start
get_properties CPU time : 4.252 secs
emplace set sort CPU time : 1.282 secs
write stdout CPU time : 1.716 secs
total CPU time : 7.254 secs
total wall clock time : 7 secs
Memory use (Windows Private Bytes): 1,626,728K
Though faster, it's no surprise memory use increased because instead of sorting in place,
you're creating a new set_int_str_type container.
Updated: Minor cosmetic changes were made to llil2grt.cpp after posting.
Added "Sorting Algorithms" section with a bit more information on stable sorts.
Re: Rosetta Code: Long List is Long (faster)
by Anonymous Monk on Dec 27, 2022 at 14:47 UTC
Please feel free to respond away with solutions...
Short version: the J script runs in ~5.7 sec using ~650 MB to produce exactly the same output, which makes it the fastest of all solutions so far.
(The caveat: "words" are supposed to be not too different in length; they are temporarily padded to the longest during the program run (i.e. not padded at all with the current test sample). Expect deterioration if, say, their lengths differ by 2 or more orders of magnitude, especially if just a few are very long. I didn't check.)
###################### (Prose and previous results, can be skipped)
I was shocked to notice this thread is almost a month old already. While I'm in no hurry and have been pursuing what follows at leisure and only rarely (kind of "for dessert"), it's better to publish this at long last, be it "final" optimized version or not (I'm sure it can be improved a lot), before the thread is dead cold and whoever participated have to make effort to read their own code because of time elapsed.
As a reference frame, here is an assortment of results of previous solutions, with my hardware:
llil2grt.pl (see 11148713) (fastest "pure core-only Perl"), Windows:
llil2grt start
get_properties : 16 secs
sort + output : 31 secs
total : 47 secs
Peak working set (memory): 2,744,200K
Same code & data & PC, Linux: (I didn't investigate why there is such a difference.)
llil2grt start
get_properties : 11 secs
sort + output : 20 secs
total : 31 secs
2,152,848 Kbytes of RAM were used
I assume that Judy (see 11148585) is best in both speed and memory for non-parallel Perl solutions, with the same caveat as at the very top: "words" are temporarily padded to fixed length, e.g. here to 10 bytes:
my_Judy_test start
get_properties : 13 secs
sort + output : 7 secs
total : 20 secs
349,124 Kbytes of RAM were used
Being a lazy bum, I didn't compile the C++ solutions (nor do I code in C++); here is a copy-paste from 11148969, which I assume is the best result so far, among C++ and all others. (For my PC, I expect the time to be worse.)
llil2grt start
get_properties CPU time : 4.252 secs
emplace set sort CPU time : 1.282 secs
write stdout CPU time : 1.716 secs
total CPU time : 7.254 secs
total wall clock time : 7 secs
Memory use (Windows Private Bytes): 1,626,728K
###################### (End of prose and previous results)
The code below generates the next message, with RAM usage taken from Windows Task Manager (to be on par with how it was measured for Perl), while the script pauses for a key (Ctrl-D or Ctrl-Z + Enter combo, as usual for Linux or Windows, respectively) after finishing:
Read and parse input: 1.636
Classify, sum, sort: 2.206
Format and write output: 1.895
Total time: 5.737
Finished. Waiting for a key...
Peak working set (memory): 657,792K
The injection of CR into output lines is only required on Windows (actually, not required at all) to later ensure no difference with output from Perl. The "magic constant" 3 for number width can be any, and is only used for intermediate step.
I had to make this code slightly less readable than it was during development, by somewhat aggressively re-using over and over again same variable names for words and nums, as data are processed and modified as script progresses. They were different "self-explanatory" names at each stage, but because arrays are huge, it's better to immediately over-write variable on successive assignments to conserve memory. "Erasing" throw-away helper arrays (similar to undef in Perl) serves same purpose.
Actually, during development I was playing with this toy dataset, here's original data and result:
text =: noun define
tango 1
charlie 2
bravo 1
foxtrot 4
alpha 3
foxtrot 1
bravo 1
foxtrot 7
)
NB. Do work here...
] text
foxtrot 12
alpha 3
bravo 2
charlie 2
tango 1
The script:
NB. -----------------------------------------------------------
NB. --- This file is "llil.ijs"
NB. --- Run as e.g.:
NB.
NB. jconsole.exe llil.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------
args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args
NUM_LENGTH =: 3
PAD_CHAR =: ' '
make_sel =: [: (1 2 0& |:) @ ,: ([ ,. ] - [: , [)
sort_some =: ([: /:~ {)`[`] }
text =: , freads " 0 fn_in
lf_pos =: I. text = LF
tab_pos =: I. text = TAB
words =: ((0 , >: }: lf_pos) make_sel tab_pos) ];.0 text
nums =: 0&". (tab_pos make_sel lf_pos) ; @: (<;.0) text
erase 'text' ; 'lf_pos' ; 'tab_pos'
t1 =: (6!:1) '' NB. time since engine start
nums =: words +//. nums
words =: ~. words
'words nums' =: (\: nums)& { &.:>"_1 words ; nums
starts =: I. ~: nums
ranges =: starts ,. (}. starts , # nums) - starts
count =: # starts
sort_words =: monad define
'ranges words' =. y
range =. ({. + i. @ {:) {. ranges
(}. ranges) ; range sort_some words
)
words =: > {: sort_words ^: count ranges ; words
erase 'starts' ; 'ranges'
t2 =: (6!:1) '' NB. time since engine start
nums =: (- NUM_LENGTH) ]\ NUM_LENGTH ": nums
text =: , words ,. TAB ,. (nums ,. CR) ,. LF
erase 'words' ; 'nums'
text =: (#~ ~: & PAD_CHAR) text
text fwrite fn_out
erase < 'text'
t3 =: (6!:1) '' NB. time since engine start
echo 'Read and parse input: ' , ": t1
echo 'Classify, sum, sort: ' , ": t2 - t1
echo 'Format and write output: ' , ": t3 - t2
echo 'Total time: ' , ": t3
echo ''
echo 'Finished. Waiting for a key...'
stdin ''
exit 0
I was shocked to notice this thread is almost a month old already ...
it's better to publish this at long last, be it "final" optimized version or not (I'm sure it can be improved a lot),
before the thread is dead cold and whoever participated have to make effort to read their own code because of time elapsed
Thanks for posting.
I suggest you just take your time and enjoy it without worrying too much about how old the thread is.
This is actually one of my favourite features of Perl Monks, so much so that I wrote a meditation about it:
Necroposting Considered Beneficial. :)
Your reply motivated me to finally get around to installing Linux on my newish Windows laptop.
Being lazy, I did this with a single wsl --install command
to install the default Ubuntu distribution of Linux from the Microsoft Store.
AFAIK, the main alternative is to install VMware, followed by multiple different Linux distros.
Anyways, after doing that I saw similar performance of both my fastest Perl and C++ versions.
For llil2grt.cpp on Windows 11:
llil2grt start
get_properties CPU time : 4.252 secs
emplace set sort CPU time : 1.282 secs
write stdout CPU time : 1.716 secs
total CPU time : 7.254 secs
total wall clock time : 7 secs
On Ubuntu WSL2 Linux (5.15.79.1-microsoft-standard-WSL2), running on the same hardware,
compiled with the identical g++ -o llil2grt -std=c++11 -Wall -O3 llil2grt.cpp, we see it runs slightly faster:
llil2grt start
get_properties CPU time : 3.63153 secs
emplace set sort CPU time : 1.09085 secs
write stdout CPU time : 1.41164 secs
total CPU time : 6.13412 secs
total wall clock time : 6 secs
Sadly, it's becoming increasingly obvious that to make this run significantly faster,
I'll probably have to change the simple and elegant:
hash_ret[word] -= count;
into something much uglier,
possibly containing "emplace_hint" or other std::map claptrap
... and I just can't bring myself to do that. :)
More attractive is to leave the simple and elegant one-liner alone and instead try to
inject a faster custom std::map memory allocator
(I have no idea how to do that yet).
Conversely, my fastest Perl solution llil2grt.pl runs slightly slower on Ubuntu:
llil2grt start
get_properties : 10 secs
sort + output : 22 secs
total : 32 secs
compared to 10, 20, 30 secs on Windows 11.
Perl v5.34.0 on Linux vs Strawberry Perl v5.32.0 on Windows.
Update: this shortened version, llil2cmd-long.pl, runs a bit faster on Ubuntu (but not on Windows):
llil2cmd-long start
get_properties : 7 secs
sort + output : 21 secs
total : 28 secs
The injection of CR into output lines is only required on Windows (actually, not required at all)
Yes, you're right, Windows nowadays seems perfectly happy with text files terminated with \n rather
than the traditional DOS \r\n.
By default, Perl and C++ both output text files with "\n" on Unix and "\r\n" on Windows.
I'll update my test file generator to generate identical "\n" terminated files on both Unix and Windows.
Update: Test file generators updated here: Re^3: Rosetta Code: Long List is Long (Test File Generators).
Curiously, \n seems to be slower than \r\n on Windows if you don't set binmode!
I am guessing that chomp is slower with \n than with \r\n on a Windows text stream.
Update: Are you running Strawberry Perl on Windows? Which version? (Trying to understand why your Windows Perl seems slower than mine).
Update: The processor and SSD disk (see Novabench top scoring disks) on my HP laptop:
Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz, 1501 Mhz, 4 Core(s), 8 Logical Processor(s)
Disk (238 GB SSD): Intel Optane+238GBSSD (ave sequential read 513 MB/s; ave sequential write 280 MB/s) - score 73
Updated: Noted that llil2grt.pl on Ubuntu Linux runs slightly slower on Linux than Windows, along with detailed timings.
Clarified that I'm running Windows 11.
Added more detail on my laptop hardware.
Mentioned llil2cmd-long.pl, developed later.
By "same PC" for a test under Windows vs. Linux, I meant classic dual boot with GRUB, as I have; but perhaps virtualization is no longer a hindrance, and in any case not here. I compiled llil2grt.cpp in both OSes (with the command line and options you advised); here's what I got:
$ ./llil2grt big1.txt big2.txt big3.txt >out.txt
llil2grt start
get_properties CPU time : 3.67015 secs
emplace set sort CPU time : 1.19724 secs
write stdout CPU time : 1.52074 secs
total CPU time : 6.3882 secs
total wall clock time : 6 secs
>llil2grt.exe big1.txt big2.txt big3.txt >out.txt
llil2grt start
get_properties CPU time : 5.577 secs
emplace set sort CPU time : 1.675 secs
write stdout CPU time : 2.484 secs
total CPU time : 9.736 secs
total wall clock time : 10 secs
I observe the same relative difference running the Perl script. So it looks like it's not an issue of the Perl and Strawberry version you asked me about (which is the latest available "5.32.1.1-64bit-PDL", BTW). I, further, compiled llil2grt.cpp using the minGW shell and g++ which came with the older 5.26 Strawberry, and got the same 10 secs.
I'm clueless why this PC is slow with Windows. Perhaps either MS or Dell issued a patch in recent years to address Kaby Lake CPU (mine is i5-7500T) "vulnerability"? I vaguely remember it was said performance would suffer if such patch to guard against mythical attackers would be applied. Just a guess, sorry if it's ridiculous. On the other hand, J script performs the same in both OSes.
Speaking of which, to return to "interpreted language is faster than compiled C++" -- of course it's dataset bias to a large extent. I ran test with "long" files from another test file generator you suggested previously:
$ ./llil2grt long1.txt long2.txt long3.txt >out.txt
llil2grt start
get_properties CPU time : 0.559273 secs
emplace set sort CPU time : 0.004393 secs
write stdout CPU time : 0.003275 secs
total CPU time : 0.567 secs
total wall clock time : 1 secs
$ jconsole llil.ijs long1.txt long2.txt long3.txt out_j.txt
Read and parse input: 1.50791
Classify, sum, sort: 0.70953
Format and write output: 0.00430393
Total time: 2.22175
$ diff out.txt out_j.txt
$
(NUM_LENGTH changed to 10, otherwise same code)
cd ~/Downloads
wget http://www.jsoftware.com/download/j903/install/j903_linux64.tar.gz
tar xzf j903_linux64.tar.gz -C ~/
Script modification:
Next, I made a change to not output CR, so I was able to compare against the original output without CR. For some reason, pressing a key or the return key does not exit the script (resolved by removing two lines).
$ diff llil.ijs llil2.ijs
50c50
< text =: , words ,. TAB ,. (nums ,. CR) ,. LF
---
> text =: , words ,. TAB ,. nums ,. LF
64,65d63
< echo 'Finished. Waiting for a key...'
< stdin ''
Non-AVX results:
$ ~/j903/bin/jconsole llil2.ijs big1.txt big2.txt big3.txt out.txt
Read and parse input: 0.850145
Classify, sum, sort: 2.08596
Format and write output: 1.24986
Total time: 4.18597
AVX2 results:
From the wiki, "The initial installation is non-AVX and finishing the steps upgrades to AVX or AVX2 as appropriate."
# Update J engine to AVX or AVX2, depending on the CPU.
$ ~/j903/updateje.sh
Engine: j903/j64avx2/linux
Release-b: commercial/2022-01-28T04:13:29
Library: 9.03.08
Platform: Linux 64
Installer: J903 install
InstallPath: /home/mario/j903
Contact: www.jsoftware.com
$ ~/j903/bin/jconsole llil2.ijs big1.txt big2.txt big3.txt out.txt
Read and parse input: 0.789199
Classify, sum, sort: 1.56741
Format and write output: 1.12442
Total time: 3.48103
Anonymous Monk, blessings and grace. Thank you for the J Programming Language demonstration.
|
The J script runs in ~5.7 sec using ~650 MB to produce exactly the same output, which makes it the fastest of all solutions so far.
The good news is that I've now found two different C++ versions that are faster.
The bad news is that they're much uglier than
my original llil2grt.cpp (which ran in about 6.2 seconds on Ubuntu).
After getting nowhere trying to speed up llil2grt.cpp, I reluctantly decided that the simple and elegant
hash_ret[word] -= count
had to go.
First the timings on Ubuntu. Because there's quite a bit
of natural variation between runs, I did three runs of each:
llil2vec start
get_properties CPU time : 3.06313 secs
emplace set sort CPU time : 0.923435 secs
write stdout CPU time : 1.392 secs
total CPU time : 5.37868 secs
total wall clock time : 6 secs
llil2vec start
get_properties CPU time : 3.1567 secs
emplace set sort CPU time : 0.970294 secs
write stdout CPU time : 1.22305 secs
total CPU time : 5.35015 secs
total wall clock time : 5 secs
llil2vec start
get_properties CPU time : 3.32019 secs
emplace set sort CPU time : 1.08277 secs
write stdout CPU time : 1.22461 secs
total CPU time : 5.62766 secs
total wall clock time : 5 secs
Ave CPU time: 5.5 secs
(Memory use (Windows Private Bytes): 1,225,580K)
llil2vec (fixed string length=6) start
get_properties CPU time : 2.09353 secs
emplace set sort CPU time : 0.795144 secs
write stdout CPU time : 1.20994 secs
total CPU time : 4.09871 secs
total wall clock time : 4 secs
llil2vec (fixed string length=6) start
get_properties CPU time : 2.2078 secs
emplace set sort CPU time : 0.707252 secs
write stdout CPU time : 1.14867 secs
total CPU time : 4.06383 secs
total wall clock time : 4 secs
llil2vec (fixed string length=6) start
get_properties CPU time : 2.39225 secs
emplace set sort CPU time : 1.0033 secs
write stdout CPU time : 1.22765 secs
total CPU time : 4.62331 secs
total wall clock time : 4 secs
Ave CPU time: 4.3 secs
(Memory use (Windows Private Bytes): 814,940K)
The first one still uses std::string so there are no limitations on word length.
The second one is limited to a word length of six characters, and so works with
big1.txt, while crashing spectacularly with long1.txt.
Funny, I'd originally forgotten that
way back in 2014, when desperately trying to make a search program run a hundred times faster,
I'd stumbled upon this classic quote:
That's not a minor disadvantage, we're talking about things
being 50 to 100 times slower with a linked list.
Compactness matters. Vectors are more compact than lists.
And predictable usage patterns matter enormously.
With a vector, you have to shove a lot of elements over,
but caches are really really good at that.
When you traverse lists you keep doing random access ...
so you are actually random accessing your memory and you
are maximizing your cache misses, which is exactly the
opposite of what you want.
-- Bjarne Stroustrup: Why you should avoid Linked Lists (3:30-5:00) (update: see also)
Back then, because each memory access into my (huge) 4GB lookup tables was essentially random,
most memory accesses missed the L1 cache, missed the L2 cache, missed the L3 cache,
then waited for the cache line to be loaded into all three caches from main memory,
while often incurring a TLB cache miss, just to rub salt into the wounds.
Thankfully, there are no 4 GB lookup tables in this problem.
But hashes (and linked lists) still tend to be mind-bogglingly slower than vectors on modern hardware,
as indicated by Stroustrup's quote above.
So I reluctantly decided that the simple and elegant hash_ret[word] -= count had to go.
Though I really hate this new std::vector based solution, it does run faster than my earlier std::map based one.
This is just a first attempt, further improvements are possible.
The llil2vec.cpp source code is shown below. Though there are two sets of timings, there is only one program,
compiled with or without MAX_STR_LEN_L defined.
Suggestions for improvements welcome.
llil2vec.cpp
# gcc 12.2.1
$ g++ -o llil2vec -std=c++11 -Wall -O3 -march=native -mtune=skylake -falign-functions=32 -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 llil2vec.cpp
$ ./llil2vec big1.txt big2.txt big3.txt >out.txt
llil2vec (fixed string length=6) start
get_properties CPU time : 1.89609 secs
emplace set sort CPU time : 0.544972 secs
write stdout CPU time : 0.842451 secs
total CPU time : 3.28355 secs
total wall clock time : 4 secs
# clang version 15.0.4
$ clang++ -o llil2vec -std=c++11 -Wall -O3 -march=native -mtune=skylake -falign-functions=32 -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 llil2vec.cpp
$ ./llil2vec big1.txt big2.txt big3.txt >out.txt
llil2vec (fixed string length=6) start
get_properties CPU time : 1.67073 secs
emplace set sort CPU time : 0.474207 secs
write stdout CPU time : 0.828457 secs
total CPU time : 2.97343 secs
total wall clock time : 3 secs
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties CPU time : 3.06708 secs
emplace set sort CPU time : 0.903617 secs
write stdout CPU time : 1.23459 secs
total CPU time : 5.20544 secs
total wall clock time : 5 secs
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties CPU time : 3.4832 secs
emplace set sort CPU time : 1.10209 secs
write stdout CPU time : 0.939805 secs
total CPU time : 5.52519 secs
total wall clock time : 6 secs
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec start
get_properties CPU time : 3.42605 secs
emplace set sort CPU time : 0.916222 secs
write stdout CPU time : 1.00049 secs
total CPU time : 5.343 secs
total wall clock time : 5 secs
> diff f.tmp vec.tmp
This is an average time of 5.35 secs (compared to 5.5 secs in the original post).
New Timings for Short String (MAX_STR_LEN_L=6) Version
> For some reason, short string version is faster when compiled with clang++
> (while long string version is not)
> clang++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 1.87217 secs
emplace set sort CPU time : 0.589238 secs
write stdout CPU time : 0.842179 secs
total CPU time : 3.30369 secs
total wall clock time : 3 secs
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 1.95909 secs
emplace set sort CPU time : 0.610479 secs
write stdout CPU time : 0.959859 secs
total CPU time : 3.52965 secs
total wall clock time : 4 secs
> ./llil2vec big1.txt big2.txt big3.txt >f.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 1.86549 secs
emplace set sort CPU time : 0.608097 secs
write stdout CPU time : 0.857176 secs
total CPU time : 3.33087 secs
total wall clock time : 3 secs
> diff f.tmp vec.tmp
This is an average time of 3.4 secs (compared to 4.3 secs in the original post).
New version of llil2vec.cpp
Interim OpenMP Versions
Note: The interim versions below are just early (conservative) openmp-safe versions made thread-safe by
expunging the dreaded strtok function.
The one with the most potential for multi-threading seems to be llil3vec.cpp because
it uses vectors which can be safely used lock-free in thread-local vec_loc vectors, as shown by
marioroy in llil4vec.cpp.
While replacing std::sort and std::stable_sort with __gnu_parallel versions to sort std::vectors
is a no-brainer, it seems problematic to parallelise access to a global std::map (as used in llil3grt.cpp and llil3m.cpp below)
because I think you'd need some sort of locking to safely allow multiple threads to update a shared global std::map.
Ideas welcome.
Interim version of llil3vec.cpp
Timings of llil3vec
> g++ -o llil3vec -std=c++20 -Wall -O3 llil3vec.cpp
> time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 1.47895 secs
emplace set sort CPU time : 0.592783 secs
write stdout CPU time : 0.832071 secs
total CPU time : 2.90392 secs
total wall clock time : 3 secs
real 0m3.217s
user 0m2.806s
sys 0m0.412s
> diff f.tmp vec.tmp
> g++ -o llil3vec -std=c++20 -fopenmp -Wall -O3 llil3vec.cpp
> time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 2.5809 secs
emplace set sort CPU time : 0.793355 secs
write stdout CPU time : 0.860855 secs
total CPU time : 4.23521 secs
total wall clock time : 3 secs
real 0m2.673s
user 0m4.073s
sys 0m0.465s
> diff f.tmp vec.tmp
For the -fopenmp version, note that the Unix real time is lower but the user time is higher.
Interim version of llil3grt.cpp
Interim version of llil3m.cpp
Timings
llil3grt (fixed string length=6) start
- openmp version
get_properties CPU time : 2.34487 secs
emplace set sort CPU time : 0.942098 secs
write stdout CPU time : 0.855157 secs
total CPU time : 4.14223 secs
total wall clock time : 4 secs
real 0m4.752s
user 0m4.241s
sys 0m0.511s
$ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
llil3m (fixed string length=6) start
- openmp version
get_properties CPU time : 2.40089 secs
vector copy CPU time : 0.544294 secs
vector stable sort CPU time : 5.84981 secs
write stdout CPU time : 0.805509 secs
total CPU time : 9.6006 secs
total wall clock : 4 secs
real 0m4.713s
user 0m9.238s
sys 0m0.679s
llil3m (fixed string length=6) start
- openmp version
get_properties CPU time : 2.3031 secs
vector copy CPU time : 0.544613 secs
vector sort CPU time : 2.64047 secs
write stdout CPU time : 0.791186 secs
total CPU time : 6.27946 secs
total wall clock : 4 secs
real 0m4.182s
user 0m6.128s
sys 0m0.473s
$ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
llil3m (fixed string length=6) start
get_properties CPU time : 2.28072 secs
vector copy CPU time : 0.471632 secs
vector stable sort CPU time : 0.735042 secs
write stdout CPU time : 0.67514 secs
total CPU time : 4.16263 secs
total wall clock : 4 secs
real 0m4.470s
user 0m4.159s
sys 0m0.311s
$ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
llil3m (fixed string length=6) start
get_properties CPU time : 2.30618 secs
vector copy CPU time : 0.473185 secs
vector sort CPU time : 1.13081 secs
write stdout CPU time : 0.668702 secs
total CPU time : 4.57897 secs
total wall clock : 4 secs
real 0m4.889s
user 0m4.558s
sys 0m0.331s
$ time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 1.46864 secs
emplace set sort CPU time : 0.630914 secs
write stdout CPU time : 0.847912 secs
total CPU time : 2.9476 secs
total wall clock time : 3 secs
real 0m3.233s
user 0m2.852s
sys 0m0.381s
$ time ./llil3grt big1.txt big2.txt big3.txt >f.tmp
llil3grt (fixed string length=6) start
get_properties CPU time : 2.34418 secs
emplace set sort CPU time : 0.901415 secs
write stdout CPU time : 0.90025 secs
total CPU time : 4.14595 secs
total wall clock time : 5 secs
real 0m4.784s
user 0m4.232s
sys 0m0.551s
References Added Later
Updated: The source code for llil2vec.cpp was updated after the node was first created
based on feedback from the replies. Added interim version llil3vec.cpp (with some marioroy improvements added).
Plus interim versions of llil3grt.cpp and llil3m.cpp.
Added a few missing #include <array> and reference for OpenMP Little Book (thanks marioroy).
The good news is... The bad news is...
The good news is that I bring good news only! :) The modified J script is faster, more versatile, uses significantly less RAM, and has been tested with the 9.04 engine to parallelize obvious low-hanging fruit for an additional speed boost.
NB. -----------------------------------------------------------
NB. --- This file is "llil4.ijs"
NB. --- Run as e.g.:
NB.
NB. jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------
pattern =: 0 1
NB. ========> This line has a star in its right margin =======> NB. *
args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args
NB. PAD_CHAR =: ' '
filter_CR =: #~ ~: & CR
make_more_space =: ' ' I. @ ((LF = ]) +. (TAB = ])) } ]
find_spaces =: I. @: = & ' '
read_file =: {{
'fname pattern' =. y
text =. make_more_space filter_CR fread fname
selectors =. (|.!.0 , {:) >: find_spaces text
width =. # pattern
height =. width <. @ %~ # selectors
append_diffs =. }: , 2& (-~/\)
shuffle_dims =. (1 0 3 & |:) @ ((2, height, width, 1) & $)
selectors =. append_diffs selectors
selectors =. shuffle_dims selectors
literal =. < @: (}:"1) @: (];. 0) & text "_1
numeric =. < @: (0&".) @: (; @: (<;. 0)) & text "_1
extract =. pattern & {
using =. 1 & \
or_maybe =. `
,(extract literal or_maybe numeric) using selectors
}}
read_many_files =: {{
'fnames pattern' =. y
,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames NB. *
}}
'words nums' =: read_many_files fn_in ; pattern
t1 =: (6!:1) '' NB. time since engine start
'words nums' =: (~. words) ; words +//. nums NB. *
'words nums' =: (\: nums)& { &.:>"_1 words ; nums
words =: ; nums < @ /:~/. words
t2 =: (6!:1) '' NB. time since engine start
text =: , words ,. TAB ,. (": ,. nums) ,. LF
erase 'words' ; 'nums'
text =: (#~ ~: & ' ') text
text fwrite fn_out
erase < 'text'
t3 =: (6!:1) '' NB. time since engine start
echo 'Read and parse input: ' , ": t1
echo 'Classify, sum, sort: ' , ": t2 - t1
echo 'Format and write output: ' , ": t3 - t2
echo 'Total time: ' , ": t3
echo ''
echo 'Finished. Waiting for a key...'
stdin ''
exit 0
The code above doesn't (yet) include any 9.04 features and runs OK with 9.03, but I found 9.04 slightly faster in general. I also found 9.04 a bit faster on Windows, the opposite of what I saw for 9.03 (the script ran faster on Linux); let's shrug that off as 9.04 beta status and/or my antique PC. Results below are for beta 9.04 on Windows 10 (RAM usage taken from Windows Task Manager):
> jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
Read and parse input: 1.501
Classify, sum, sort: 2.09
Format and write output: 1.318
Total time: 4.909
Finished. Waiting for a key...
Peak working set (memory): 376,456K
There are 3 star-marked lines. To patch in the new 9.04 features and enable parallelization, replace them with these counterparts:
{{ for. i. 3 do. 0 T. 0 end. }} ''
,&.>/"2 (-#pattern) ]\ ,;(read_file @:(; &pattern)) t.'' "0 fnames
'words nums' =: (~.t.'' words) , words +//. t.'' nums
As you see, the 1st line replaces a comment, and the 2nd and 3rd lines require just minor touches. The 2nd line launches reading and parsing of the input files in parallel. The 3rd line parallelizes filtering for unique words and summing numbers according to the word classification; kind of redundant double work, even as it was, as I see it. The 1st line starts 3 additional worker threads. I don't have more cores in my CPU anyway, and this script has no work easily dispatched to more workers. Then:
Read and parse input: 0.992
Classify, sum, sort: 1.849
Format and write output: 1.319
Total time: 4.16
I would call my parallelization attempt, however crude it was, a success. Next is output for our second "official" dataset in this thread:
> jconsole.exe llil4.ijs long1.txt long2.txt long3.txt out.txt
Read and parse input: 1.329
Classify, sum, sort: 0.149
Format and write output: 0.009
Total time: 1.487
########################################################
These are my results for the latest C++ solution (compiled using g++), to compare my efforts against:
$ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
llil2vec start
get_properties CPU time : 3.41497 secs
emplace set sort CPU time : 1.04229 secs
write stdout CPU time : 1.31578 secs
total CPU time : 5.77311 secs
total wall clock time : 5 secs
$ ./llil2vec_11149482 long1.txt long2.txt long3.txt >vec.tmp
llil2vec start
get_properties CPU time : 1.14889 secs
emplace set sort CPU time : 0.057158 secs
write stdout CPU time : 0.003307 secs
total CPU time : 1.20943 secs
total wall clock time : 2 secs
$ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
llil2vec (fixed string length=6) start
get_properties CPU time : 2.43187 secs
emplace set sort CPU time : 0.853877 secs
write stdout CPU time : 1.33636 secs
total CPU time : 4.62217 secs
total wall clock time : 5 secs
I noticed that the new C++ code, supposed to be faster, is actually slower than llil2grt on the "long" dataset, one of the two "official" datasets used in this thread.
Re: Rosetta Code: Long List is Long
by Anonymous Monk on Dec 02, 2022 at 00:22 UTC
Maybe it is not worth anything; it just exploits a loophole of ridiculous 6-byte a..z fixed-length keys, with each count fitting in a single byte. $r is just to quickly convert decimal to/from base-27, nothing more. Sure, it all could be optimized. Excluding 0 from the base-27 representation (i.e. using 27 instead of 26) avoids padding with leading zeroes. It follows that the $buf_in length is a wasteful 387,420,489 instead of 308,915,776, but, what the heck, I can be royally wasteful with this implementation. Decrementing the count down from 255 instead of just incrementing from 0 is a further silly optimization I see at this late time of day, but I surely overlook better improvements. Of course $MAX is 27**6 (or would be 26**6).
Can't confirm what eyepopslikeamosquito says about memory: with the original llil.pl I see the Working Set go to approx. 2.9 GB. Mine doesn't exceed ~530 MB.
llil start
get_properties : 15 secs
sort + output : 103 secs
total : 118 secs
my_test start
get_properties : 31 secs
sort + output: 10 secs
total: 41 secs
...
use strict;
use warnings;
use feature 'say';
use Math::GMPz ':mpz';
use Sort::Packed 'sort_packed';
@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;
warn "my_test start\n";
my $tstart1 = time;
my $r = Rmpz_init;
Rmpz_set_str( $r, 'zzzzzz' =~ tr/a-z1-9/1-9a-z/r, 27 );
my $MAX = Rmpz_get_ui( $r );
my ( $buf_in, $buf_out ) = ( "\xFF" x $MAX, '' );
for my $fname ( @llil_files ) {
open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
while ( <$fh> ) {
chomp;
my ( $word, $count ) = split /\t/;
$word =~ tr/a-z1-9/1-9a-z/;
Rmpz_set_str( $r, $word, 27 );
vec( $buf_in, Rmpz_get_ui( $r ), 8 ) -= $count;
}
close( $fh ) or die "error: close '$fname': $!";
}
while ( $buf_in =~ /[^\xFF]/g ) {
Rmpz_set_ui( $r, @- );
$buf_out .= pack 'aa6', $&, Rmpz_get_str( $r, 27 ) =~ tr/1-9a-z/a-z/r
}
my $tend1 = time;
warn "get_properties : ", $tend1 - $tstart1, " secs\n";
my $tstart2 = time;
sort_packed C7 => $buf_out;
while ( $buf_out ) {
my ( $count, $word ) = unpack 'Ca6', substr $buf_out, 0, 7, '';
printf "%s\t%d\n", $word, 255 - $count
}
my $tend2 = time;
warn "sort + output: ", $tend2 - $tstart2, " secs\n";
warn "total: ", $tend2 - $tstart1, " secs\n";
What follows is fiction, not implemented in code, and can be ignored. I said 'ridiculous' above, but in fact I do remember the original LLIL thread; I'm not sure now, but back then I thought keys were expected to be significantly longer than qw/foo bar aaaaaa/, etc. (genetic sequences?). That would mean multi-GB input files consisting mostly of keys; just keeping the keys in RAM is out of the question, not to mention building and keeping hashes and working with them.
I thought about HQ hashing (xxHash?) of keys and sparsely storing (Judy?) values indexed by produced integer, where values are e.g. 64-bit-packed integer comprised of
- file id
- offset of start of line containing unique key first seen (tell)
- count (updated as files are read in)
After all files are consumed, value positions within the array (i.e. indexes) are no longer important. IF the densely-packed array data (i.e. with zero values discarded) fits RAM, then the problem is solved. Sort the packed data and produce output, which, yes, would require randomly reading lines (i.e. the real keys) from the input files AGAIN, based on stored file id and line position.
llil2d start
get_properties : 10 secs
sort + output : 21 secs
total : 31 secs
2317524 Kbytes of RAM were used
(a report about resident RAM size was produced with same few lines of code as in script below)
In fact, "words" are so short (and few) that I realized (later, after I wrote 11148660 with results there) that I don't need any indirection and can simply use Judy::HS: "words" are kept in RAM all the time, both in my in-memory flat "DB" and, opaquely, somewhere in Judy code.
And then there's solution in Perl, which is both faster and requires much less memory, and generates identical output:
my_test start
get_properties : 13 secs
sort + output : 7 secs
total : 20 secs
349124 Kbytes of RAM were used
A short integer for the count is not required in this example; it only demonstrates that the count can be any length, not limited to a byte as in my previous "solution" in this thread. The same goes for the 10 bytes the "words" are padded to: it can be anything that fits the longest word; words can differ in length.
use strict;
use warnings;
use Judy::HS qw/ Set Get Free /;
use Sort::Packed 'sort_packed';
my $DATA_TEMPLATE = 'nZ10';
my $DATA_SIZE = 12;
my $COUNT_SIZE_BYTES = 2;
my $COUNT_SIZE_BITS = 16;
my $COUNT_MAX = int( 2 ** $COUNT_SIZE_BITS - 1 );
@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;
warn "my_test start\n";
my $tstart1 = time;
my ( $data, $current ) = ( '', 0 );
my $judy;
for my $fname ( @llil_files ) {
open( my $fh, '<', $fname ) or die $!;
while ( <$fh> ) {
chomp;
my ( $word, $count ) = split /\t/;
( undef, my $val ) = Get( $judy, $word );
if ( defined $val ) {
vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES,
$COUNT_SIZE_BITS ) -= $count
}
else {
$data .= pack $DATA_TEMPLATE, $COUNT_MAX - $count, $word;
Set( $judy, $word, $current );
$current ++
}
}
}
Free( $judy );
my $tend1 = time;
warn "get_properties : ", $tend1 - $tstart1, " secs\n";
my $tstart2 = time;
sort_packed "C$DATA_SIZE", $data;
while ( $data ) {
my ( $count, $word ) = unpack $DATA_TEMPLATE, substr $data, 0, $DATA_SIZE, '';
printf "%s\t%d\n", $word, $COUNT_MAX - $count
}
my $tend2 = time;
warn "sort + output : ", $tend2 - $tstart2, " secs\n";
warn "total : ", $tend2 - $tstart1, " secs\n";
use Memory::Usage;
my $m = Memory::Usage-> new;
$m-> record;
warn $m-> state-> [0][3], " Kbytes of RAM were used\n";
What if "words" are significantly longer? With approx. 10e6 unique words in this test, if they were each hundreds of bytes, then several GB of RAM would be used just to keep them. Perhaps impractical.
So I'm returning to my idea of keeping only hashes and offsets within files. Results I posted in 11148660 are of course valid, I was using 64-bit hashes as Judy::L integer keys.
In fact, 32-bit hashes generated with xxHash have a few collisions for 10e6 6-character words. 64-bit hashes have no collisions. I'm not qualified to predict how safe it is to expect no collisions for a set of a given size. Maybe the 128-bit hashes, below, are overkill.
Just for entertainment I decided to write "indirect" solution with 128-bit hashes. Therefore I'm not using Judy::L, but same Judy::HS with keys being 32-char hex-encoded hashes. Otherwise, almost same plan and code. Data template layout chosen arbitrarily and can be adjusted.
Of course, words are not alphabetically sorted in output, but OTOH it was not original LLIL requirement. Results:
my_test start
get_properties : 21 secs
sort + output : 23 secs
total : 44 secs
841880 Kbytes of RAM were used
I think the same amount of RAM would be used if words were not 6 but 600 or 6000 bytes. The relatively fast 2nd phase (considering the huge number of random reads) is due to NVMe storage here.
(Off topic: I managed to install Crypt::xxHash under Windows/Strawberry, but Judy was too much of a challenge for me. I wrote a solution very close to the code below, using not Judy but a very thin Inline::C wrapper for an AVL library (google for jsw_avltree). It uses approx. the same RAM and ~2x the time for the 1st phase, but also ~2.5x the time for the 2nd phase, which is exactly the same code on the same hardware (dual boot). Don't know if Windows I/O is just that much slower.)
use strict;
use warnings;
use Judy::HS qw/ Set Get Free /;
use Crypt::xxHash 'xxhash3_128bits_hex';
use Sort::Packed 'sort_packed';
my $DATA_TEMPLATE = 'nnNn'; # word count
# file index
# word position
# word length
my $DATA_SIZE = 10;
my $COUNT_SIZE_BYTES = 2;
my $COUNT_SIZE_BITS = 16;
my $COUNT_MAX = int( 2 ** $COUNT_SIZE_BITS - 1 );
@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;
warn "my_test start\n";
my $tstart1 = time;
my ( $data, $current ) = ( '', 0 );
my $judy;
for my $idx ( 0 .. $#llil_files ) {
open( my $fh, '<', $llil_files[ $idx ]) or die $!;
until ( eof $fh ) {
my $pos = tell $fh;
$_ = <$fh>;
chomp;
my ( $word, $count ) = split /\t/;
my $xx = xxhash3_128bits_hex( $word, 0 );
( undef, my $val ) = Get( $judy, $xx );
if ( defined $val ) {
vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES,
$COUNT_SIZE_BITS ) -= $count
}
else {
$data .= pack $DATA_TEMPLATE,
$COUNT_MAX - $count,
$idx,
$pos,
length $word;
Set( $judy, $xx, $current );
$current ++
}
}
}
Free( $judy );
my $tend1 = time;
warn "get_properties : ", $tend1 - $tstart1, " secs\n";
my $tstart2 = time;
sort_packed "C$DATA_SIZE", $data;
my @fh;
open $fh[ $_ ], '<', $llil_files[ $_ ] for 0 .. $#llil_files;
while ( $data ) {
my ( $count, $idx, $pos, $len )
= unpack $DATA_TEMPLATE, substr $data, 0, $DATA_SIZE, '';
sysseek $fh[ $idx ], $pos, 0;
sysread $fh[ $idx ], my( $word ), $len;
printf "%s\t%d\n", $word, $COUNT_MAX - $count
}
my $tend2 = time;
warn "sort + output : ", $tend2 - $tstart2, " secs\n";
warn "total : ", $tend2 - $tstart1, " secs\n";
use Memory::Usage;
my $m = Memory::Usage-> new;
$m-> record;
warn $m-> state-> [0][3], " Kbytes of RAM were used\n";
$ diff llil_judyhs1.pl llil_judyhs2.pl
7,10c7,10
< my $DATA_TEMPLATE = 'nZ10';
< my $DATA_SIZE = 12;
< my $COUNT_SIZE_BYTES = 2;
< my $COUNT_SIZE_BITS = 16;
---
> my $DATA_TEMPLATE = 'NZ20';
> my $DATA_SIZE = 24;
> my $COUNT_SIZE_BYTES = 4;
> my $COUNT_SIZE_BITS = 32;
$ time perl llil_judyhs1.pl big1.txt big2.txt big3.txt >out1.txt
my_test start
get_properties : 10 secs
sort + output : 5 secs
total : 15 secs
353468 Kbytes of RAM were used
real 0m14.770s
user 0m14.669s
sys 0m0.084s
$ time perl llil_judyhs2.pl big1.txt big2.txt big3.txt >out2.txt
my_test start
get_properties : 10 secs
sort + output : 5 secs
total : 15 secs
473784 Kbytes of RAM were used
real 0m15.073s
user 0m14.938s
sys 0m0.119s
Thanks anonymonk. Excellent work!
Though I've never used them, I've heard good things about Judy Arrays and maintain a list of references on them at PM.
Might get around to actually using them one day. :)
What if "words" are significantly longer? With approx. 10e6 unique words in this test, if they were each hundreds of bytes,
then several GB of RAM would be used just to keep them. Perhaps impractical.
Good question! Apologies, my initial test file generator was very primitive.
To try to help answer your question I've quickly whipped up a test file generator that
generates much longer keys (up to around 200 characters in length) and longer counts too.
I was conservative with the counts because I didn't want to disqualify folks using 32-bit ints.
# gen-long-llil.pl
# Crude program to generate a LLiL test file with long names and counts
# perl gen-long-llil.pl long1.txt 600
use strict;
use warnings;
use autodie;
{
my $ordmin = ord('a');
my $ordmax = ord('z') + 1;
# Generate a random word
sub gen_random_word {
my $word = shift; # word prefix
my $nchar = shift; # the number of random chars to append
for my $i (1 .. $nchar) {
$word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
}
return $word;
}
}
my $longworda = join '', 'a' .. 'z';
my $longwordz = join '', reverse('a' .. 'z');
my $longcount = 1_000_000;
sub create_long_test_file {
my $fname = shift;
my $howmany = shift;
open( my $fh_out, '>', $fname );
# Some with no randomness
for my $h ( 1 .. $howmany ) {
for my $i ( 1 .. 8 ) {
my $cnt = $longcount + $i - 1;
my $worda = $longworda x $i;
my $wordz = $longwordz x $i;
print {$fh_out} "$worda\t$cnt\n$wordz\t$cnt\n";
}
}
# Some with randomness
my $wordlen = 1;
for my $h ( 1 .. $howmany ) {
for my $i ( 1 .. 8 ) {
my $cnt = $longcount + $i - 1;
my $worda = $longworda x $i;
my $wordz = $longwordz x $i;
for my $c ( 'a' .. 'z' ) {
for my $z ( 1 .. 2 ) {
print {$fh_out} $worda . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
print {$fh_out} $wordz . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
}
}
}
}
}
my $outfile = shift;
my $count = shift;
$outfile or die "usage: $0 outfile count\n";
$count or die "usage: $0 outfile count\n";
$count =~ /^\d+$/ or die "error: count '$count' is not a number\n";
print "generating short long test file '$outfile' with count '$count'\n";
create_long_test_file( $outfile, $count );
print "file size=", -s $outfile, "\n";
I ran it like this:
> perl gen-long-llil.pl long1.txt 600
generating short long test file 'long1.txt' with count '600'
file size=65616000
> perl gen-long-llil.pl long2.txt 600
generating short long test file 'long2.txt' with count '600'
file size=65616000
> perl gen-long-llil.pl long3.txt 600
generating short long test file 'long3.txt' with count '600'
file size=65616000
Then reran my two biggish benchmarks with a mixture of files:
> perl llil2d.pl big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >perl2.tmp
llil2d start
get_properties : 11 secs
sort + output : 23 secs
total : 34 secs
> llil2a big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >cpp2.tmp
llil2 start
get_properties : 6 secs
sort + output : 5 secs
total : 11 secs
> diff cpp2.tmp perl2.tmp
Improved test file generators welcome.
Updated Test File Generators
These were updated to allow a "\n" (rather than "\r\n") on Windows
after this was pointed out here.
Curiously, \n seems to be slower than \r\n on Windows if you don't set binmode!
I am guessing that chomp is slower with \n than with \r\n on a Windows text stream.
gen-llil.pl
# gen-llil.pl
# Crude program to generate a big LLiL test file to use in benchmarks
# On Windows running:
# perl gen-llil.pl big2.txt 200 3 - produces a test file with size = 35,152,000 bytes
# (lines terminated with "\r\n")
# perl gen-llil.pl big2.txt 200 3 1 - produces a test file with size = 31,636,800 bytes
# (lines terminated with "\n")
# On Unix, lines are terminated with "\n" and the file size is always 31,636,800 bytes
use strict;
use warnings;
use autodie;
{
my $ordmin = ord('a');
my $ordmax = ord('z') + 1;
# Generate a random word
sub gen_random_word {
my $word = shift; # word prefix
my $nchar = shift; # the number of random chars to append
for my $i (1 .. $nchar) {
$word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
}
return $word;
}
}
sub create_test_file {
my $fname = shift;
my $count = shift;
my $wordlen = shift;
my $fbin = shift;
open( my $fh_out, '>', $fname );
$fbin and binmode($fh_out);
for my $c ( 'aaa' .. 'zzz' ) {
for my $i (1 .. $count) {
print {$fh_out} gen_random_word( $c, $wordlen ) . "\t" . 1 . "\n";
}
}
}
my $outfile = shift;
my $count = shift;
my $wordlen = shift;
my $fbin = shift;    # default is to use text stream (not a binary stream)
defined($fbin) or $fbin = 0;
$outfile or die "usage: $0 outfile count wordlen\n";
$count or die "usage: $0 outfile count wordlen\n";
print "generating test file '$outfile' with count '$count' (binmode=$fbin)\n";
create_test_file($outfile, $count, $wordlen, $fbin);
print "file size=", -s $outfile, "\n";
gen-long-llil.pl
# gen-long-llil.pl
# Crude program to generate a LLiL test file with long names and counts
# perl gen-long-llil.pl long1.txt 600
# On Windows running:
# perl gen-long-llil.pl long1.txt 600 - produces a test file with size = 65,616,000 bytes
# (lines terminated with "\r\n")
# perl gen-long-llil.pl long1.txt 600 1 - produces a test file with size = 65,107,200 bytes
# (lines terminated with "\n")
# On Unix, lines are terminated with "\n" and the file size is always 65,107,200 bytes
use strict;
use warnings;
use autodie;

{
    my $ordmin = ord('a');
    my $ordmax = ord('z') + 1;

    # Generate a random word
    sub gen_random_word {
        my $word  = shift;    # word prefix
        my $nchar = shift;    # the number of random chars to append
        for my $i (1 .. $nchar) {
            $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
        }
        return $word;
    }
}

my $longworda = join '', 'a' .. 'z';
my $longwordz = join '', reverse('a' .. 'z');
my $longcount = 1_000_000;

sub create_long_test_file {
    my $fname   = shift;
    my $howmany = shift;
    my $fbin    = shift;
    open( my $fh_out, '>', $fname );
    $fbin and binmode($fh_out);

    # Some with no randomness
    for my $h ( 1 .. $howmany ) {
        for my $i ( 1 .. 8 ) {
            my $cnt   = $longcount + $i - 1;
            my $worda = $longworda x $i;
            my $wordz = $longwordz x $i;
            print {$fh_out} "$worda\t$cnt\n$wordz\t$cnt\n";
        }
    }

    # Some with randomness
    my $wordlen = 1;
    for my $h ( 1 .. $howmany ) {
        for my $i ( 1 .. 8 ) {
            my $cnt   = $longcount + $i - 1;
            my $worda = $longworda x $i;
            my $wordz = $longwordz x $i;
            for my $c ( 'a' .. 'z' ) {
                for my $z ( 1 .. 2 ) {
                    print {$fh_out} $worda . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
                    print {$fh_out} $wordz . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
                }
            }
        }
    }
}

my $outfile = shift;
my $count   = shift;
my $fbin    = shift;    # default is to use text stream (not a binary stream)
defined($fbin) or $fbin = 0;
$outfile or die "usage: $0 outfile count\n";
$count   or die "usage: $0 outfile count\n";
$count =~ /^\d+$/ or die "error: count '$count' is not a number\n";
print "generating short long test file '$outfile' with count '$count' (binmode=$fbin)\n";
create_long_test_file( $outfile, $count, $fbin );
print "file size=", -s $outfile, "\n";
Updated this node with new test file generators so you can generate test files that are the same size on Unix and Windows.
That is, by setting $fbin you can make the line ending "\n" on Windows, instead of "\r\n".
See Re^2: Rosetta Code: Long List is Long (faster) for more background.
Re: Rosetta Code: Long List is Long - JudySL code
by marioroy (Prior) on Jan 24, 2023 at 02:50 UTC
It has been a long journey. I'm completing my participation with a JudySL array implementation.
shuffle.pl:
The mini-shuffle script ensures random data for simulating worst case scenario.
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util 'shuffle';
my @arr = shuffle <>;
print @arr;
./shuffle.pl big1.txt >tmp && mv tmp big1.txt
./shuffle.pl big2.txt >tmp && mv tmp big2.txt
./shuffle.pl big3.txt >tmp && mv tmp big3.txt
./shuffle.pl long1.txt >tmp && mv tmp long1.txt
./shuffle.pl long2.txt >tmp && mv tmp long2.txt
./shuffle.pl long3.txt >tmp && mv tmp long3.txt
Running:
The testing consumes one core for get_properties and parallel processing (8 threads) for sorting. The MAX_STR_LEN_L define is commented out.
$ NUM_THREADS=1 ./llil4judy-no6 big{1,2,3}.txt long{1,2,3}.txt >out.txt
llil4judy start
use OpenMP
use boost sort
get properties 2.336 secs
judy to vector 0.252 secs
vector stable sort 0.566 secs
write stdout 0.220 secs
total time 3.375 secs
$ NUM_THREADS=1 ./llil4map-no6-flat big{1,2,3}.txt long{1,2,3}.txt >out.txt
llil4map start
use OpenMP
use boost sort
get properties 1.847 secs
map to vector 0.198 secs
vector stable sort 0.513 secs
write stdout 0.224 secs
total time 2.784 secs
$ NUM_THREADS=1 ./llil4vec-no6 big{1,2,3}.txt long{1,2,3}.txt >out.txt
llil4vec start
use OpenMP
use boost sort
get properties 0.510 secs
sort properties 1.046 secs
vector reduce 0.266 secs
vector stable sort 1.125 secs
write stdout 0.283 secs
total time 3.233 secs
The J script solution from Anonymous Monk.
~/j904/bin/jconsole llil5.ijs big{1,2,3}.txt long{1,2,3}.txt out.txt
Read and parse input: 2.44621
Classify, sum, sort: 4.82435
Format and write output: 2.05906
Total time: 9.32962
llil4judy.cc:
Note: Chuma states, "If it matters – the original lists are sorted, they don't all contain the same words, all numbers are positive integers." For clarity, the lists are sorted by (value descending, word).
Links to file generators and llil4 variants:
gen-llil.pl, gen-long-llil.pl, shuffle.pl
llil4map,
llil4map2,
llil4hmap,
llil4emh,
llil4umap
llil4vec, llil4vec-tbb
llil4judy
April 2024 updates:
llil4map2 is a memory-efficient version that sorts a vector of pointers to the phmap key-value pairs. Sorting takes longer, possibly due to CPU cache misses, but memory consumption is significantly lower.
200mil unique keys   llil4map   llil4map2
------------------   --------   ---------
fixed-length=12      ~ 8.0 GB     6.0 GB
long-strings         17.8 GB     12.3 GB
See also, a more memory efficient version, by Gregory Popovitch, author of the C++ parallel-hashmap library.