PerlMonks  

Rosetta Code: Long List is Long

by eyepopslikeamosquito (Bishop)
on Nov 30, 2022 at 22:27 UTC (#11148465=perlmeditation)

I've long found it fun to implement the same algorithm in different languages, especially Perl and C++ ... and then sit back and reflect on the lessons learned ... so when Long list is long appeared recently, I felt it was short and interesting enough to make an excellent Rosetta code node.

Solutions to this problem must read a number of input LLiL-format files (given as command line arguments) and write a single merged LLiL-format file to stdout. The LLiL-format is described in the comments at the top of llil.pl below.

In the interests of keeping the code as short and fast as possible, you may assume the input LLiL files are well-formed. For example, you don't need to check for and remove leading and trailing whitespace on each line. The sample solutions given below in Perl and C++ should clarify program requirements.

Please feel free to respond away with solutions to this problem in your favourite programming language and to offer suggested improvements to my sample Perl and C++ solutions below.

Perl Solution

Here's my Perl solution, heavily influenced by responses to Long list is long, especially kcott's concise and clear solution:

# llil.pl
# Example run: perl llil.pl tt1.txt tt2.txt >oo1.tmp

use strict;
use warnings;

# ----------------------------------------------------------------------
# LLiL specification
# ------------------
# A LLiL-format file is a text file.
# Each line consists of a lowercase name, a TAB character and a non-negative integer count.
# That is, each line must match : ^[a-z]+\t\d+$
# For example, reading the LLiL-format files, tt1.txt containing:
#   camel\t42
#   pearl\t94
#   dromedary\t69
# and tt2.txt containing:
#   camel\t8
#   hello\t12345
#   dromedary\t1
# returns this hashref:
#   $hash_ret{"camel"}     = 50
#   $hash_ret{"dromedary"} = 70
#   $hash_ret{"hello"}     = 12345
#   $hash_ret{"pearl"}     = 94
# That is, values are added for items with the same key.
#
# To get the required LLiL text, you must sort the returned hashref
# descending by value and insert a TAB separator:
#   hello\t12345
#   pearl\t94
#   dromedary\t70
#   camel\t50
# To make testing via diff easier, we further sort ascending by name
# for lines with the same value.
# ----------------------------------------------------------------------

# Function get_properties
# Read a list of LLiL-format files
# Return a reference to a hash of properties
sub get_properties {
    my $files = shift;    # in: reference to a list of LLiL-format files
    my %hash_ret;         # out: reference to a hash of properties
    for my $fname ( @{$files} ) {
        open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
        while (<$fh>) {
            chomp;
            my ($word, $count) = split /\t/;
            $hash_ret{$word} += $count;
        }
        close($fh) or die "error: close '$fname': $!";
    }
    return \%hash_ret;
}

# ----------------- mainline ------------------------------------------
@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;
warn "llil start\n";
my $tstart1 = time;
my $href = get_properties( \@llil_files );
my $tend1 = time;
my $taken1 = $tend1 - $tstart1;
warn "get_properties : $taken1 secs\n";
my $tstart2 = time;
for my $key ( sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href} ) {
    print "$key\t$href->{$key}\n";
}
my $tend2 = time;
my $taken2 = $tend2 - $tstart2;
my $taken  = $tend2 - $tstart1;
warn "sort + output  : $taken2 secs\n";
warn "total          : $taken secs\n";

What makes this problem interesting to me is the requirement to sort the hash in descending order by value:

sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href}
because the performance of such a sort may suffer when dealing with huge files (after all, performance was the reason for the OP's question in the first place).

I'm hoping solving this problem in multiple languages will be fun and instructive -- and perhaps give us insight into how performance changes as the number of items increases.

C++ Solution

Here's my C++ solution:

// llil.cpp. C++ 11 version of Perl llil.pl.
// g++ compile on Linux:
//   g++ -o llil -std=c++11 -Wall -O3 llil.cpp
// This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
// Example run: llil tt1.txt tt2.txt >out.txt
// Uncomment next line to sort by creating a multimap (instead of via the sort function)
// #define LLIL_SORT_VIA_MULTIMAP_L 1

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <cstdio>
#include <string>
#include <vector>
#include <set>
#include <map>
#include <unordered_map>
#include <algorithm>
#include <utility>
#include <iostream>
#include <fstream>
#include <sstream>

// ------------------------------------------------------------
// Performance hacks to speed up IO.
// See https://www.reddit.com/r/rust/comments/9xedap/how_to_achieve_fast_stdinstdout_io_suitable_for/
// Avoid flush by using "\n" instead of std::endl (this made a big difference in this program!)
// This one made almost no difference in this program:
//   const auto io_speed_up = [](){
//      std::ios::sync_with_stdio(false);
//      std::cin.tie(nullptr);
//      return nullptr;
//   }();
// ------------------------------------------------------------

using str_int_type     = std::pair<std::string, int>;
using map_str_int_type = std::unordered_map<std::string, int>;
using vec_str_int_type = std::vector<str_int_type>;

// Mimic the Perl get_properties subroutine
static void get_properties(
   int               nfiles,     // in: the number of input files
   char*             fname[],    // in: the input file names
   map_str_int_type& hash_ret)   // out: a hash of properties
{
   for (int i = 0; i < nfiles; ++i) {
      std::ifstream llfile(fname[i]);
      if (!llfile) {
         std::cerr << "Error opening '" << fname[i] << "'\n";
         return;
      }
      for (std::string line; std::getline(llfile, line); ) {
         std::string word, count;
         std::stringstream ssline(line);
         std::getline(ssline, word, '\t');
         std::getline(ssline, count);
         hash_ret[word] += std::stoi(count);
      }
   }
}

#ifdef LLIL_SORT_VIA_MULTIMAP_L
// Convert a std::unordered_map<key, value> to a std::multimap<value, key>
static std::multimap<int, std::string> invert_map(const std::unordered_map<std::string, int>& m)
{
   std::multimap<int, std::string> mm;
   for (std::unordered_map<std::string, int>::const_iterator it = m.begin(); it != m.end(); ++it) {
      mm.insert( std::make_pair(it->second, it->first) );
   }
   return mm;
}
#endif

int main(int argc, char* argv[])
{
   if (argc < 2) {
      std::cerr << "usage: llil file1 file2 ... >out.txt\n";
      return 1;
   }
#ifdef LLIL_SORT_VIA_MULTIMAP_L
   std::cerr << "llil start (multimap version)\n";
#else
   std::cerr << "llil start (sort version)\n";
#endif
   time_t tstart1 = ::time(NULL);

   // Create the hash of properties
   map_str_int_type hash_ret;
   get_properties(argc - 1, &argv[1], hash_ret);
   time_t tend1 = ::time(NULL);
   long taken1 = static_cast<long>(::difftime(tend1, tstart1) + 0.5);
   std::cerr << "get_properties : " << taken1 << " secs\n";

   // Sort descending by value, i.e. mimic this Perl code in C++:
   //   sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href}
   time_t tstart2 = ::time(NULL);
#ifdef LLIL_SORT_VIA_MULTIMAP_L
   std::multimap<int, std::string> newmap = invert_map(hash_ret);
   for (std::multimap<int, std::string>::reverse_iterator it = newmap.rbegin(); it != newmap.rend(); ++it) {
      std::cout << it->second << '\t' << it->first << "\n";
   }
#else
   vec_str_int_type v( hash_ret.begin(), hash_ret.end() );
   std::sort( v.begin(), v.end(),
      [](const str_int_type& left, const str_int_type& right) {
         return right.second != left.second ? right.second < left.second
                                            : left.first < right.first;
      }
   );
   for ( const std::pair<const std::string, int>& n : v ) {
      std::cout << n.first << '\t' << n.second << "\n";
   }
#endif
   time_t tend2 = ::time(NULL);
   long taken2 = static_cast<long>(::difftime(tend2, tstart2) + 0.5);
   long taken  = static_cast<long>(::difftime(tend2, tstart1) + 0.5);
   std::cerr << "sort + output  : " << taken2 << " secs\n";
   std::cerr << "total          : " << taken  << " secs\n";
   return 0;
}

This is a straightforward translation of my Perl solution into C++ 11. I chose a C++ std::unordered_map because it seems the closest container to a Perl hash.

Unlike Perl, you can't sort a std::unordered_map on the fly in a one-liner. So I tried two different ways to do it:

  • Copy the std::unordered_map to a std::vector and sort it via the std::sort algorithm
  • Copy the std::unordered_map<key, value> to a std::multimap<value, key> (see std::multimap)
Alternative suggestions welcome.

Testing Correctness and Measuring Performance

To get a feel for performance and correctness, I used this simple and crude Perl program gen-llil.pl:

# gen-llil.pl
# Crude program to generate a big LLiL test file to use in benchmarks
# perl gen-llil.pl big2.txt 200 3 - produces a test file with size = 35,152,000 bytes

use strict;
use warnings;
use autodie;

{
    my $ordmin = ord('a');
    my $ordmax = ord('z') + 1;

    # Generate a random word
    sub gen_random_word {
        my $word  = shift;   # word prefix
        my $nchar = shift;   # the number of random chars to append
        for my $i (1 .. $nchar) {
            $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
        }
        return $word;
    }
}

sub create_test_file {
    my $fname   = shift;
    my $count   = shift;
    my $wordlen = shift;
    open( my $fh_out, '>', $fname );
    for my $c ( 'aaa' .. 'zzz' ) {
        for my $i (1 .. $count) {
            print {$fh_out} gen_random_word( $c, $wordlen ) . "\t" . 1 . "\n";
        }
    }
}

my $outfile = shift;
my $count   = shift;
my $wordlen = shift;
$outfile or die "usage: $0 outfile count wordlen\n";
$count   or die "usage: $0 outfile count wordlen\n";
print "generating test file '$outfile' with count '$count'\n";
create_test_file($outfile, $count, $wordlen);
print "file size=", -s $outfile, "\n";

I generated three biggish test files (size 35,152,000 bytes) with:

perl gen-llil.pl big1.txt 200 3
perl gen-llil.pl big2.txt 200 3
perl gen-llil.pl big3.txt 200 3
then ran the Perl version:
perl llil.pl big1.txt big2.txt big3.txt >perl.tmp
and the C++ version:
llil big1.txt big2.txt big3.txt >cpp.tmp
and verified they produced the same result:
diff perl.tmp cpp.tmp

Performance Results

To get the ball rolling, here are my initial performance results for the Perl and C++ solutions given above running Strawberry Perl on my Windows Laptop PC with 8 GB RAM.

Perl (max Windows Private Bytes for the Perl process was about 2,236,648K)

> perl llil.pl big1.txt big2.txt big3.txt >perl.tmp
llil start
get_properties : 11 secs
sort + output  : 74 secs
total          : 85 secs

C++ (max Windows Private Bytes for the llil process was about 1,176,048K)

> llil big1.txt big2.txt big3.txt >cpp.tmp
llil start (sort version)
get_properties : 9 secs
sort + output  : 7 secs
total          : 16 secs

Verify correctness:

> diff cpp.tmp perl.tmp

I'd also find it interesting and educational to compare solutions to this little problem in a variety of languages. So please feel free to respond away in your favourite language!

Some Other Rosetta Code Nodes

Updated: Fixed a line that had been left commented out for debugging in llil.pl, one minute after posting. :)

Re: Rosetta Code: Long List is Long
by choroba (Cardinal) on Nov 30, 2022 at 23:25 UTC
    As usual, if speed is your concern, you can trade it for space.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    warn "start\n";
    my $tstart1 = time;

    my (%by_count, %by_word);
    while (<>) {
        chomp;
        my ($k, $v) = split /\t/, $_, 2;
        if (exists $by_word{$k}) {
            $v += $by_word{$k};
            delete $by_count{ $by_word{$k} }{$k};
        }
        undef $by_count{$v}{$k};
        $by_word{$k} = $v;
    }
    my $tend1 = time;
    warn "get properties: ", $tend1 - $tstart1, " secs\n";

    my $tstart2 = time;
    for my $count (sort { $b <=> $a } keys %by_count) {
        say "$_\t$count" for sort keys %{ $by_count{$count} };
    }
    my $tend2 = time;
    warn "sort + output: ", $tend2 - $tstart2, " secs\n";
    warn "total: ", $tend2 - $tstart1, " secs\n";

    Comparison?

    llil start
    get_properties : 13 secs
    sort + output  : 85 secs
    total          : 98 secs

    start
    get properties: 21 secs
    sort + output: 25 secs
    total: 46 secs
    The output is identical.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      Excellent work choroba! On my machine I see:

      get properties: 17 secs
      sort + output: 18 secs
      total: 35 secs
      more than twice as fast as my original Perl version of 85 secs ... with Windows Private Bytes increasing from 2,236,648K to 3,516,476K.

      The following is my fun spin using dualvar, based on choroba.pl. I tried to minimize memory consumption by using one hash and one array. The memory consumption is similar to llil.pl and performance is slightly faster than choroba.pl.

      See also, parallel solution update.

      #!/usr/bin/perl
      use warnings;
      use strict;
      use feature qw{ say };
      use Scalar::Util qw{ dualvar };

      warn "start\n";
      my $tstart1 = time;

      our (%by_word, @data);
      while (<>) {
          chomp;
          my ($k, $v) = split /\t/, $_;
          $by_word{$k} += $v;
      }
      my $tend1 = time;
      warn "get properties: ", $tend1 - $tstart1, " secs\n";

      my $tstart2 = time;
      while (my ($k, $v) = each %by_word) {
          push @data, dualvar($v, $k);
      }
      # output array of dualvars; sorted by number, string
      say "$_\t".(0+$_) for sort { $b <=> $a } sort { $a cmp $b } @data;
      my $tend2 = time;
      warn "sort + output: ", $tend2 - $tstart2, " secs\n";
      warn "total: ", $tend2 - $tstart1, " secs\n";
        Interesting!

        What really surprised me was how slow it became once I tried to sort the data just once:

        sort { $b <=> $a || $a cmp $b } @data;
        Before:
        sort + output: 19 secs
        After:
        sort + output: 32 secs


      On macOS, exiting the script requires 15 additional seconds. This is a great test case for comparing "my" vs "our".

      my (%by_count, %by_word);   # slow cleanup/exiting

      $ time perl choroba.pl big1.txt big2.txt big3.txt >cpp.tmp
      start
      get properties: 21 secs
      sort + output: 25 secs
      total: 46 secs

      real 1m2.072s
      user 1m0.182s
      sys  0m1.885s

      our (%by_count, %by_word);   # fast cleanup/exiting

      $ time perl choroba.pl big1.txt big2.txt big3.txt >cpp.tmp
      start
      get properties: 21 secs
      sort + output: 25 secs
      total: 46 secs

      real 0m47.062s
      user 0m45.505s
      sys  0m1.549s
        What Perl version do you run? The final garbage collecting takes about 6 secs on my Linux machine in both 5.26.1 and blead.

Re: Rosetta Code: Long List is Long -- parallel
by marioroy (Parson) on Dec 01, 2022 at 06:37 UTC

    Greetings, eyepopslikeamosquito

    > Please feel free to respond away with solutions...

    I made a parallel demonstration. The results are taken from a 32-core Linux box. On Windows, run Cygwin's Perl for best performance.

    Edit1: Updated parallel code; now passing less than 26 workers.
    Edit2: Added results for max_workers => 4, 7, 13, and 26.
    Edit3: Minimized memory consumption; one hash, one array of dualvars.
    Edit4: Replaced regular expression with substr($_,0,1) eq $char.

    Results

    # Modification: our (%by_count, %by_word);
    $ time perl choroba_mod.pl big1.txt big2.txt big3.txt >oo1.txt
    start
    get properties: 9 secs
    sort + output: 14 secs
    total: 23 secs

    real 0m 23.083s
    user 0m 22.568s
    sys  0m 0.491s

    # Run parallel on physical cores: max_workers => 4
    $ time taskset -c 0-31 perl mce.pl big1.txt big2.txt big3.txt >oo2.txt
    start
    get properties + pre-sort: 14 secs
    final sort + output: 2 secs
    total: 16 secs

    real 0m 15.434s
    user 0m 52.223s
    sys  0m 0.824s

    # Run parallel on physical cores: max_workers => 7
    $ time taskset -c 0-31 perl mce.pl big1.txt big2.txt big3.txt >oo3.txt
    start
    get properties + pre-sort: 8 secs
    final sort + output: 2 secs
    total: 10 secs

    real 0m 10.429s
    user 0m 53.032s
    sys  0m 0.982s

    # Run parallel on physical cores: max_workers => 13
    $ time taskset -c 0-31 perl mce.pl big1.txt big2.txt big3.txt >oo4.txt
    start
    get properties + pre-sort: 6 secs
    final sort + output: 1 secs
    total: 7 secs

    real 0m 7.262s
    user 0m 53.556s
    sys  0m 1.259s

    # Run parallel on physical cores: max_workers => 26
    $ time taskset -c 0-31 perl mce.pl big1.txt big2.txt big3.txt >oo5.txt
    start
    get properties + pre-sort: 4 secs
    final sort + output: 2 secs
    total: 6 secs

    real 0m 6.420s
    user 0m 55.158s
    sys  0m 1.413s

    # Verify correctness:
    $ diff cpp.tmp oo1.txt   # choroba
    $ diff cpp.tmp oo2.txt   # MCE 4 workers
    $ diff cpp.tmp oo3.txt   # MCE 7 workers
    $ diff cpp.tmp oo4.txt   # MCE 13 workers
    $ diff cpp.tmp oo5.txt   # MCE 26 workers

    Parallel code

    Basically, workers process and gather letters "a" through "z" in order. I was hoping to gather an array of dualvars, but forgot that serialization removes the numeric part.

Re: Rosetta Code: Long List is Long
by Tux (Canon) on Dec 01, 2022 at 09:18 UTC

    Not a real contestant, but a different approach using TSV/XS (just because I could not resist):

    $ cat -te tt1.txt
    camel^I42$
    pearl^I94$
    dromedary^I69$
    $ cat -te tt2.txt
    camel^I8$
    hello^I12345$
    dromedary^I1$
    $ perl -MText::CSV_XS=csv -wE'csv(in=>*ARGV,sep=>"\t",out=>undef,on_in=>sub{$x{$_[1][0]}+=$_[1][1]});printf"%-20s %6d\n",$_,$x{$_}for sort keys%x' tt?.txt
    camel                    50
    dromedary                70
    hello                 12345
    pearl                    94

    Enjoy, Have FUN! H.Merijn
Re: Rosetta Code: Long List is Long -- oneliner
by Discipulus (Abbot) on Dec 01, 2022 at 08:00 UTC
    Hello eyepopslikeamosquito,

    > So please feel free to respond away in your favourite language!

    If oneliner is a language..

    cat lill01.txt
    camel 42
    pearl 94
    dromedary 69

    cat lill02.txt
    camel 8
    hello 12345
    dromedary 1
    perl 94

    perl -lane "$r{$F[0]}+=$F[1]}{print qq($_\t$r{$_})for sort{$r{$b}<=>$r{$a}||$a cmp $b}keys %r" lill01.txt lill02.txt

    # or shortened by one char because perl -a implies -n
    perl -lae "$r{$F[0]}+=$F[1]}{print qq($_\t$r{$_})for sort{$r{$b}<=>$r{$a}||$a cmp $b}keys %r" lill01.txt lill02.txt
    hello 12345
    pearl 94
    perl 94
    dromedary 70
    camel 50

    Deparsed for newcomers..

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Rosetta Code: Long List is Long -- Python
by parv (Vicar) on Dec 02, 2022 at 10:26 UTC

    Environment: FreeBSD stable/13 (amd64) as guest in VirtualBox 6.1.30 on Windows 10 host; 2 CPUs & ~8 GB of RAM are allocated to the VM on ThinkPad X260 (Intel i5-6500U CPU, 16 GB RAM). perl is 5.32.1; python is 3.10.8.

    I may attempt a Rust version, which I am currently learning.

    Your Perl version takes ~7-8 minutes (its time to collect data is about the same as my Python version's). Mind that the time values are from just one run each.

    llil start
    get_properties : 44 secs
    sort + output  : 429 secs
    total          : 473 secs

    From choroba's attempt ...

    start
    get properties: 60 secs
    sort + output: 63 secs
    total: 123 secs

    From my Python attempt (generally takes ~3-4 minutes) ...

    # Manually rounded off after posting.
    collect time             :  37.6 s
    sort_via_cmp_to_key time : 115.1 s
    sort+format time         : 115.1 s
    total processing time    : 152.7 s
    output time              :  22.0 s
    total time               : 174.7 s

    ... most of the time savings (~10-20 s) over my earlier Python attempts came from ...

    • printing the output as one string instead of printing each line in a loop;
    • sorting the dict via keys instead of sorting an iterator of tuples created from the "dict";
    • and using subtraction to compare the numbers instead of explicitly returning one of -1, 0, 1 based on [><=] comparison.

    Later I realized there was one extraneous loop; removed now, but the patch keeps the history.

    ... said patch (this filler text is to better separate from preceding code portion) ...

      This is neat, parv. I compared Python 3.10.7 against Pyston-lite and Pyston.

      Python:

      $ python3 llil.py big1.txt big2.txt big3.txt > out1.txt
      collect time            :  5.2 s
      sort_via_cmp_to_key time: 17.7 s
      sort+format time        : 17.7 s
      total processing time   : 22.9 s
      output time             :  3.5 s
      total time              : 26.4 s

      Python with Pyston-lite loaded automatically:

      # Install pyston_lite_autoload to ~/.local/lib/python3.10/site-packages/.
      $ python3 -m pip install --user pyston_lite_autoload
      $ python3 llil.py big1.txt big2.txt big3.txt > out2.txt
      collect time            :  4.4 s
      sort_via_cmp_to_key time: 15.8 s
      sort+format time        : 15.8 s
      total processing time   : 20.2 s
      output time             :  3.2 s
      total time              : 23.4 s

      Pyston 2.3.5:

      I also tried Pyston 2.3.5. But first, I needed to modify the function declarations to resolve "TypeError: 'type' object is not subscriptable".

      $ diff llil.py llil2.py
      53c53
      < def collect( data_list :list ) ->dict[ str, int ]:
      ---
      > def collect( data_list :list ) ->dict:
      87c87
      < def process( cat_count :dict[ str, int ] ) ->Generator[ str, None, None ]:
      ---
      > def process( cat_count :dict ) ->Generator:
      110c110
      < def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->list[ str ]:
      ---
      > def sort_via_cmp_to_key( cat_count :dict ) ->list:
      $ ./pyston_2.3.5/bin/pyston3 llil2.py big1.txt big2.txt big3.txt > out3.txt
      collect time            :  3.7 s
      sort_via_cmp_to_key time: 12.2 s
      sort+format time        : 12.2 s
      total processing time   : 15.9 s
      output time             :  3.0 s
      total time              : 18.8 s

      Look at Pyston go; taking Python performance to a new level :) How does Python/Pyston compare to Perl? I ran the dualvar demonstration for comparison.

      $ perl dualvar.pl big1.txt big2.txt big3.txt >out4.txt
      start
      get properties: 6 secs
      sort + output: 16 secs
      total: 22 secs

      The Perl code consumes around 2340 MB versus 1790 MB for Python.

      See also:

      Python 3.12 Goals

      Python is about to get faster

      The Pyston Blog

      Numba

        In short, could you do another Python and Pyston* run with the sort function replaced with ...

        def sort_native( cat_count ):
            once = sorted( cat_count.keys() )
            return sorted( once, key = lambda k: cat_count[ k ], reverse = True )

        ...?

        In long, before eyepopslikeamosquito posted ...

        sort { $href->{$b} <=> $href->{$a} } sort keys %{$href}

        ... I had missed noticing the sorting order. I decided to do the same for the Python (native) version instead of using the functools.cmp_to_key function. I also realized that I had been sorting the other way around (sorting keys by value, followed by sorting by key), so was not getting the expected output (which is what made me use cmp_to_key instead).

        Replacing sort_via_cmp_to_key with ...

        def sort_native( cat_count :dict[ str, int ] ) ->list[ str ]:
            """
            Returns:
              A `list` of keys sorted by decreasing order of values & increasing order of keys.

            Args:
              cat_count: A `dict` with string key & integer value.
            """
            once = sorted( cat_count.keys() )
            return sorted( once, key = lambda k: cat_count[ k ], reverse = True )

        ... reduces the sort time by a factor of ~10 (to ~11-16 s; it also produces the expected output) in my run environment.

        time passes, so slowly🎶 ... Putting the complete program here (~28-35 s) ...

Re: Rosetta Code: Long List is Long (Updated Solutions)
by eyepopslikeamosquito (Bishop) on Dec 05, 2022 at 09:32 UTC

    As you might expect, I've created improved versions of my original Perl and C++ solutions posted in the root node. Since I'm all out of fresh ideas to try, I'll post my improved solutions here. Further performance improvement suggestions are welcome.

    Improved Perl Version llil2.pl

    # llil2.pl
    # Example run: perl llil2.pl tt1.txt tt2.txt tt3.txt >out.txt

    use strict;
    use warnings;

    # ----------------------------------------------------------------------
    # LLiL specification
    # ------------------
    # A LLiL-format file is a text file.
    # Each line consists of a lowercase name, a TAB character and a non-negative integer count.
    # That is, each line must match : ^[a-z]+\t\d+$
    # For example, reading the LLiL-format files, tt1.txt containing:
    #   camel\t42
    #   pearl\t94
    #   dromedary\t69
    # and tt2.txt containing:
    #   camel\t8
    #   hello\t12345
    #   dromedary\t1
    # returns this hashref:
    #   $hash_ret{"camel"}     = 50
    #   $hash_ret{"dromedary"} = 70
    #   $hash_ret{"hello"}     = 12345
    #   $hash_ret{"pearl"}     = 94
    # That is, values are added for items with the same key.
    #
    # To get the required LLiL text, you must sort the returned hashref
    # descending by value and insert a TAB separator:
    #   hello\t12345
    #   pearl\t94
    #   dromedary\t70
    #   camel\t50
    # To make testing via diff easier, we further sort ascending by name
    # for lines with the same value.
    # ----------------------------------------------------------------------

    # Function get_properties
    # Read a list of LLiL-format files
    # Return a reference to a hash of properties
    sub get_properties {
        my $files = shift;    # in: reference to a list of LLiL-format files
        my %hash_ret;         # out: reference to a hash of properties
        for my $fname ( @{$files} ) {
            open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
            while (<$fh>) {
                chomp;
                my ($word, $count) = split /\t/;
                $hash_ret{$word} += $count;
            }
            close($fh) or die "error: close '$fname': $!";
        }
        return \%hash_ret;
    }

    # ----------------- mainline ------------------------------------------
    @ARGV or die "usage: $0 file...\n";
    my @llil_files = @ARGV;
    warn "llil2 start\n";
    my $tstart1 = time;
    my $href = get_properties( \@llil_files );
    my $tend1 = time;
    my $taken1 = $tend1 - $tstart1;
    warn "get_properties : $taken1 secs\n";
    my $tstart2 = time;
    # Using two sorts is waaay faster than one in Perl for some reason! (see [id://11148545])
    for my $key ( sort { $href->{$b} <=> $href->{$a} } sort keys %{$href} ) {
        print "$key\t$href->{$key}\n";
    }
    my $tend2 = time;
    my $taken2 = $tend2 - $tstart2;
    my $taken  = $tend2 - $tstart1;
    warn "sort + output  : $taken2 secs\n";
    warn "total          : $taken secs\n";

    Improved C++ Version llil2.cpp

    // llil2.cpp. C++ 11 version of Perl llil.pl.
    // llil2.cpp is faster than llil.cpp while also clarifying limits:
    //   - all keys should be less than 200 or so characters in length
    //   - numbers are 64 bit integers (max: 9,223,372,036,854,775,807)
    // g++ compile on Linux:
    //   g++ -o llil2 -std=c++11 -Wall -O3 llil2.cpp
    // This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
    // that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
    // Example run: llil2 tt1.txt tt2.txt tt3.txt >out.txt

    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <cstdio>
    #include <string>
    #include <vector>
    #include <set>
    #include <map>
    #include <algorithm>
    #include <utility>
    #include <iostream>
    #include <fstream>
    #include <sstream>

    static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");

    // ----------------------------------------------------------------------------
    // Crude hack to see Windows Private Bytes in Task Manager by sleeping at
    // program end (see also sleep hack at end of main)
    // #include <chrono>
    // #include <thread>

    // For some performance hacks to speed up C++ I/O see:
    //   https://www.reddit.com/r/rust/comments/9xedap/how_to_achieve_fast_stdinstdout_io_suitable_for/
    // The only one we use here is to prefer "\n" to std::endl to reduce stdout flushing
    // ----------------------------------------------------------------------------

    typedef long long llil_int_type;
    using str_int_type     = std::pair<std::string, llil_int_type>;
    using map_str_int_type = std::map<std::string, llil_int_type>;
    using vec_str_int_type = std::vector<str_int_type>;

    // Mimic the Perl get_properties subroutine ----------------------------
    // Limit line length and use lower level ANSI C functions to try to boost I/O performance
    // TODO (maybe):
    //   - reading: Try ::setvbuf(fh, NULL, _IOFBF, 65536) or some such on input files
    //   - writing: Try ::setvbuf(stdout, stdout_buf, _IOFBF, sizeof(stdout_buf)) on stdout
    //              ... or instead of writing to stdout, take an output file as a program argument
    #define MAX_LINE_LEN_L 255
    static void get_properties(
       int               nfiles,     // in: the number of input files
       char*             fname[],    // in: the input file names
       map_str_int_type& hash_ret)   // out: a hash of properties
    {
       FILE* fh;
       char line[MAX_LINE_LEN_L+1];
       char* word;
       char* count;
       for (int i = 0; i < nfiles; ++i) {
          fh = ::fopen(fname[i], "r");
          if (fh == NULL) {
             std::cerr << "Error opening '" << fname[i] << "'\n";
             return;
          }
          while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
             word  = ::strtok(line, "\t");
             count = ::strtok(NULL, "\n");
             hash_ret[word] += ::atoll(count);
          }
          ::fclose(fh);
       }
    }

    // ---------------------------------------------------------------------
    int main(int argc, char* argv[])
    {
       if (argc < 2) {
          std::cerr << "usage: llil2 file1 file2 ... >out.txt\n";
          return 1;
       }
       std::cerr << "llil2 start\n";
       time_t tstart1 = ::time(NULL);

       // Create the hash of properties
       map_str_int_type hash_ret;
       get_properties(argc - 1, &argv[1], hash_ret);
       time_t tend1 = ::time(NULL);
       long taken1 = static_cast<long>(::difftime(tend1, tstart1) + 0.5);
       std::cerr << "get_properties : " << taken1 << " secs\n";

       // Sort descending by value, i.e. mimic this Perl code in C++:
       //   sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href}
       time_t tstart2 = ::time(NULL);
       vec_str_int_type v( hash_ret.begin(), hash_ret.end() );
       std::sort( v.begin(), v.end(),
          [](const str_int_type& left, const str_int_type& right) {
             return right.second != left.second ? right.second < left.second
                                                : left.first < right.first;
          }
       );

       // Output the merged properties
       for ( auto const& n : v ) { std::cout << n.first << '\t' << n.second << '\n'; }
       time_t tend2 = ::time(NULL);
       long taken2 = static_cast<long>(::difftime(tend2, tstart2) + 0.5);
       long taken  = static_cast<long>(::difftime(tend2, tstart1) + 0.5);
       std::cerr << "sort + output  : " << taken2 << " secs\n";
       std::cerr << "total          : " << taken  << " secs\n";

       // Hack to see Private Bytes in Windows Task Manager (uncomment next line so process doesn't exit too quickly)
       // std::this_thread::sleep_for(std::chrono::milliseconds(90000000));
       return 0;
    }

    Performance Analysis

    Performance of my second Perl version improved from:

    get_properties : 11 secs
    sort + output  : 74 secs
    total          : 85 secs

    to:

    get_properties : 11 secs
    sort + output  : 25 secs
    total          : 36 secs

    update (much later, in llil2grt.pl) to:

    get_properties : 10 secs
    sort + output  : 20 secs
    total          : 30 secs

    Performance of my second C++ version improved from:

    get_properties : 9 secs
    sort + output  : 7 secs
    total          : 16 secs

    to:

    get_properties : 6 secs
    sort + output  : 6 secs
    total          : 12 secs

    Memory use (Windows Private Bytes) was 2,896,104K for the Perl version (update: 2,657,968K), 1,176,048K for the C++ std::unordered_map version (update: 1,218,720K for std::map).

    Update: Surprisingly, making a one line change in llil2.cpp above from:

    using map_str_int_type = std::unordered_map<std::string, llil_int_type>;
    // to (llil2a.cpp):
    using map_str_int_type = std::map<std::string, llil_int_type>;

    resulted in a significant speed improvement in llil2a (with similar Windows Private Bytes):

    get_properties : 4 secs
    sort + output  : 5 secs
    total          : 9 secs

    My second Perl version improved from 85 seconds to 36 seconds (update: much later to 30 seconds in llil2grt). Incredibly, this spectacular performance improvement is entirely due to a very surprising one line change from:

    sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href}
    to:
    sort { $href->{$b} <=> $href->{$a} } sort keys %{$href}
    When more is known of the reason for this incredible difference, I'll update this node.

    In contrast, I had to work a lot harder to improve the performance of my C++ version, switching in desperation to some ugly lower level ANSI C functions in its get_properties function. For cheap thrills, I also switched from a 32-bit to a 64-bit integer counter llil_int_type. I suspect future minor performance tweaks may come from further improving I/O (e.g. by fiddling with file buffering and buffer sizes).

    As described here, this is by far Perl's best performance so far in my three serious Rosetta nodes:

    • For the simple GoL algorithm, C++ was 12.5 times faster, memory use was 2.8 times less.
    • For the complex GoL algorithm, C++ was 212.5 times faster; memory use was 10.1 times less.
    • For the Long List is Long algorithm (this node), C++ was 3 times faster; memory use 2.2 times less.
    I suspect this is because Long List is Long seems to be mostly I/O bound, while the GoL algorithms were mostly CPU bound.

    Updated: Changed std::unordered_map to std::map in llil2.cpp above.

      Some wiser monk might correct me, but I think this difference is because "simple" blocks such as the following can be internally optimized:

      (edited: typoes corrected after reading the replies)

      my @sorted = sort @unsorted;
      my @sorted = sort { $a <=> $b } @unsorted;
      my @sorted = sort { $a cmp $b } @unsorted;

      All of these are internally optimized to something simple and efficient. Once the block gets more complicated, like having multiple statements and/or multiple expressions, every block has to be wrapped in scopes *for each call* to that sub. This is one of the reasons why the Guttman-Rosler Transform is so efficient.

      my @sorted = sort { $a->{id} <=> $b->{id} || $a->{foo} cmp $b->{foo} } @unsorted;   # SLOW
      my @sorted = map  { $_->[1] }
                   sort { $a->[0] cmp $b->[0] }
                   map  { [ pack("l>A*", $_->{id}, $_->{foo}), $_ ] } @unsorted;          # FAST

      Enjoy, Have FUN! H.Merijn
        This is one of the reasons why the Larry-Rossler is so efficient.

        I think that you are thinking of the Guttman-Rosler Transform?

        my @sorted = sort { $b->{id} <=> $b->{id} || $a->{foo} cmp $b->{foo} } @unsorted; # SLOW

        Comparing $b->{id} to itself will not work very well.

      You may have missed the dualvar solution. Sorting an array of dualvars involves zero hash lookups.

      use Scalar::Util 'dualvar';

      our @data;
      while (my ($k, $v) = each %{$href}) { push @data, dualvar($v, $k); }

      # output array of dualvars; sorted by number, string
      print "$_\t".(0+$_)."\n" for sort { $b <=> $a } sort @data;

      # llil2.pl  sort + output : 19 secs
      # llil3.pl  sort + output : 16 secs  -- dualvar

        I noticed it, but (wrongly) assumed applying its remarkable two sort trick to my original solution would have the same effect.

        For completeness, I created llil2d.pl (shown below) by applying your dualvar array trick to my original llil2.pl two-sort solution, with minimal changes. I can confirm that it is indeed about 3 seconds faster and with slightly lower memory use. Despite using Perl for 20 years, I'd never heard of dualvar before (update: oops, turns out I had :). Huge kudos to marioroy for unearthing this!

        llil2d start
        get_properties : 11 secs
        sort + output  : 22 secs
        total          : 33 secs

        Memory use (Windows Private Bytes): 2,824,184K (slightly lower than 2,896,104K for llil2.pl)

        For completeness, here is my adjusted llil2d.pl:

        # llil2d.pl. Remarkable dualvar version based on [marioroy]'s concoction.
        # Example run: perl llil2d.pl tt1.txt tt2.txt tt3.txt >out.txt
        use strict;
        use warnings;
        use feature qw{say};
        use Scalar::Util qw{dualvar};

        # ----------------------------------------------------------------------
        # LLiL specification
        # ------------------
        # A LLiL-format file is a text file.
        # Each line consists of a lowercase name, a TAB character and a non-negative integer count.
        # That is, each line must match : ^[a-z]+\t\d+$
        # For example, reading the LLiL-format files, tt1.txt containing:
        #   camel\t42
        #   pearl\t94
        #   dromedary\t69
        # and tt2.txt containing:
        #   camel\t8
        #   hello\t12345
        #   dromedary\t1
        # returns this hashref:
        #   $hash_ret{"camel"}     = 50
        #   $hash_ret{"dromedary"} = 70
        #   $hash_ret{"hello"}     = 12345
        #   $hash_ret{"pearl"}     = 94
        # That is, values are added for items with the same key.
        #
        # To get the required LLiL text, you must sort the returned hashref
        # descending by value and insert a TAB separator:
        #   hello\t12345
        #   pearl\t94
        #   dromedary\t70
        #   camel\t50
        # To make testing via diff easier, we further sort ascending by name
        # for lines with the same value.
        # ----------------------------------------------------------------------

        # Function get_properties
        # Read a list of LLiL-format files
        # Return a reference to a hash of properties
        sub get_properties {
            my $files = shift;   # in: reference to a list of LLiL-format files
            my %hash_ret;        # out: reference to a hash of properties
            for my $fname ( @{$files} ) {
                open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
                while (<$fh>) {
                    chomp;
                    my ($word, $count) = split /\t/;
                    $hash_ret{$word} += $count;
                }
                close($fh) or die "error: close '$fname': $!";
            }
            return \%hash_ret;
        }

        # ----------------- mainline -------------------------------------------
        @ARGV or die "usage: $0 file...\n";
        my @llil_files = @ARGV;
        warn "llil2d start\n";
        my $tstart1 = time;
        my $href = get_properties( \@llil_files );
        my $tend1 = time;
        my $taken1 = $tend1 - $tstart1;
        warn "get_properties : $taken1 secs\n";

        my $tstart2 = time;
        my @data;
        while ( my ($k, $v) = each %{$href} ) { push @data, dualvar($v, $k) }

        # Using two sorts is waaay faster than one! (see [id://11148545])
        for my $key ( sort { $b <=> $a } sort @data ) {
            say "$key\t" . (0 + $key);
        }
        my $tend2 = time;
        my $taken2 = $tend2 - $tstart2;
        my $taken = $tend2 - $tstart1;
        warn "sort + output : $taken2 secs\n";
        warn "total : $taken secs\n";

        Update: llil2grt.pl is about three seconds faster than llil2d.pl above, while using slightly less memory.

        References Added Later

        Dualvar:

        Some ideas to try in the future:

        See also:

      After being surprised in the extreme by marioroy's astonishing two-sort dualvar Perl solution and my accidental jwkrahn-inspired llil2grt.pl Perl GRT solution, I couldn't restrain myself from trying to steal ideas from these two nodes to improve the (9 second) speed of my fastest llil2a C++ solution.

      llil2m.cpp

      Here is my C++ solution inspired by marioroy's famous two-sort Perl concoction.

      The above program can be compiled with or without MARIOROY_TWO_SORT_TRICK_L defined.

      Without MARIOROY_TWO_SORT_TRICK_L defined I saw:

      > llil2m big1.txt big2.txt big3.txt >llil2m.tmp
      llil2m start
      get_properties CPU time : 4.187 secs
      vector copy CPU time    : 0.612 secs
      vector sort CPU time    : 2.89 secs
      write stdout CPU time   : 1.383 secs
      total CPU time          : 9.074 secs
      total wall clock        : 9 secs
      So far so good. This is just the fastest C++ solution found previously.

      Beside myself with excitement, I defined MARIOROY_TWO_SORT_TRICK_L ... but sobbed when I saw:

      > llil2m big1.txt big2.txt big3.txt >llil2m.tmp
      llil2m start
      get_properties CPU time     : 4.286 secs
      vector copy CPU time        : 0.613 secs
      vector stable sort CPU time : 2.977 secs
      write stdout CPU time       : 1.363 secs
      total CPU time              : 9.244 secs
      total wall clock            : 9 secs
      Memory use (Windows Private Bytes): 1,218,716K

      No improvement. At all. Aaaargh! When I changed stable_sort to sort the sort speed improved dramatically from 2.977 secs to just 0.750 secs ... but of course the output was wrong. It seems there is a high performance price to pay for a stable sort in this case.

      Oh well, at least we now have an alternative nine second solution in C++ using a stable sort.

      Sorting Algorithms

      BTW, from sort - perl pragma to control sort() behaviour

      The sort pragma is now a no-op, and its use is discouraged. ... The default sort has been stable since v5.8.0, and given this consistent behaviour for almost two decades, everyone has come to assume stability.

      Given the above performance difference between stable and non-stable sorts, this looks like an unfortunate decision to me. After all, there may be times when you really want the performance boost of a non-stable sort in Perl.

      Update: Stroustrup, in his C++ Programming Language book, notes that stable_sort improves towards N*log(N) when the system has sufficient memory, but degrades to N*log(N)*log(N) otherwise ... so I guess the use case for needing a non-stable sort in Perl is when you are sorting huge arrays. See also Sorting algorithm (wikipedia).

      llil2grt.cpp

      Here is my C++ solution inspired by this llil2grt.pl Perl GRT solution.

      Desperate to avoid the sort function after being burnt during my llil2m.cpp adventure, and remembering jwkrahn's cute GRT trick of achieving a descending sort via an ascending sort of negative numbers, I had to try a similar stunt in C++.

      In llil2grt below, I avoided sort simply by creating an inverted set_int_str_type container from the existing map_str_int_type container.

      // llil2grt.cpp.
      // Inspired by llilgrt.pl: improve sort performance via a negative count.
      // g++ compile on Linux:
      //   g++ -o llil2grt -std=c++11 -Wall -O3 llil2grt.cpp
      // This g++ command also works with the mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
      // that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
      // Example run: llil2grt tt1.txt tt2.txt tt3.txt >out.txt
      #include <cstddef>
      #include <cstdint>
      #include <cstdlib>
      #include <cstring>
      #include <ctime>
      #include <cstdio>
      #include <string>
      #include <vector>
      #include <set>
      #include <map>
      #include <algorithm>
      #include <utility>
      #include <iostream>
      #include <fstream>
      #include <sstream>

      static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");

      // ----------------------------------------------------------------------------
      // Crude hack to see Windows Private Bytes in Task Manager by sleeping at
      // program end (see also sleep hack at end of main)
      // #include <chrono>
      // #include <thread>
      // ----------------------------------------------------------------------------

      typedef long long llil_int_type;
      using str_int_type     = std::pair<std::string, llil_int_type>;
      using int_str_type     = std::pair<llil_int_type, std::string>;
      using map_str_int_type = std::map<std::string, llil_int_type>;
      using vec_str_int_type = std::vector<str_int_type>;
      using set_int_str_type = std::set<int_str_type>;

      // Mimic the Perl get_properties subroutine ----------------------------
      // Limit line length and use ANSI C functions to try to boost performance
      #define MAX_LINE_LEN_L 255
      static void get_properties(
         int               nfiles,    // in: the number of input files
         char*             fname[],   // in: the input file names
         map_str_int_type& hash_ret)  // out: a hash of properties
      {
         FILE* fh;
         char line[MAX_LINE_LEN_L+1];
         char* word;
         llil_int_type count;
         for (int i = 0; i < nfiles; ++i) {
            fh = ::fopen(fname[i], "r");
            if (fh == NULL) {
               std::cerr << "Error opening '" << fname[i] << "' : errno=" << errno << "\n";
               continue;
            }
            while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
               word  = ::strtok(line, "\t");
               count = ::atoll( ::strtok(NULL, "\n") );
               hash_ret[word] -= count;
            }
            ::fclose(fh);
         }
      }

      // ----------------------------------------------------------------------
      int main(int argc, char* argv[])
      {
         if (argc < 2) {
            std::cerr << "usage: llil2grt file1 file2 ... >out.txt\n";
            return 1;
         }
         std::cerr << "llil2grt start\n";
         time_t  tstart1 = ::time(NULL);
         clock_t cstart1 = ::clock();

         // Create the hash of properties
         map_str_int_type hash_ret;
         get_properties(argc - 1, &argv[1], hash_ret);

         clock_t cend1 = ::clock();
         double ctaken1 = (double) (cend1 - cstart1) / (double)CLOCKS_PER_SEC;
         std::cerr << "get_properties CPU time : " << ctaken1 << " secs\n";

         clock_t cstart2 = ::clock();

         // To avoid calling sort(), try creating an inverted std::set container
         // Note: negative count gives desired ordering (just like Perl GRT sort :)
         set_int_str_type s;
         for ( auto const& kv : hash_ret ) {
            // Note: emplace_hint() was a lot faster than emplace()
            s.emplace_hint( s.end(), kv.second, kv.first );
         }
         clock_t cend2s = ::clock();

         // Output the (already sorted) std::set - no sort() function required!
         // Note: fix up negative count via -n.first
         for ( auto const& n : s ) { std::cout << n.second << '\t' << -n.first << '\n'; }

         clock_t cend2 = ::clock();
         time_t tend2  = ::time(NULL);
         long   ttaken   = static_cast<long>(::difftime(tend2, tstart1) + 0.5);
         double ctaken   = (double) (cend2  - cstart1) / (double)CLOCKS_PER_SEC;
         double ctaken2s = (double) (cend2s - cstart2) / (double)CLOCKS_PER_SEC;
         double ctaken2o = (double) (cend2  - cend2s)  / (double)CLOCKS_PER_SEC;
         std::cerr << "emplace set sort CPU time : " << ctaken2s << " secs\n";
         std::cerr << "write stdout CPU time     : " << ctaken2o << " secs\n";
         std::cerr << "total CPU time            : " << ctaken   << " secs\n";
         std::cerr << "total wall clock time     : " << ttaken   << " secs\n";

         // Hack to see Private Bytes in Windows Task Manager
         // (uncomment next line so process doesn't exit too quickly)
         // std::this_thread::sleep_for(std::chrono::milliseconds(90000000));

         return 0;
      }

      I was pleasantly surprised to see this improved my fastest C++ solution from 9 seconds down to 7:

      > llil2grt big1.txt big2.txt big3.txt >llil2grt.tmp
      llil2grt start
      get_properties CPU time   : 4.252 secs
      emplace set sort CPU time : 1.282 secs
      write stdout CPU time     : 1.716 secs
      total CPU time            : 7.254 secs
      total wall clock time     : 7 secs
      Memory use (Windows Private Bytes): 1,626,728K

      Though faster, it's no surprise memory use increased because instead of sorting in place, you're creating a new set_int_str_type container.

      Updated: Minor cosmetic changes were made to llil2grt.cpp after posting. Added "Sorting Algorithms" section with a bit more information on stable sorts.

Re: Rosetta Code: Long List is Long :awk(1)+sort(1)
by parv (Vicar) on Dec 12, 2022 at 07:59 UTC

    In awk(1) + sort(1) (environment) ...

    #!/bin/sh
    # Source: https://perlmonks.org/index.pl?node_id=11148773
    #
    # This is one shell implementation based on the problem specification ...
    #
    #   Rosetta Code: Long List is Long, 20221130,
    #   by eyepopslikeamosquito
    #   https://perlmonks.org/?node_id=11148465

    case $# in
      0 )
        printf "Give a list of files to sort.\n" >&2
        exit 1
        ;;
    esac

    start=$( date '+%s' )

    # Takes ~135 s.
    awk ' \
      { cat_count[ $1 ] += $2 } \
      END \
      { for ( cat in cat_count ) \
        { printf "%s\t%s\n", cat, cat_count[ cat ] } \
      } \
    ' $@ \
    | sort -k2,2rn -k1,1

    end=$( date '+%s' )
    printf "total time: %d s\n" $(( end - start )) >&2

      I tried LANG=C and sorting individually (two sorts).

      Results from a Linux box:

      54 seconds  LANG=en_US.UTF-8
      33 seconds  LANG=C  sort -k2,2rn -k1,1
      23 seconds  LANG=C  sort -k1,1 | sort -k2,2rn

      Testing:

      #!/bin/sh
      # https://www.perlmonks.org/?node_id=11148773

      if [ $# -eq 0 ]; then
        printf "Give a list of files to sort.\n" >&2
        exit 1
      fi

      LANG=C awk '
        { cat_count[ $1 ] += $2 }
        END {
          for ( cat in cat_count )
            printf "%s\t%s\n", cat, cat_count[ cat ]
        }
      ' $@ \
      | sort -k1,1 | sort -k2,2rn

      printf "total time: %d s\n" $SECONDS >&2

        Good point about using LANG=C (it makes for a fairer comparison: in my Python version I had set ASCII encoding to parse the input, but not during sorting🤔).

        With that change it takes ~99 s; with that and the two sorts, ~60 s.

Re: Rosetta Code: Long List is Long
by Anonymous Monk on Dec 02, 2022 at 00:22 UTC

    Maybe it is not worth anything, but this just exploits a loophole: ridiculously short fixed-length keys (6 bytes, a..z only), with counts fitting a single byte. $r is just to quickly convert decimal to/from base-27, nothing more. Sure, it could all be optimized. Excluding 0 from the base-27 representation (i.e. using 27 instead of 26) avoids padding with leading zeroes. It follows that the $buf_in length is a wasteful 387,420,489 instead of 308,915,776, but, what the heck, I can be royally wasteful with this implementation. Decrementing the count down from 255 instead of just incrementing from 0 is a further silly optimization which I only noticed this late in the day; no doubt I'm overlooking better improvements. Of course $MAX is 27**6 (or would be 26**6).

    I can't confirm what eyepopslikeamosquito says about memory: with the original llil.pl I see the Working Set go to approx. 2.9 GB. Mine doesn't exceed ~530 MB.

    llil start
    get_properties : 15 secs
    sort + output  : 103 secs
    total          : 118 secs

    my_test start
    get_properties : 31 secs
    sort + output  : 10 secs
    total          : 41 secs

    ...

    use strict;
    use warnings;
    use feature 'say';
    use Math::GMPz ':mpz';
    use Sort::Packed 'sort_packed';

    @ARGV or die "usage: $0 file...\n";
    my @llil_files = @ARGV;
    warn "my_test start\n";
    my $tstart1 = time;

    my $r = Rmpz_init;
    Rmpz_set_str( $r, 'zzzzzz' =~ tr/a-z1-9/1-9a-z/r, 27 );
    my $MAX = Rmpz_get_ui( $r );
    my ( $buf_in, $buf_out ) = ( "\xFF" x $MAX, '' );

    for my $fname ( @llil_files ) {
        open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
        while ( <$fh> ) {
            chomp;
            my ( $word, $count ) = split /\t/;
            $word =~ tr/a-z1-9/1-9a-z/;
            Rmpz_set_str( $r, $word, 27 );
            vec( $buf_in, Rmpz_get_ui( $r ), 8 ) -= $count;
        }
        close( $fh ) or die "error: close '$fname': $!";
    }

    while ( $buf_in =~ /[^\xFF]/g ) {
        Rmpz_set_ui( $r, @- );
        $buf_out .= pack 'aa6', $&, Rmpz_get_str( $r, 27 ) =~ tr/1-9a-z/a-z/r
    }

    my $tend1 = time;
    warn "get_properties : ", $tend1 - $tstart1, " secs\n";
    my $tstart2 = time;

    sort_packed C7 => $buf_out;

    while ( $buf_out ) {
        my ( $count, $word ) = unpack 'Ca6', substr $buf_out, 0, 7, '';
        printf "%s\t%d\n", $word, 255 - $count
    }

    my $tend2 = time;
    warn "sort + output: ", $tend2 - $tstart2, " secs\n";
    warn "total: ", $tend2 - $tstart1, " secs\n";

    What follows is fiction, not implemented in code, and can be ignored. I said 'ridiculous' above, but in fact I do remember the original LLIL thread; I'm not sure now, but back then I thought the keys were expected to be significantly longer than qw/foo bar aaaaaa/, etc. (genetic sequences?). In that case the input files, mostly keys, would total multiple GB, so just keeping those keys in RAM is out of the question, never mind building and keeping hashes and working with them.

    I thought about high-quality hashing (xxHash?) of the keys and sparsely storing (Judy?) values indexed by the resulting integer, where each value is e.g. a 64-bit packed integer comprised of:

    • file id
    • offset of start of line containing unique key first seen (tell)
    • count (updated as files are read in)

    After all files are consumed, value positions within array (i.e. indexes) are no longer important. IF densely-packed array data (i.e. discard zero values) fits RAM, then problem is solved. Sort packed data, and produce output, which, yes, would require randomly reading lines (i.e. real keys) from input files AGAIN based on stored file id and line position.

      For reference, results of llil2d.pl (11148585) for my hardware are:

      llil2d start
      get_properties : 10 secs
      sort + output  : 21 secs
      total          : 31 secs
      2317524 Kbytes of RAM were used

      (a report about resident RAM size was produced with same few lines of code as in script below)

      In fact, "words" are so short (and few) that I realized (later, after I wrote 11148660 with results there) that I don't need any indirection and can simply use Judy::HS: "words" are kept in RAM all the time, both in my in-memory flat "DB" and, opaquely, somewhere in Judy code.

      And then there's a solution in Perl which is both faster and requires much less memory, and generates identical output:

      my_test start
      get_properties : 13 secs
      sort + output  : 7 secs
      total          : 20 secs
      349124 Kbytes of RAM were used

      A short integer to keep the count is not required; this example only demonstrates that the count can be any length, not limited to a byte as in my previous "solution" in this thread. The same goes for the 10 bytes that "words" are padded to: it can be anything that fits the longest word; words can differ in length.

      use strict;
      use warnings;
      use Judy::HS qw/ Set Get Free /;
      use Sort::Packed 'sort_packed';

      my $DATA_TEMPLATE    = 'nZ10';
      my $DATA_SIZE        = 12;
      my $COUNT_SIZE_BYTES = 2;
      my $COUNT_SIZE_BITS  = 16;
      my $COUNT_MAX        = int( 2 ** $COUNT_SIZE_BITS - 1 );

      @ARGV or die "usage: $0 file...\n";
      my @llil_files = @ARGV;
      warn "my_test start\n";
      my $tstart1 = time;

      my ( $data, $current ) = ( '', 0 );
      my $judy;

      for my $fname ( @llil_files ) {
          open( my $fh, '<', $fname ) or die $!;
          while ( <$fh> ) {
              chomp;
              my ( $word, $count ) = split /\t/;
              ( undef, my $val ) = Get( $judy, $word );
              if ( defined $val ) {
                  vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES,
                      $COUNT_SIZE_BITS ) -= $count
              }
              else {
                  $data .= pack $DATA_TEMPLATE, $COUNT_MAX - $count, $word;
                  Set( $judy, $word, $current );
                  $current ++
              }
          }
      }
      Free( $judy );

      my $tend1 = time;
      warn "get_properties : ", $tend1 - $tstart1, " secs\n";
      my $tstart2 = time;

      sort_packed "C$DATA_SIZE", $data;

      while ( $data ) {
          my ( $count, $word ) = unpack $DATA_TEMPLATE, substr $data, 0, $DATA_SIZE, '';
          printf "%s\t%d\n", $word, $COUNT_MAX - $count
      }

      my $tend2 = time;
      warn "sort + output : ", $tend2 - $tstart2, " secs\n";
      warn "total : ", $tend2 - $tstart1, " secs\n";

      use Memory::Usage;
      my $m = Memory::Usage-> new;
      $m-> record;
      warn $m-> state-> [0][3], " Kbytes of RAM were used\n";

      What if "words" are significantly longer? With approx. 10e6 unique words in this test, if they were each hundreds of bytes, then several GB of RAM would be used just to keep them. Perhaps impractical.

      So I'm returning to my idea of keeping only hashes and offsets within files. Results I posted in 11148660 are of course valid, I was using 64-bit hashes as Judy::L integer keys.

      In fact, 32-bit hashes generated with xxHash have a few collisions for 10e6 6-character words; 64-bit hashes have no collisions. I'm not qualified to predict up to what set size it is safe to expect no collisions. Maybe the 128-bit hashes, below, are overkill.

      Just for entertainment I decided to write "indirect" solution with 128-bit hashes. Therefore I'm not using Judy::L, but same Judy::HS with keys being 32-char hex-encoded hashes. Otherwise, almost same plan and code. Data template layout chosen arbitrarily and can be adjusted.

      Of course, the words are not alphabetically sorted in the output, but OTOH that was not an original LLIL requirement. Results:

      my_test start
      get_properties : 21 secs
      sort + output  : 23 secs
      total          : 44 secs
      841880 Kbytes of RAM were used

      I think the same amount of RAM would be used if words were not 6 but 600 or 6000 bytes. The relatively fast 2nd phase (considering the huge number of random reads) is due to NVMe storage here.

      (Off topic: I managed to install Crypt::xxHash under Windows/Strawberry, but Judy was too much of a challenge for me. I wrote a solution very close to the code below, using not Judy but a very thin Inline::C wrapper for the AVL library (google for jsw_avltree). It uses approx. the same RAM and ~2x the time for the 1st phase, but also ~2.5x the time for the 2nd phase, which is exactly the same code on the same hardware (dual boot). I don't know if Windows I/O is just that much slower.)

      use strict;
      use warnings;
      use Judy::HS qw/ Set Get Free /;
      use Crypt::xxHash 'xxhash3_128bits_hex';
      use Sort::Packed 'sort_packed';

      my $DATA_TEMPLATE = 'nnNn';   # word count
                                    # file index
                                    # word position
                                    # word length
      my $DATA_SIZE        = 10;
      my $COUNT_SIZE_BYTES = 2;
      my $COUNT_SIZE_BITS  = 16;
      my $COUNT_MAX        = int( 2 ** $COUNT_SIZE_BITS - 1 );

      @ARGV or die "usage: $0 file...\n";
      my @llil_files = @ARGV;
      warn "my_test start\n";
      my $tstart1 = time;

      my ( $data, $current ) = ( '', 0 );
      my $judy;

      for my $idx ( 0 .. $#llil_files ) {
          open( my $fh, '<', $llil_files[ $idx ]) or die $!;
          until ( eof $fh ) {
              my $pos = tell $fh;
              $_ = <$fh>;
              chomp;
              my ( $word, $count ) = split /\t/;
              my $xx = xxhash3_128bits_hex( $word, 0 );
              ( undef, my $val ) = Get( $judy, $xx );
              if ( defined $val ) {
                  vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES,
                      $COUNT_SIZE_BITS ) -= $count
              }
              else {
                  $data .= pack $DATA_TEMPLATE,
                      $COUNT_MAX - $count, $idx, $pos, length $word;
                  Set( $judy, $xx, $current );
                  $current ++
              }
          }
      }
      Free( $judy );

      my $tend1 = time;
      warn "get_properties : ", $tend1 - $tstart1, " secs\n";
      my $tstart2 = time;

      sort_packed "C$DATA_SIZE", $data;

      my @fh;
      open $fh[ $_ ], '<', $llil_files[ $_ ] for 0 .. $#llil_files;

      while ( $data ) {
          my ( $count, $idx, $pos, $len ) =
              unpack $DATA_TEMPLATE, substr $data, 0, $DATA_SIZE, '';
          sysseek $fh[ $idx ], $pos, 0;
          sysread $fh[ $idx ], my( $word ), $len;
          printf "%s\t%d\n", $word, $COUNT_MAX - $count
      }

      my $tend2 = time;
      warn "sort + output : ", $tend2 - $tstart2, " secs\n";
      warn "total : ", $tend2 - $tstart1, " secs\n";

      use Memory::Usage;
      my $m = Memory::Usage-> new;
      $m-> record;
      warn $m-> state-> [0][3], " Kbytes of RAM were used\n";

        Thank you, Anonymous Monk! I tried and enjoyed your solution, including doubling the data sizes, e.g. a 32-bit count and a key length of 20.

        $ diff llil_judyhs1.pl llil_judyhs2.pl
        7,10c7,10
        < my $DATA_TEMPLATE    = 'nZ10';
        < my $DATA_SIZE        = 12;
        < my $COUNT_SIZE_BYTES = 2;
        < my $COUNT_SIZE_BITS  = 16;
        ---
        > my $DATA_TEMPLATE    = 'NZ20';
        > my $DATA_SIZE        = 24;
        > my $COUNT_SIZE_BYTES = 4;
        > my $COUNT_SIZE_BITS  = 32;

        $ time perl llil_judyhs1.pl big1.txt big2.txt big3.txt >out1.txt
        my_test start
        get_properties : 10 secs
        sort + output  : 5 secs
        total          : 15 secs
        353468 Kbytes of RAM were used

        real    0m14.770s
        user    0m14.669s
        sys     0m0.084s

        $ time perl llil_judyhs2.pl big1.txt big2.txt big3.txt >out2.txt
        my_test start
        get_properties : 10 secs
        sort + output  : 5 secs
        total          : 15 secs
        473784 Kbytes of RAM were used

        real    0m15.073s
        user    0m14.938s
        sys     0m0.119s

        Thanks anonymonk. Excellent work!

        Though I've never used them, I've heard good things about Judy Arrays and maintain a list of references on them at PM. Might get around to actually using them one day. :)

        What if "words" are significantly longer? With approx. 10e6 unique words in this test, if they were each hundreds of bytes, then several GB of RAM would be used just to keep them. Perhaps impractical.

        Good question! Apologies, my initial test file generator was very primitive. To try to help answer your question I've quickly whipped up a test file generator that generates much longer keys (up to around 200 characters in length) and longer counts too. I was conservative with the counts because I didn't want to disqualify folks using 32-bit ints.

        # gen-long-llil.pl
        # Crude program to generate a LLiL test file with long names and counts
        # perl gen-long-llil.pl long1.txt 600
        use strict;
        use warnings;
        use autodie;

        {
            my $ordmin = ord('a');
            my $ordmax = ord('z') + 1;

            # Generate a random word
            sub gen_random_word {
                my $word  = shift;   # word prefix
                my $nchar = shift;   # the number of random chars to append
                for my $i (1 .. $nchar) {
                    $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
                }
                return $word;
            }
        }

        my $longworda = join '', 'a' .. 'z';
        my $longwordz = join '', reverse('a' .. 'z');
        my $longcount = 1_000_000;

        sub create_long_test_file {
            my $fname   = shift;
            my $howmany = shift;
            open( my $fh_out, '>', $fname );

            # Some with no randomness
            for my $h ( 1 .. $howmany ) {
                for my $i ( 1 .. 8 ) {
                    my $cnt   = $longcount + $i - 1;
                    my $worda = $longworda x $i;
                    my $wordz = $longwordz x $i;
                    print {$fh_out} "$worda\t$cnt\n$wordz\t$cnt\n";
                }
            }

            # Some with randomness
            my $wordlen = 1;
            for my $h ( 1 .. $howmany ) {
                for my $i ( 1 .. 8 ) {
                    my $cnt   = $longcount + $i - 1;
                    my $worda = $longworda x $i;
                    my $wordz = $longwordz x $i;
                    for my $c ( 'a' .. 'z' ) {
                        for my $z ( 1 .. 2 ) {
                            print {$fh_out} $worda . gen_random_word( $c, $wordlen )
                                . "\t" . (1000000 + $z) . "\n";
                            print {$fh_out} $wordz . gen_random_word( $c, $wordlen )
                                . "\t" . (1000000 + $z) . "\n";
                        }
                    }
                }
            }
        }

        my $outfile = shift;
        my $count   = shift;
        $outfile or die "usage: $0 outfile count\n";
        $count   or die "usage: $0 outfile count\n";
        $count =~ /^\d+$/ or die "error: count '$count' is not a number\n";
        print "generating short long test file '$outfile' with count '$count'\n";
        create_long_test_file( $outfile, $count );
        print "file size=", -s $outfile, "\n";

        I ran it like this:

        > perl gen-long-llil.pl long1.txt 600
        generating short long test file 'long1.txt' with count '600'
        file size=65616000
        > perl gen-long-llil.pl long2.txt 600
        generating short long test file 'long2.txt' with count '600'
        file size=65616000
        > perl gen-long-llil.pl long3.txt 600
        generating short long test file 'long3.txt' with count '600'
        file size=65616000

        Then reran my two biggish benchmarks with a mixture of files:

        > perl llil2d.pl big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >perl2.tmp
        llil2d start
        get_properties : 11 secs
        sort + output  : 23 secs
        total          : 34 secs
        > llil2a big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >cpp2.tmp
        llil2 start
        get_properties : 6 secs
        sort + output  : 5 secs
        total          : 11 secs
        > diff cpp2.tmp perl2.tmp

        Improved test file generators welcome.

        Updated Test File Generators

        These were updated to allow a "\n" (rather than "\r\n") on Windows after this was pointed out here. Curiously, \n seems to be slower than \r\n on Windows if you don't set binmode! I am guessing that chomp is slower with \n than with \r\n on a Windows text stream.

        gen-llil.pl

        # gen-llil.pl
        # Crude program to generate a big LLiL test file to use in benchmarks
        # On Windows running:
        #   perl gen-llil.pl big2.txt 200 3     - produces a test file with size = 35,152,000 bytes
        #                                         (lines terminated with "\r\n")
        #   perl gen-llil.pl big2.txt 200 3 1   - produces a test file with size = 31,636,800 bytes
        #                                         (lines terminated with "\n")
        # On Unix, lines are terminated with "\n" and the file size is always 31,636,800 bytes
        use strict;
        use warnings;
        use autodie;

        {
            my $ordmin = ord('a');
            my $ordmax = ord('z') + 1;

            # Generate a random word
            sub gen_random_word {
                my $word  = shift;   # word prefix
                my $nchar = shift;   # the number of random chars to append
                for my $i (1 .. $nchar) {
                    $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
                }
                return $word;
            }
        }

        sub create_test_file {
            my $fname   = shift;
            my $count   = shift;
            my $wordlen = shift;
            my $fbin    = shift;
            open( my $fh_out, '>', $fname );
            $fbin and binmode($fh_out);
            for my $c ( 'aaa' .. 'zzz' ) {
                for my $i (1 .. $count) {
                    print {$fh_out} gen_random_word( $c, $wordlen ) . "\t" . 1 . "\n";
                }
            }
        }

        my $outfile = shift;
        my $count   = shift;
        my $wordlen = shift;
        my $fbin    = shift;   # default is to use text stream (not a binary stream)
        defined($fbin) or $fbin = 0;
        $outfile or die "usage: $0 outfile count wordlen\n";
        $count   or die "usage: $0 outfile count wordlen\n";
        print "generating test file '$outfile' with count '$count' (binmode=$fbin)\n";
        create_test_file($outfile, $count, $wordlen, $fbin);
        print "file size=", -s $outfile, "\n";

        gen-long-llil.pl

        # gen-long-llil.pl
        # Crude program to generate a LLiL test file with long names and counts
        # perl gen-long-llil.pl long1.txt 600
        # On Windows running:
        #   perl gen-long-llil.pl long1.txt 600     - produces a test file with size = 65,616,000 bytes
        #                                             (lines terminated with "\r\n")
        #   perl gen-long-llil.pl long1.txt 600 1   - produces a test file with size = 65,107,200 bytes
        #                                             (lines terminated with "\n")
        # On Unix, lines are terminated with "\n" and the file size is always 65,107,200 bytes
        use strict;
        use warnings;
        use autodie;

        {
            my $ordmin = ord('a');
            my $ordmax = ord('z') + 1;

            # Generate a random word
            sub gen_random_word {
                my $word  = shift;   # word prefix
                my $nchar = shift;   # the number of random chars to append
                for my $i (1 .. $nchar) {
                    $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
                }
                return $word;
            }
        }

        my $longworda = join '', 'a' .. 'z';
        my $longwordz = join '', reverse('a' .. 'z');
        my $longcount = 1_000_000;

        sub create_long_test_file {
            my $fname   = shift;
            my $howmany = shift;
            my $fbin    = shift;
            open( my $fh_out, '>', $fname );
            $fbin and binmode($fh_out);

            # Some with no randomness
            for my $h ( 1 .. $howmany ) {
                for my $i ( 1 .. 8 ) {
                    my $cnt   = $longcount + $i - 1;
                    my $worda = $longworda x $i;
                    my $wordz = $longwordz x $i;
                    print {$fh_out} "$worda\t$cnt\n$wordz\t$cnt\n";
                }
            }

            # Some with randomness
            my $wordlen = 1;
            for my $h ( 1 .. $howmany ) {
                for my $i ( 1 .. 8 ) {
                    my $cnt   = $longcount + $i - 1;
                    my $worda = $longworda x $i;
                    my $wordz = $longwordz x $i;
                    for my $c ( 'a' .. 'z' ) {
                        for my $z ( 1 .. 2 ) {
                            print {$fh_out} $worda . gen_random_word( $c, $wordlen )
                                . "\t" . (1000000 + $z) . "\n";
                            print {$fh_out} $wordz . gen_random_word( $c, $wordlen )
                                . "\t" . (1000000 + $z) . "\n";
                        }
                    }
                }
            }
        }

        my $outfile = shift;
        my $count   = shift;
        my $fbin    = shift;   # default is to use text stream (not a binary stream)
        defined($fbin) or $fbin = 0;
        $outfile or die "usage: $0 outfile count\n";
        $count   or die "usage: $0 outfile count\n";
        $count =~ /^\d+$/ or die "error: count '$count' is not a number\n";
        print "generating short long test file '$outfile' with count '$count' (binmode=$fbin)\n";
        create_long_test_file( $outfile, $count, $fbin );
        print "file size=", -s $outfile, "\n";

        Updated this node with new test file generators so you can generate test files that are the same size on Unix and Windows. That is, by setting $fbin you can make the line ending "\n" on Windows, instead of "\r\n". See Re^2: Rosetta Code: Long List is Long (faster) for more background.

Re: Rosetta Code: Long List is Long (faster)
by Anonymous Monk on Dec 27, 2022 at 14:47 UTC
    Please feel free to respond away with solutions...

    Short version: the J script runs in ~5.7 sec using ~650 MB to produce exactly the same output, which makes it the fastest of all solutions so far.

    (The caveat: "words" are assumed to be not too different in length; they are temporarily padded to the longest length during the program run (i.e. not padded at all with the current test sample). Expect deterioration if, say, their lengths differ by two or more orders of magnitude, especially if just a few are very long. I didn't check.)

    ###################### (Prose and previous results, can be skipped)

    I was shocked to notice this thread is almost a month old already. While I'm in no hurry and have been pursuing what follows at leisure and only rarely (kind of "for dessert"), it's better to publish this at long last, be it the "final" optimized version or not (I'm sure it can be improved a lot), before the thread goes cold and the participants have to make an effort to re-read their own code because of the time elapsed.

    As a reference frame, here is an assortment of results from previous solutions, on my hardware:

    llil2grt.pl (see 11148713) (fastest "pure core-only Perl"), Windows:

    llil2grt start
    get_properties : 16 secs
    sort + output  : 31 secs
    total          : 47 secs
    Peak working set (memory): 2,744,200K

    Same code & data & PC, Linux (I didn't investigate why there is such a difference):

    llil2grt start
    get_properties : 11 secs
    sort + output  : 20 secs
    total          : 31 secs
    2,152,848 Kbytes of RAM were used

    I assume that Judy (see 11148585) is best in both speed and memory among non-parallel Perl solutions, with the same caveat as at the very top: "words" are temporarily padded to a fixed length, e.g. here to 10 bytes:

    my_Judy_test start
    get_properties : 13 secs
    sort + output  : 7 secs
    total          : 20 secs
    349,124 Kbytes of RAM were used

    Being a lazy bum, I didn't compile the C++ solutions (nor do I code in C++). Here is a copy-paste from 11148969; I assume it is the best result so far, among C++ and all others. (For my PC, I expect the time to be worse.)

    llil2grt start
    get_properties CPU time : 4.252 secs
    emplace set sort CPU time : 1.282 secs
    write stdout CPU time : 1.716 secs
    total CPU time : 7.254 secs
    total wall clock time : 7 secs
    Memory use (Windows Private Bytes): 1,626,728K

    ###################### (End of prose and previous results)

    The code below generates the next message, with RAM usage taken from Windows Task Manager (to be on par with how it was measured for Perl), while the script pauses for a key (the usual Ctrl-D on Linux, or Ctrl-Z + Enter on Windows) after it finishes:

    Read and parse input: 1.636
    Classify, sum, sort: 2.206
    Format and write output: 1.895
    Total time: 5.737

    Finished. Waiting for a key...

    Peak working set (memory): 657,792K

    The injection of CR into output lines is only required on Windows (actually, not required at all) to later ensure there is no difference with the output from Perl. The "magic constant" 3 for number width can be anything, and is only used for an intermediate step.

    I had to make this code slightly less readable than it was during development by somewhat aggressively re-using the same variable names for words and nums over and over again, as data are processed and modified while the script progresses. There were different "self-explanatory" names at each stage, but because the arrays are huge, it's better to immediately overwrite a variable on successive assignments to conserve memory. "Erasing" throw-away helper arrays (similar to undef in Perl) serves the same purpose.

    Actually, during development I was playing with this toy dataset; here's the original data and result:

       text =: noun define
    tango	1
    charlie	2
    bravo	1
    foxtrot	4
    alpha	3
    foxtrot	1
    bravo	1
    foxtrot	7
    )
    
        NB. Do work here...
        
        ] text
    foxtrot	12
    alpha	3
    bravo	2
    charlie	2
    tango	1
    

    The script:

    NB. -----------------------------------------------------------
    NB. --- This file is "llil.ijs"
    NB. --- Run as e.g.:
    NB.
    NB.     jconsole.exe llil.ijs big1.txt big2.txt big3.txt out.txt
    NB.
    NB. --- (NOTE: last arg is output filename, file is overwritten)
    NB. -----------------------------------------------------------

    args =: 2 }. ARGV
    fn_out =: {: args
    fn_in =: }: args

    NUM_LENGTH =: 3
    PAD_CHAR =: ' '

    make_sel =: [: (1 2 0& |:) @ ,: ([ ,. ] - [: , [)
    sort_some =: ([: /:~ {)`[`] }

    text =: , freads " 0 fn_in
    lf_pos =: I. text = LF
    tab_pos =: I. text = TAB
    words =: ((0 , >: }: lf_pos) make_sel tab_pos) ];.0 text
    nums =: 0&". (tab_pos make_sel lf_pos) ; @: (<;.0) text
    erase 'text' ; 'lf_pos' ; 'tab_pos'

    t1 =: (6!:1) ''  NB. time since engine start

    nums =: words +//. nums
    words =: ~. words
    'words nums' =: (\: nums)& { &.:>"_1 words ; nums

    starts =: I. ~: nums
    ranges =: starts ,. (}. starts , # nums) - starts
    count =: # starts

    sort_words =: monad define
    'ranges words' =. y
    range =. ({. + i. @ {:) {. ranges
    (}. ranges) ; range sort_some words
    )

    words =: > {: sort_words ^: count ranges ; words
    erase 'starts' ; 'ranges'

    t2 =: (6!:1) ''  NB. time since engine start

    nums =: (- NUM_LENGTH) ]\ NUM_LENGTH ": nums
    text =: , words ,. TAB ,. (nums ,. CR) ,. LF
    erase 'words' ; 'nums'
    text =: (#~ ~: & PAD_CHAR) text
    text fwrite fn_out
    erase < 'text'

    t3 =: (6!:1) ''  NB. time since engine start

    echo 'Read and parse input:    ' , ": t1
    echo 'Classify, sum, sort:     ' , ": t2 - t1
    echo 'Format and write output: ' , ": t3 - t2
    echo 'Total time:              ' , ": t3
    echo ''
    echo 'Finished. Waiting for a key...'
    stdin ''
    exit 0

      I was shocked to notice this thread is almost a month old already ... it's better to publish this at long last, be it "final" optimized version or not (I'm sure it can be improved a lot), before the thread is dead cold and whoever participated have to make effort to read their own code because of time elapsed

      Thanks for posting. I suggest you just take your time and enjoy it without worrying too much about how old the thread is. This is actually one of my favourite features of Perl Monks, so much so that I wrote a meditation about it: Necroposting Considered Beneficial. :)

      Your reply motivated me to finally get around to installing Linux on my newish Windows laptop. Being lazy, I did this with a single wsl --install command to install the default Ubuntu distribution of Linux from the Microsoft Store. AFAIK, the main alternative is to install VMware, followed by multiple different Linux distros.

      Anyways, after doing that I saw similar performance from both my fastest Perl and C++ versions.

      For llil2grt.cpp on Windows 11:

      llil2grt start
      get_properties CPU time : 4.252 secs
      emplace set sort CPU time : 1.282 secs
      write stdout CPU time : 1.716 secs
      total CPU time : 7.254 secs
      total wall clock time : 7 secs

      On Ubuntu WSL2 Linux (5.15.79.1-microsoft-standard-WSL2), running on the same hardware, compiled with the identical g++ -o llil2grt -std=c++11 -Wall -O3 llil2grt.cpp, we see it runs slightly faster:

      llil2grt start
      get_properties CPU time : 3.63153 secs
      emplace set sort CPU time : 1.09085 secs
      write stdout CPU time : 1.41164 secs
      total CPU time : 6.13412 secs
      total wall clock time : 6 secs

      Sadly, it's becoming increasingly obvious that to make this run significantly faster, I'll probably have to change the simple and elegant:

      hash_ret[word] -= count;
      into something much uglier, possibly containing "emplace_hint" or other std::map claptrap ... and I just can't bring myself to do that. :) More attractive is to leave the simple and elegant one-liner alone and instead try to inject a faster custom std::map memory allocator (I have no idea how to do that yet).

      Conversely, my fastest Perl solution llil2grt.pl runs slightly slower on Ubuntu:

      llil2grt start
      get_properties : 10 secs
      sort + output  : 22 secs
      total          : 32 secs
      compared to 10, 20, 30 secs on Windows 11. Perl v5.34.0 on Linux vs Strawberry Perl v5.32.0 on Windows.

      Update: this shortened version, llil2cmd-long.pl, runs a bit faster on Ubuntu (but not on Windows):

      llil2cmd-long start
      get_properties : 7 secs
      sort + output  : 21 secs
      total          : 28 secs

      The injection of CR into output lines is only required on Windows (actually, not required at all)

      Yes, you're right, Windows nowadays seems perfectly happy with text files terminated with \n rather than the traditional DOS \r\n. By default, Perl and C++ both output text files with "\n" on Unix and "\r\n" on Windows. I'll update my test file generator to generate identical "\n" terminated files on both Unix and Windows.

      Update: Test file generators updated here: Re^3: Rosetta Code: Long List is Long (Test File Generators). Curiously, \n seems to be slower than \r\n on Windows if you don't set binmode! I am guessing that chomp is slower with \n than with \r\n on a Windows text stream.

      Update: Are you running Strawberry Perl on Windows? Which version? (Trying to understand why your Windows Perl seems slower than mine).

      Update: The processor and SSD disk (see Novabench top scoring disks) on my HP laptop:

      Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz, 1501 Mhz, 4 Core(s), 8 Logical Processor(s)
      Disk (238 GB SSD): Intel Optane+238GBSSD (ave sequential read 513 MB/s; ave sequential write 280 MB/s) - score 73

      References Added Later

      Updated: Noted that llil2grt.pl on Ubuntu Linux runs slightly slower on Linux than Windows, along with detailed timings. Clarified that I'm running Windows 11. Added more detail on my laptop hardware. Mentioned llil2cmd-long.pl, developed later.

        By "same PC" for running a test under Windows vs. Linux, I meant classic dual boot with GRUB, as I have; but perhaps virtualization is no longer a hindrance, and, moreover, not in this case. I compiled llil2grt.cpp in both OSes (command line and options you advised); here's what I got:

        $ ./llil2grt big1.txt big2.txt big3.txt >out.txt
        llil2grt start
        get_properties CPU time : 3.67015 secs
        emplace set sort CPU time : 1.19724 secs
        write stdout CPU time : 1.52074 secs
        total CPU time : 6.3882 secs
        total wall clock time : 6 secs

        >llil2grt.exe big1.txt big2.txt big3.txt >out.txt
        llil2grt start
        get_properties CPU time : 5.577 secs
        emplace set sort CPU time : 1.675 secs
        write stdout CPU time : 2.484 secs
        total CPU time : 9.736 secs
        total wall clock time : 10 secs

        I observe the same relative difference running the Perl script. So it looks like it's not an issue of the Perl and Strawberry version you asked me about (which is the latest available, "5.32.1.1-64bit-PDL", BTW). I further compiled llil2grt.cpp using the MinGW shell and g++ which came with the older 5.26 Strawberry, and got the same 10 secs.

        I'm clueless as to why this PC is slow with Windows. Perhaps either MS or Dell issued a patch in recent years to address a Kaby Lake CPU (mine is i5-7500T) "vulnerability"? I vaguely remember it was said performance would suffer if such a patch to guard against mythical attackers were applied. Just a guess, sorry if it's ridiculous. On the other hand, the J script performs the same in both OSes.

        Speaking of which, to return to "interpreted language is faster than compiled C++" -- of course it's dataset bias to a large extent. I ran test with "long" files from another test file generator you suggested previously:

        $ ./llil2grt long1.txt long2.txt long3.txt >out.txt
        llil2grt start
        get_properties CPU time : 0.559273 secs
        emplace set sort CPU time : 0.004393 secs
        write stdout CPU time : 0.003275 secs
        total CPU time : 0.567 secs
        total wall clock time : 1 secs

        $ jconsole llil.ijs long1.txt long2.txt long3.txt out_j.txt
        Read and parse input: 1.50791
        Classify, sum, sort: 0.70953
        Format and write output: 0.00430393
        Total time: 2.22175

        $ diff out.txt out_j.txt
        $

        (NUM_LENGTH changed to 10, otherwise same code)

      I tried the J script demonstration, by Anonymous Monk, on a box running Clear Linux. The following are the steps taken.

      Installation:

      cd ~/Downloads
      wget http://www.jsoftware.com/download/j903/install/j903_linux64.tar.gz
      tar xzf j903_linux64.tar.gz -C ~/

      Script modification:

      Next, I made a change to not output CR, so I was able to compare against the original output without CR. For some reason, pressing a key or the return key does not exit the script (resolved by removing two lines).

      $ diff llil.ijs llil2.ijs
      50c50
      < text =: , words ,. TAB ,. (nums ,. CR) ,. LF
      ---
      > text =: , words ,. TAB ,. nums ,. LF
      64,65d63
      < echo 'Finished. Waiting for a key...'
      < stdin ''

      Non-AVX results:

      $ ~/j903/bin/jconsole llil2.ijs big1.txt big2.txt big3.txt out.txt
      Read and parse input: 0.850145
      Classify, sum, sort: 2.08596
      Format and write output: 1.24986
      Total time: 4.18597

      AVX2 results:

      From the wiki, "The initial installation is non-AVX and finishing the steps upgrades to AVX or AVX2 as appropriate."

      # Update J engine to AVX or AVX2, depending on the CPU.
      $ ~/j903/updateje.sh
      Engine: j903/j64avx2/linux
      Release-b: commercial/2022-01-28T04:13:29
      Library: 9.03.08
      Platform: Linux 64
      Installer: J903 install
      InstallPath: /home/mario/j903
      Contact: www.jsoftware.com

      $ ~/j903/bin/jconsole llil2.ijs big1.txt big2.txt big3.txt out.txt
      Read and parse input: 0.789199
      Classify, sum, sort: 1.56741
      Format and write output: 1.12442
      Total time: 3.48103

      Anonymous Monk, blessings and grace. Thank you, for the J Programming Language demonstration.

      J script runs in ~5.7 sec using ~650 MB to produce exactly same output. Which makes it fastest of all solutions so far.

      The good news is that I've now found two different C++ versions that are faster. The bad news is that they're much uglier than my original llil2grt.cpp (which ran in about 6.2 seconds on Ubuntu). After getting nowhere trying to speed up llil2grt.cpp, I reluctantly decided that the simple and elegant hash_ret[word] -= count had to go.

      First the timings on Ubuntu. Because there's quite a bit of natural variation between runs, I did three runs of each:

      llil2vec start
      get_properties CPU time : 3.06313 secs
      emplace set sort CPU time : 0.923435 secs
      write stdout CPU time : 1.392 secs
      total CPU time : 5.37868 secs
      total wall clock time : 6 secs

      llil2vec start
      get_properties CPU time : 3.1567 secs
      emplace set sort CPU time : 0.970294 secs
      write stdout CPU time : 1.22305 secs
      total CPU time : 5.35015 secs
      total wall clock time : 5 secs

      llil2vec start
      get_properties CPU time : 3.32019 secs
      emplace set sort CPU time : 1.08277 secs
      write stdout CPU time : 1.22461 secs
      total CPU time : 5.62766 secs
      total wall clock time : 5 secs

      Ave CPU time: 5.5 secs (Memory use (Windows Private Bytes): 1,225,580K)

      llil2vec (fixed string length=6) start
      get_properties CPU time : 2.09353 secs
      emplace set sort CPU time : 0.795144 secs
      write stdout CPU time : 1.20994 secs
      total CPU time : 4.09871 secs
      total wall clock time : 4 secs

      llil2vec (fixed string length=6) start
      get_properties CPU time : 2.2078 secs
      emplace set sort CPU time : 0.707252 secs
      write stdout CPU time : 1.14867 secs
      total CPU time : 4.06383 secs
      total wall clock time : 4 secs

      llil2vec (fixed string length=6) start
      get_properties CPU time : 2.39225 secs
      emplace set sort CPU time : 1.0033 secs
      write stdout CPU time : 1.22765 secs
      total CPU time : 4.62331 secs
      total wall clock time : 4 secs

      Ave CPU time: 4.3 secs (Memory use (Windows Private Bytes): 814,940K)

      The first one still uses std::string so there are no limitations on word length. The second one is limited to a word length of six characters, and so works with big1.txt, while crashing spectacularly with long1.txt.

      Funny, I'd originally forgotten that way back in 2014, when desperately trying to make a search program run a hundred times faster, I'd stumbled upon this classic quote:

      That's not a minor disadvantage, we're talking about things being 50 to 100 times slower with a linked list. Compactness matters. Vectors are more compact than lists. And predictable usage patterns matter enormously. With a vector, you have to shove a lot of elements over, but caches are really really good at that.

      When you traverse lists you keep doing random access ... so you are actually random accessing your memory and you are maximizing your cache misses, which is exactly the opposite of what you want.

      -- Bjarne Stroustrup: Why you should avoid Linked Lists (3:30-5:00) (update: see also)

      Back then, because each memory access into my (huge) 4GB lookup tables was essentially random, most memory accesses missed the L1 cache, missed the L2 cache, missed the L3 cache, then waited for the cache line to be loaded into all three caches from main memory, while often incurring a TLB cache miss, just to rub salt into the wounds.

      Thankfully, there are no 4 GB lookup tables in this problem. But hashes (and linked lists) still tend to be mind-bogglingly slower than vectors on modern hardware, as indicated by Stroustrup's quote above. So I reluctantly decided that the simple and elegant hash_ret[word] -= count had to go.

      Though I really hate this new std::vector based solution, it does run faster than my earlier std::map based one. This is just a first attempt, further improvements are possible. The llil2vec.cpp source code is shown below. Though there are two sets of timings, there is only one program, compiled with or without MAX_STR_LEN_L defined. Suggestions for improvements welcome.

      llil2vec.cpp

        For the fixed string length build (I ran several times), clang++ comes out ~0.3 seconds faster than g++. See also the J script result with AVX2 enabled (total time 3.48103 secs).

        # gcc 12.2.1
        $ g++ -o llil2vec -std=c++11 -Wall -O3 -march=native -mtune=skylake -falign-functions=32 -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 llil2vec.cpp
        $ ./llil2vec big1.txt big2.txt big3.txt >out.txt
        llil2vec (fixed string length=6) start
        get_properties CPU time : 1.89609 secs
        emplace set sort CPU time : 0.544972 secs
        write stdout CPU time : 0.842451 secs
        total CPU time : 3.28355 secs
        total wall clock time : 4 secs

        # clang version 15.0.4
        $ clang++ -o llil2vec -std=c++11 -Wall -O3 -march=native -mtune=skylake -falign-functions=32 -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 llil2vec.cpp
        $ ./llil2vec big1.txt big2.txt big3.txt >out.txt
        llil2vec (fixed string length=6) start
        get_properties CPU time : 1.67073 secs
        emplace set sort CPU time : 0.474207 secs
        write stdout CPU time : 0.828457 secs
        total CPU time : 2.97343 secs
        total wall clock time : 3 secs

        Thanks to the excellent feedback from marioroy I've been able to improve the code as follows:

        • Compile MAX_STR_LEN_L version with clang++, otherwise g++
        • For MAX_STR_LEN_L version, reduced size of short strings by one char by not null-terminating them
        • Use good old printf to write to stdout (seems a bit faster than streaming to std::cout)
        • Minor code cleanups (especially around creating the inverted std::set)

        New Timings for Long String Version

        > ./llil2vec big1.txt big2.txt big3.txt >f.tmp
        llil2vec start
        get_properties CPU time : 3.06708 secs
        emplace set sort CPU time : 0.903617 secs
        write stdout CPU time : 1.23459 secs
        total CPU time : 5.20544 secs
        total wall clock time : 5 secs
        > ./llil2vec big1.txt big2.txt big3.txt >f.tmp
        llil2vec start
        get_properties CPU time : 3.4832 secs
        emplace set sort CPU time : 1.10209 secs
        write stdout CPU time : 0.939805 secs
        total CPU time : 5.52519 secs
        total wall clock time : 6 secs
        > ./llil2vec big1.txt big2.txt big3.txt >f.tmp
        llil2vec start
        get_properties CPU time : 3.42605 secs
        emplace set sort CPU time : 0.916222 secs
        write stdout CPU time : 1.00049 secs
        total CPU time : 5.343 secs
        total wall clock time : 5 secs
        > diff f.tmp vec.tmp

        This is an average time of 5.35 secs (compared to 5.5 secs in the original post).

        New Timings for Short String (MAX_STR_LEN_L=6) Version

        > For some reason, short string version is faster when compiled with clang++
        > (while long string version is not)
        > clang++ -o llil2vec -std=c++11 -Wall -O3 llil2vec.cpp
        > ./llil2vec big1.txt big2.txt big3.txt >f.tmp
        llil2vec (fixed string length=6) start
        get_properties CPU time : 1.87217 secs
        emplace set sort CPU time : 0.589238 secs
        write stdout CPU time : 0.842179 secs
        total CPU time : 3.30369 secs
        total wall clock time : 3 secs
        > ./llil2vec big1.txt big2.txt big3.txt >f.tmp
        llil2vec (fixed string length=6) start
        get_properties CPU time : 1.95909 secs
        emplace set sort CPU time : 0.610479 secs
        write stdout CPU time : 0.959859 secs
        total CPU time : 3.52965 secs
        total wall clock time : 4 secs
        > ./llil2vec big1.txt big2.txt big3.txt >f.tmp
        llil2vec (fixed string length=6) start
        get_properties CPU time : 1.86549 secs
        emplace set sort CPU time : 0.608097 secs
        write stdout CPU time : 0.857176 secs
        total CPU time : 3.33087 secs
        total wall clock time : 3 secs
        > diff f.tmp vec.tmp

        This is an average time of 3.4 secs (compared to 4.3 secs in the original post).

        New version of llil2vec.cpp

        Interim OpenMP Versions

        Note: The interim versions below are just early (conservative) openmp-safe versions made thread-safe by expunging the dreaded strtok function.

        The one with the most potential for multi-threading seems to be llil3vec.cpp because it uses vectors which can be safely used lock-free in thread-local vec_loc vectors, as shown by marioroy in llil4vec.cpp.

        While replacing std::sort and std::stable_sort with __gnu_parallel versions to sort std::vectors is a no-brainer, it seems problematic to parallelise access to a global std::map (as used in llil3grt.cpp and llil3m.cpp below) because I think you'd need some sort of locking to safely allow multiple threads to update a shared global std::map. Ideas welcome.

        Interim version of llil3vec.cpp

        Timings of llil3vec

        > g++ -o llil3vec -std=c++20 -Wall -O3 llil3vec.cpp
        > time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
        llil3vec (fixed string length=6) start
        get_properties CPU time : 1.47895 secs
        emplace set sort CPU time : 0.592783 secs
        write stdout CPU time : 0.832071 secs
        total CPU time : 2.90392 secs
        total wall clock time : 3 secs

        real 0m3.217s
        user 0m2.806s
        sys 0m0.412s
        > diff f.tmp vec.tmp

        > g++ -o llil3vec -std=c++20 -fopenmp -Wall -O3 llil3vec.cpp
        > time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
        llil3vec (fixed string length=6) start
        get_properties CPU time : 2.5809 secs
        emplace set sort CPU time : 0.793355 secs
        write stdout CPU time : 0.860855 secs
        total CPU time : 4.23521 secs
        total wall clock time : 3 secs

        real 0m2.673s
        user 0m4.073s
        sys 0m0.465s
        > diff f.tmp vec.tmp

        For the -fopenmp version note that the Unix real time is lower, but the user time is higher.

        Interim version of llil3grt.cpp

        Interim version of llil3m.cpp

        Timings

        llil3grt (fixed string length=6) start - openmp version
        get_properties CPU time : 2.34487 secs
        emplace set sort CPU time : 0.942098 secs
        write stdout CPU time : 0.855157 secs
        total CPU time : 4.14223 secs
        total wall clock time : 4 secs

        real 0m4.752s
        user 0m4.241s
        sys 0m0.511s

        $ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
        llil3m (fixed string length=6) start - openmp version
        get_properties CPU time : 2.40089 secs
        vector copy CPU time : 0.544294 secs
        vector stable sort CPU time : 5.84981 secs
        write stdout CPU time : 0.805509 secs
        total CPU time : 9.6006 secs
        total wall clock : 4 secs

        real 0m4.713s
        user 0m9.238s
        sys 0m0.679s

        llil3m (fixed string length=6) start - openmp version
        get_properties CPU time : 2.3031 secs
        vector copy CPU time : 0.544613 secs
        vector sort CPU time : 2.64047 secs
        write stdout CPU time : 0.791186 secs
        total CPU time : 6.27946 secs
        total wall clock : 4 secs

        real 0m4.182s
        user 0m6.128s
        sys 0m0.473s

        $ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
        llil3m (fixed string length=6) start
        get_properties CPU time : 2.28072 secs
        vector copy CPU time : 0.471632 secs
        vector stable sort CPU time : 0.735042 secs
        write stdout CPU time : 0.67514 secs
        total CPU time : 4.16263 secs
        total wall clock : 4 secs

        real 0m4.470s
        user 0m4.159s
        sys 0m0.311s

        $ time ./llil3m big1.txt big2.txt big3.txt >f.tmp
        llil3m (fixed string length=6) start
        get_properties CPU time : 2.30618 secs
        vector copy CPU time : 0.473185 secs
        vector sort CPU time : 1.13081 secs
        write stdout CPU time : 0.668702 secs
        total CPU time : 4.57897 secs
        total wall clock : 4 secs

        real 0m4.889s
        user 0m4.558s
        sys 0m0.331s

        $ time ./llil3vec big1.txt big2.txt big3.txt >f.tmp
        llil3vec (fixed string length=6) start
        get_properties CPU time : 1.46864 secs
        emplace set sort CPU time : 0.630914 secs
        write stdout CPU time : 0.847912 secs
        total CPU time : 2.9476 secs
        total wall clock time : 3 secs

        real 0m3.233s
        user 0m2.852s
        sys 0m0.381s

        $ time ./llil3grt big1.txt big2.txt big3.txt >f.tmp
        llil3grt (fixed string length=6) start
        get_properties CPU time : 2.34418 secs
        emplace set sort CPU time : 0.901415 secs
        write stdout CPU time : 0.90025 secs
        total CPU time : 4.14595 secs
        total wall clock time : 5 secs

        real 0m4.784s
        user 0m4.232s
        sys 0m0.551s

        Updated: The source code for llil2vec.cpp was updated after the node was first created based on feedback from the replies. Added interim version llil3vec.cpp (with some marioroy improvements added). Plus interim versions of llil3grt.cpp and llil3m.cpp. Added a few missing #include <array> (thanks marioroy).

        The good news is... The bad news is...

        The good news is that I bring good news only! :) The modified J script is faster, more versatile, uses significantly less RAM, and has been tested with the 9.04 engine to parallelize the obvious low-hanging fruit for an additional speed boost.

        NB. -----------------------------------------------------------
        NB. --- This file is "llil4.ijs"
        NB. --- Run as e.g.:
        NB.
        NB.     jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
        NB.
        NB. --- (NOTE: last arg is output filename, file is overwritten)
        NB. -----------------------------------------------------------

        pattern =: 0 1

        NB. ========> This line has a star in its right margin =======>   NB. *

        args =: 2 }. ARGV
        fn_out =: {: args
        fn_in =: }: args

        NB. PAD_CHAR =: ' '

        filter_CR =: #~ ~: & CR
        make_more_space =: ' ' I. @ ((LF = ]) +. (TAB = ])) } ]
        find_spaces =: I. @: = & ' '

        read_file =: {{
        'fname pattern' =. y
        text =. make_more_space filter_CR fread fname
        selectors =. (|.!.0 , {:) >: find_spaces text
        width =. # pattern
        height =. width <. @ %~ # selectors
        append_diffs =. }: , 2& (-~/\)
        shuffle_dims =. (1 0 3 & |:) @ ((2, height, width, 1) & $)
        selectors =. append_diffs selectors
        selectors =. shuffle_dims selectors
        literal =. < @: (}:"1) @: (];. 0) & text "_1
        numeric =. < @: (0&".) @: (; @: (<;. 0)) & text "_1
        extract =. pattern & {
        using =. 1 & \
        or_maybe =. `
        ,(extract literal or_maybe numeric) using selectors
        }}

        read_many_files =: {{
        'fnames pattern' =. y
        ,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames   NB. *
        }}

        'words nums' =: read_many_files fn_in ; pattern

        t1 =: (6!:1) ''  NB. time since engine start

        'words nums' =: (~. words) ; words +//. nums   NB. *
        'words nums' =: (\: nums)& { &.:>"_1 words ; nums
        words =: ; nums < @ /:~/. words

        t2 =: (6!:1) ''  NB. time since engine start

        text =: , words ,. TAB ,. (": ,. nums) ,. LF
        erase 'words' ; 'nums'
        text =: (#~ ~: & ' ') text
        text fwrite fn_out
        erase < 'text'

        t3 =: (6!:1) ''  NB. time since engine start

        echo 'Read and parse input:    ' , ": t1
        echo 'Classify, sum, sort:     ' , ": t2 - t1
        echo 'Format and write output: ' , ": t3 - t2
        echo 'Total time:              ' , ": t3
        echo ''
        echo 'Finished. Waiting for a key...'
        stdin ''
        exit 0

        The code above doesn't (yet) include any 9.04 features and runs OK with 9.03, but I found 9.04 slightly faster in general. I also found 9.04 a bit faster on Windows, which is the opposite of what I saw for 9.03 (the script ran faster on Linux); let's shrug it off because of the 9.04 beta status and/or my antique PC. Results below are for beta 9.04 on Windows 10 (RAM usage taken from Windows Task Manager):

        > jconsole.exe llil4.ijs big1.txt big2.txt big3.txt out.txt
        Read and parse input: 1.501
        Classify, sum, sort: 2.09
        Format and write output: 1.318
        Total time: 4.909

        Finished. Waiting for a key...

        Peak working set (memory): 376,456K

        There are 3 star-marked lines. To patch for 9.04 new features to enable parallelization, replace them with these counterparts:

        {{ for. i. 3 do. 0 T. 0 end. }} ''
        ,&.>/"2 (-#pattern) ]\ ,;(read_file @:(; &pattern)) t.'' "0 fnames
        'words nums' =: (~.t.'' words) , words +//. t.'' nums

        As you see, the 1st line replaces a comment; the 2nd and 3rd lines require just minor touches. The 2nd line launches reading and parsing of the input files in parallel. The 3rd line says to parallelize filtering for unique words and summing numbers according to the word classification. Kind of redundant double work, even as it was, as I see it. The 1st line starts 3 additional worker threads: I don't have more cores in my CPU anyway, and this script has no work easily dispatched to more workers. Then:

        Read and parse input: 0.992
        Classify, sum, sort: 1.849
        Format and write output: 1.319
        Total time: 4.16

        I would call my parallelization attempt, however crude it was, a success. Next is the output for our second "official" dataset in this thread:

        > jconsole.exe llil4.ijs long1.txt long2.txt long3.txt out.txt
        Read and parse input: 1.329
        Classify, sum, sort: 0.149
        Format and write output: 0.009
        Total time: 1.487

        ########################################################

        These are my results for the latest C++ solution (compiled using g++), to compare my efforts against:

        $ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
        llil2vec start
        get_properties CPU time : 3.41497 secs
        emplace set sort CPU time : 1.04229 secs
        write stdout CPU time : 1.31578 secs
        total CPU time : 5.77311 secs
        total wall clock time : 5 secs

        $ ./llil2vec_11149482 long1.txt long2.txt long3.txt >vec.tmp
        llil2vec start
        get_properties CPU time : 1.14889 secs
        emplace set sort CPU time : 0.057158 secs
        write stdout CPU time : 0.003307 secs
        total CPU time : 1.20943 secs
        total wall clock time : 2 secs

        $ ./llil2vec_11149482 big1.txt big2.txt big3.txt >vec.tmp
        llil2vec (fixed string length=6) start
        get_properties CPU time : 2.43187 secs
        emplace set sort CPU time : 0.853877 secs
        write stdout CPU time : 1.33636 secs
        total CPU time : 4.62217 secs
        total wall clock time : 5 secs

        I noticed that the new C++ code, supposed to be faster, is actually slower (compared to llil2grt) with the "long" dataset of the two "official" datasets used in this thread.

Re: Rosetta Code: Long List is Long - JudySL code
by marioroy (Parson) on Jan 24, 2023 at 02:50 UTC

    It has been a long journey. I'm completing my participation with a JudySL array implementation (great for minimum memory consumption).

    Updated, 2023-01-30

    Ensure random data:

    This makes for a worst case scenario. The vector implementations are fast, but consume a lot of memory.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use List::Util 'shuffle';

    my @arr = shuffle <>;
    print @arr;
    ./shuffle.pl big1.txt >tmp && mv tmp big1.txt
    ./shuffle.pl big2.txt >tmp && mv tmp big2.txt
    ./shuffle.pl big3.txt >tmp && mv tmp big3.txt
    ./shuffle.pl long1.txt >tmp && mv tmp long1.txt
    ./shuffle.pl long2.txt >tmp && mv tmp long2.txt
    ./shuffle.pl long3.txt >tmp && mv tmp long3.txt

    Running:

    The Judy demonstration consumes far less memory than the others. Eventually, I was able to run it in parallel. The Judy and Map binaries process variable-length keys; therefore, I commented out the MAX_STR_LEN_L define in llil4vec for an apples-to-apples comparison. The testing consumes one core for get_properties and parallel processing (max 8 threads) for sorting.

    $ ./llil4judy big{1,2,3}.txt long{1,2,3}.txt >out.txt
    llil4judy start
    don't use OpenMP
    use boost sort
    get properties time : 3.34859 secs
    finish merging time : 0.675672 secs
    vector stable sort time : 0.299534 secs
    write stdout time : 0.273699 secs
    total time : 4.59755 secs

    $ ./llil4map-std big{1,2,3}.txt long{1,2,3}.txt >out.txt
    llil4map start
    use std::map
    don't use OpenMP
    use boost sort
    get properties time : 11.5932 secs
    finish merging time : 1.81754 secs
    vector stable sort time : 0.286298 secs
    write stdout time : 0.296913 secs
    total time : 13.994 secs

    $ ./llil4map-uno big{1,2,3}.txt long{1,2,3}.txt >out.txt
    llil4map start
    use std::unordered_map
    don't use OpenMP
    use boost sort
    get properties time : 3.705 secs
    finish merging time : 1.18178 secs
    vector stable sort time : 0.363231 secs
    write stdout time : 0.267898 secs
    total time : 5.51796 secs

    $ ./llil4map-flat big{1,2,3}.txt long{1,2,3}.txt >out.txt
    llil4map start
    use phmap::parallel_flat_hash_map
    don't use OpenMP
    use boost sort
    get properties time : 2.27721 secs
    finish merging time : 0.153649 secs
    vector stable sort time : 0.368893 secs
    write stdout time : 0.26761 secs
    total time : 3.06742 secs

    $ ./llil4vec-no6 big{1,2,3}.txt long{1,2,3}.txt >out.txt
    llil4vec start
    don't use OpenMP
    use boost sort
    get properties time : 0.683441 secs
    sort properties time : 0.648031 secs
    vector stable sort time : 0.634999 secs
    write stdout time : 0.316875 secs
    total time : 2.28339 secs

    The J script solution from Anonymous Monk.

    ~/j904/bin/jconsole llil5.ijs big{1,2,3}.txt long{1,2,3}.txt out.txt
    Read and parse input: 2.44621
    Classify, sum, sort: 4.82435
    Format and write output: 2.05906
    Total time: 9.32962

    Links to other llil4 demonstrations:

    llil4map
    llil4vec, llil4vec-tuple
    llil4vec-tbb

    llil4judy

      The following are test runs for 60+ input files (big.txt and long.txt variants). Note: check memory consumption before running llil4vec-no6-omp.

      Q. Why did I persist, personally? A. From seeing the high memory consumption of the vector demonstrations. It felt awkward to leave or stop, so to speak. I wondered what was possible, especially for Chuma and the originating Long list is long thread, to consume less memory. Note: Chuma states, "If it matters the original lists are sorted, they don't all contain the same words, all numbers are positive integers." For clarity, the lists are sorted by value.
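      As an aside, the output ordering required by the LLiL spec at the top of this thread (descending by count, ties broken ascending by name) can be expressed as a single Perl sort. A minimal sketch, using a hypothetical %count hash standing in for the merged results:

```perl
use strict;
use warnings;

# Hypothetical merged counts, matching the example in the LLiL spec.
my %count = (hello => 12345, pearl => 94, dromedary => 70, camel => 50);

# Sort descending by count; for equal counts, sort ascending by name.
for my $name (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$name\t$count{$name}\n";    # hello first, camel last
}
```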

      Updated, 2023-01-31

      Links to the llil(4) demonstrations and test file generators:

      The (4) variants are based on llil3vec, by eyepopslikeamosquito. The 4 series attempt parallel sorting via boost, plus fast_io, OpenMP, JudySL, unordered_map, and the parallel flat-hashmap. The tuple demonstration was something I tried as well, replacing std::pair with std::tuple.

      llil4judy, shuffle.pl
      llil4map
      llil4vec, llil4vec-tuple
      llil4vec-tbb
      gen-llil.pl, gen-long-llil.pl

      Generating input files:

      The generated data is mostly sorted by key, so I shuffled a couple of input files for extra randomness. hv mentioned the lists are possibly sorted by value.

      perl gen-llil.pl big1.txt 200 3 1
      perl gen-llil.pl big2.txt 200 3 1
      perl gen-llil.pl big3.txt 200 3 1
      perl shuffle.pl big2.txt > tmp.txt
      mv tmp.txt big2.txt

      perl gen-long-llil.pl long1.txt 600
      perl gen-long-llil.pl long2.txt 600
      perl gen-long-llil.pl long3.txt 600
      perl shuffle.pl long2.txt > tmp.txt
      mv tmp.txt long2.txt

      This is another way to shuffle content and write back, via the sponge utility.

      perl shuffle.pl big2.txt | sponge big2.txt
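      If sponge is not available, the same in-place shuffle can be done in Perl alone: slurp the lines, shuffle, then rewrite the file. A sketch (shuffle_in_place is my name for the helper, not part of the posted scripts):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util 'shuffle';

# Shuffle the lines of a file in place: read all lines into memory,
# shuffle them, then overwrite the same file.
sub shuffle_in_place {
    my ($file) = @_;
    open my $in, '<', $file or die "$file: $!";
    my @lines = shuffle <$in>;
    close $in;
    open my $out, '>', $file or die "$file: $!";
    print $out @lines;
    close $out;
}

shuffle_in_place($_) for @ARGV;
```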

      Making additional input files (total: 30 big files):

      cp big1.txt big4.txt
      cp big2.txt big5.txt
      cp big3.txt big6.txt
      cp big1.txt big7.txt
      cp big2.txt big8.txt
      cp big3.txt big9.txt
      cp big1.txt biga.txt
      cp big2.txt bigb.txt
      cp big3.txt bigc.txt
      cp big1.txt bigd.txt
      cp big2.txt bige.txt
      cp big3.txt bigf.txt
      cp big1.txt bigg.txt
      cp big2.txt bigh.txt
      cp big3.txt bigi.txt
      cp big1.txt bigj.txt
      cp big2.txt bigk.txt
      cp big3.txt bigl.txt
      cp big1.txt bigm.txt
      cp big2.txt bign.txt
      cp big3.txt bigo.txt
      cp big1.txt bigp.txt
      cp big2.txt bigq.txt
      cp big3.txt bigr.txt
      cp big1.txt bigs.txt
      cp big2.txt bigt.txt
      cp big3.txt bigu.txt
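      The run of cp commands above could also be generated with a short Perl loop that cycles through the three source files. A sketch (replicate_big_files is a hypothetical helper, not part of the posted code):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Copy 'copy';

# Replicate big{1,2,3}.txt into 27 more copies (big4.txt .. bigu.txt),
# cycling through the three sources just as the cp commands do.
sub replicate_big_files {
    my ($dir) = @_;
    my @src   = map "$dir/big$_.txt", 1 .. 3;
    my @names = (4 .. 9, 'a' .. 'u');    # suffixes for the 27 copies
    for my $i (0 .. $#names) {
        copy($src[$i % 3], "$dir/big$names[$i].txt")
            or die "copy failed: $!";
    }
}

replicate_big_files('.') if -e 'big1.txt';
```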

      Running - AMD box with 8 cores, 16 threads:

      Testing with 60 files is possible simply by passing big?.txt twice as arguments (ditto for long?.txt). This means building llil4vec with MAX_STR_LEN_L commented out. Your check-sum will differ from mine, as the gen-llil.pl script generates random words.

      The C++ Parallel Hashmap container is fast. It consumes less memory than std::map and std::unordered_map. "By using parallel SSE2 instructions, these hashmaps are able to look up items by checking 16 slots in parallel, allowing the implementation to remain fast even when the table is filled up to 87.5% capacity."

      Most amazing is that the Judy array consumes the least amount of memory.

      Running, 600 list files:

      For this, we need more list files (now 60 in total). Later, I will run with big?.txt 10 times as arguments. Note: 60 files consume 2 GB on disk.

      I captured results for 9 and 15 threads. The append approach of the vector demonstration (llil4vec-omp, fixed length) is not suitable for this use case, due to excessive memory consumption. The OP in the originating thread mentioned greater than 2,000 list files.

      Side note: eyepopslikeamosquito and I used the llil{3,4}vec.cpp examples to benchmark many aspects of the application. The Judy array and Parallel Hashmap have a smaller memory footprint.

      Running, 1,200 list files:

      I ran with big?.txt 20 times as arguments. The append-to-vector strategy leaves llil4vec-no6-omp and llil4vec-omp unable to process more files once memory capacity is exceeded.

      Running on a 32-core box, 1,200 list files:

      It turns out that llil4map-flat-omp benefits from more CPU cores, allowing it to catch up with llil4judy-omp. Memory bandwidth limitations may be why it cannot run faster. This journey has shown me that there are many variables to consider: one implementation may be better than another, depending on hardware specifications. I cannot make llil4map.cpp or llil4judy.cpp run any faster. Merging occurs immediately as threads complete get_properties.

Node Type: perlmeditation [id://11148465]
Approved by choroba
Front-paged by kcott