I've long found it fun to implement the same algorithm in different languages,
especially Perl and C++ ...
and then sit back and reflect on the lessons learned ... so
when Long list is long appeared recently, I felt it was
short and interesting enough to make an excellent Rosetta code node.
Solutions to this problem must read a number of input LLiL-format files (given as command line arguments)
and write a single merged LLiL-format file to stdout.
The LLiL-format is described in the comments at the top of llil.pl below.
In the interests of keeping the code as short and fast as possible,
you may assume the input LLiL files are well-formed.
For example, you don't need to check for and remove leading and trailing whitespace on each line.
The sample solutions given below in Perl and C++ should clarify program requirements.
Please feel free to respond away with solutions to this problem in your favourite programming language
and to offer suggested improvements to my sample Perl and C++ solutions below.
What makes this problem interesting to me is the requirement
to sort the hash in descending order by value:
because the performance of such a sort may suffer when
dealing with huge files (after all, performance was the reason for the OP's
question in the first place).
I'm hoping solving this problem in multiple languages will
be fun and instructive -- and perhaps give us insight into how
performance changes as the number of items increases.
C++ Solution
Here's my C++ solution:
// llil.cpp. C++ 11 version of Perl llil.pl.
// g++ compile on Linux:
// g++ -o llil -std=c++11 -Wall -O3 llil.cpp
// This g++ command also works with mingw C++ compiler (https://source
+forge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex
+e).
// Example run: llil tt1.txt tt2.txt >out.txt
// Uncomment next line to sort by creating a multimap (instead of via
+the sort function)
// #define LLIL_SORT_VIA_MULTIMAP_L 1
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <cstdio>
#include <string>
#include <vector>
#include <set>
#include <map>
#include <unordered_map>
#include <algorithm>
#include <utility>
#include <iostream>
#include <fstream>
#include <sstream>
// ------------------------------------------------------------
// Performance hacks to speed up IO.
// See https://www.reddit.com/r/rust/comments/9xedap/how_to_achieve_fa
+st_stdinstdout_io_suitable_for/
// Avoid flush by using "\n" instead of std::endl (this made a big dif
+ference in this program!)
// This one made almost no difference in this program:
// const auto io_speed_up =[](){
// std::ios::sync_with_stdio(false);
// std::cin.tie(nullptr);
// return nullptr;
// }();
// ------------------------------------------------------------
using str_int_type = std::pair<std::string, int>;
using map_str_int_type = std::unordered_map<std::string, int>;
using vec_str_int_type = std::vector<str_int_type>;
// Mimic the Perl get_properties subroutine
static void get_properties(
int nfiles, // in: the number of input files
char* fname[], // in: the input file names
map_str_int_type& hash_ret) // out: a hash of properties
{
for (int i = 0; i < nfiles; ++i) {
std::ifstream llfile(fname[i]);
if (!llfile) {
std::cerr << "Error opening '" << fname[i] << "'\n";
return;
}
for (std::string line; std::getline(llfile, line); ) {
std::string word, count;
std::stringstream ssline(line);
std::getline(ssline, word, '\t');
std::getline(ssline, count);
hash_ret[word] += std::stoi(count);
}
}
}
#ifdef LLIL_SORT_VIA_MULTIMAP_L
// Convert a std::unordered_map<key, value> to a std::multimap<value,
+key>
static std::multimap<int, std::string> invert_map(const std::unordered
+_map<std::string, int>& m)
{
std::multimap<int, std::string> mm;
for (std::unordered_map<std::string, int>::const_iterator it = m.be
+gin(); it != m.end(); ++it) {
mm.insert( std::make_pair(it->second, it->first) );
}
return mm;
}
#endif
int main(int argc, char* argv[])
{
if (argc < 2) {
std::cerr << "usage: llil file1 file2 ... >out.txt\n";
return 1;
}
#ifdef LLIL_SORT_VIA_MULTIMAP_L
std::cerr << "llil start (multimap version)\n";
#else
std::cerr << "llil start (sort version)\n";
#endif
time_t tstart1 = ::time(NULL);
// Create the hash of properties
map_str_int_type hash_ret;
get_properties(argc - 1, &argv[1], hash_ret);
time_t tend1 = ::time(NULL);
long taken1 = static_cast<long>(::difftime(tend1, tstart1) + 0.5);
std::cerr << "get_properties : " << taken1 << " secs\n";
// Sort descending by value, i.e. mimic this Perl code in C++:
// sort { $href->{$b} <=> $href->{$a} || $a cmp $b } keys %{$href
+}
time_t tstart2 = ::time(NULL);
#ifdef LLIL_SORT_VIA_MULTIMAP_L
std::multimap<int, std::string> newmap = invert_map(hash_ret);
for (std::multimap<int, std::string>::reverse_iterator it = newmap.
+rbegin(); it != newmap.rend(); ++it) {
std::cout << it->second << '\t' << it->first << "\n";
}
#else
vec_str_int_type v( hash_ret.begin(), hash_ret.end() );
std::sort( v.begin(), v.end(),
[](const str_int_type& left, const str_int_type& right) { return
+ right.second != left.second ? right.second < left.second : left.firs
+t < right.first; }
);
for ( const std::pair<const std::string, int>& n : v ) {
std::cout << n.first << '\t' << n.second << "\n";
}
#endif
time_t tend2 = ::time(NULL);
long taken2 = static_cast<long>(::difftime(tend2, tstart2) + 0.5);
long taken = static_cast<long>(::difftime(tend2, tstart1) + 0.5);
std::cerr << "sort + output : " << taken2 << " secs\n";
std::cerr << "total : " << taken << " secs\n";
return 0;
}
This is a straightforward translation of my Perl solution into C++ 11.
I chose a C++ std::unordered_map
because it seems the closest container to a Perl hash.
Unlike Perl, you can't sort a std::unordered_map on the fly in a one-liner.
So I tried two different ways to do it:
- Copy the std::unordered_map to a std::vector and sort it via the std::sort algorithm
- Copy the std::unordered_map<key, value> to a std::multimap<value, key> (see std::multimap)
Alternative suggestions welcome.
Testing Correctness and Measuring Performance
To get a feel for performance and correctness, I used this simple and crude Perl program gen-llil.pl:
# gen-llil.pl
# Crude program to generate a big LLiL test file to use in benchmarks
# perl gen-llil.pl big2.txt 200 3 - produces a test file with size = 3
+5,152,000 bytes
use strict;
use warnings;
use autodie;
{
my $ordmin = ord('a');
my $ordmax = ord('z') + 1;
# Generate a random word
sub gen_random_word {
my $word = shift; # word prefix
my $nchar = shift; # the number of random chars to append
for my $i (1 .. $nchar) {
$word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
}
return $word;
}
}
sub create_test_file {
my $fname = shift;
my $count = shift;
my $wordlen = shift;
open( my $fh_out, '>', $fname );
for my $c ( 'aaa' .. 'zzz' ) {
for my $i (1 .. $count) {
print {$fh_out} gen_random_word( $c, $wordlen ) . "\t" . 1 .
+"\n";
}
}
}
my $outfile = shift;
my $count = shift;
my $wordlen = shift;
$outfile or die "usage: $0 outfile count wordlen\n";
$count or die "usage: $0 outfile count wordlen\n";
print "generating test file '$outfile' with count '$count'\n";
create_test_file($outfile, $count, $wordlen);
print "file size=", -s $outfile, "\n";
I generated three biggish test files (size 35,152,000 bytes) with:
perl gen-llil.pl big1.txt 200 3
perl gen-llil.pl big2.txt 200 3
perl gen-llil.pl big3.txt 200 3
then ran the Perl version:
perl llil.pl big1.txt big2.txt big3.txt >perl.tmp
and the C++ version:
llil big1.txt big2.txt big3.txt >cpp.tmp
and verified they produced the same result:
diff perl.tmp cpp.tmp
Performance Results
To get the ball rolling, here are my initial performance results for the Perl and C++ solutions given above
running Strawberry Perl on my Windows Laptop PC with 8 GB RAM.
Perl (max Windows Private Bytes for the Perl process was about 2,236,648K)
> perl llil.pl big1.txt big2.txt big3.txt >perl.tmp
llil start
get_properties : 11 secs
sort + output : 74 secs
total : 85 secs
C++ (max Windows Private Bytes for the llil process was about 1,176,048K)
> llil big1.txt big2.txt big3.txt >cpp.tmp
llil start (sort version)
get_properties : 9 secs
sort + output : 7 secs
total : 16 secs
Verify correctness:
> diff cpp.tmp perl.tmp
I'd also find it interesting and educational to compare solutions to this little problem in a variety of languages.
So please feel free to respond away in your favourite language!
Some Other Rosetta Code Nodes
References Added Later
Updated: Fixed a commented out for debugging line in llil.pl one minute after posting. :)
Mar 2023: added link to follow-up Rosetta Test: Long List is Long.