Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Re: Rosetta Code: Long List is Long

by choroba (Cardinal)
on Nov 30, 2022 at 23:25 UTC ( #11148466=note: print w/replies, xml ) Need Help??


in reply to Rosetta Code: Long List is Long

As usually, if speed is your concern, you can trade it for space.
#!/usr/bin/perl use warnings; use strict; use feature qw{ say }; warn "start\n"; my $tstart1 = time; my (%by_count, %by_word); while (<>) { chomp; my ($k, $v) = split /\t/, $_, 2; if (exists $by_word{$k}) { $v += $by_word{$k}; delete $by_count{ $by_word{$k} }{$k}; } undef $by_count{$v}{$k}; $by_word{$k} = $v; } my $tend1 = time; warn "get properties: ", $tend1 - $tstart1, " secs\n"; my $tstart2 = time; for my $count (sort { $b <=> $a } keys %by_count) { say "$_\t$count" for sort keys %{ $by_count{$count} }; } my $tend2 = time; warn "sort + output: ", $tend2 - $tstart2, " secs\n"; warn "total: ", $tend2 - $tstart1, " secs\n";

Comparison?

llil start get_properties : 13 secs sort + output : 85 secs total : 98 secs start get properties: 21 secs sort + output: 25 secs total: 46 secs
The output is identical.

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Replies are listed 'Best First'.
Re^2: Rosetta Code: Long List is Long
by eyepopslikeamosquito (Bishop) on Dec 01, 2022 at 00:16 UTC

    Excellent work choroba! On my machine I see:

    get properties: 17 secs sort + output: 18 secs total: 35 secs
    more than twice as fast as my original Perl version of 85 secs ... with Windows Private Bytes increasing from 2,236,648K to 3,516,476K.

Re^2: Rosetta Code: Long List is Long -- dualvar
by marioroy (Parson) on Dec 02, 2022 at 04:07 UTC

    The following is my fun spin using dualvar, based on choroba.pl. I tried minimizing memory consumption by using one hash and one array. The memory consumption is similar to llil.pl and performance slightly faster than choroba.pl.

    See also, parallel solution update.

    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; use Scalar::Util qw{ dualvar }; warn "start\n"; my $tstart1 = time; our (%by_word, @data); while (<>) { chomp; my ($k, $v) = split /\t/, $_; $by_word{$k} += $v; } my $tend1 = time; warn "get properties: ", $tend1 - $tstart1, " secs\n"; my $tstart2 = time; while (my ($k, $v) = each %by_word) { push @data, dualvar($v, $k); } # output array of dualvars; sorted by number, string say "$_\t".(0+$_) for sort { $b <=> $a } sort { $a cmp $b } @data; my $tend2 = time; warn "sort + output: ", $tend2 - $tstart2, " secs\n"; warn "total: ", $tend2 - $tstart1, " secs\n";
      Interesting!

      What really surprised me was how slow it became once I tried to sort the data just once:

      sort { $b <=> $a || $a cmp $b } @data;
      Before:
      sort + output: 19 secs
      After:
      sort + output: 32 secs

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        What really surprised me was how slow it became once I tried to sort the data just once

        Agreed! Not sure if this is a bug or a feature but I found it very surprising! On my machine, when testing Discipulus' superb one liner just now:

        perl -lae "$r{$F[0]}+=$F[1]}{print qq($_\t$r{$_})for sort{$r{$b}<=>$r{ +$a}||$a cmp $b}keys %r" big1.txt big2.txt big3.txt >oneliner.tmp

        I almost fell off my chair when the execution time dropped from 92 seconds to 38 seconds (with identical correct output) simply by changing:

        sort{$r{$b}<=>$r{$a}||$a cmp $b}keys %r
        to:
        sort{$r{$b}<=>$r{$a}}sort{$a cmp $b}keys %r

        After picking myself up off the floor, I ran:

        perl -MO=Terse -e "sort{$r{$b}<=>$r{$a}||$a cmp $b}keys %r" LISTOP (0x299c080) leave [1] OP (0x299c050) enter COP (0x299c0c0) nextstate LISTOP (0x299c120) sort OP (0x299c160) pushmark UNOP (0x2882ed0) null LISTOP (0x2882fb0) scope COP (0x2882ff0) null [195] UNOP (0x2883050) null LOGOP (0x2883088) or BINOP (0x28831e8) ncmp [5] UNOP (0x26610d0) null [150] UNOP_AUX (0x299c010) multideref OP (0x26611b8) null [7] UNOP (0x2883228) null [150] UNOP_AUX (0x299bfd0) multideref OP (0x2661098) null [7] BINOP (0x28830c8) scmp [8] UNOP (0x2883178) null [14] PADOP (0x28831b0) gvsv GV (0x18a8c0) +*a UNOP (0x2883108) null [14] PADOP (0x2883140) gvsv GV (0x2655320) + *b UNOP (0x2882f08) keys [11] UNOP (0x2882f40) rv2hv [10] PADOP (0x2882f78) gv GV (0x2654ae0) *r

        which is 25 OPs ... while:

        perl -MO=Terse -e "sort{$r{$b}<=>$r{$a}}sort{$a cmp $b}keys %r" LISTOP (0x2940c18) leave [1] OP (0x2940c90) enter COP (0x2940bb8) nextstate LISTOP (0x2940b78) sort OP (0x2940c58) pushmark UNOP (0x2990018) null LISTOP (0x2940d38) scope COP (0x2940d78) null [195] BINOP (0x2940dd8) ncmp [5] UNOP (0x2644940) null [150] UNOP_AUX (0x298ff68) multideref OP (0x2644a28) null [7] UNOP (0x2940e18) null [150] UNOP_AUX (0x298ff28) multideref OP (0x2644908) null [7] LISTOP (0x298ffa8) sort OP (0x298ffe8) pushmark UNOP (0x2940ad0) keys [11] UNOP (0x2940b08) rv2hv [10] PADOP (0x2940b40) gv GV (0x26344d8) *r

        and (update):

        perl -MO=Terse -e "sort{$r{$b}<=>$r{$a}}sort keys %r" LISTOP (0x29e2518) leave [1] OP (0x29bda60) enter COP (0x29e2558) nextstate LISTOP (0x29e25b8) sort OP (0x29e25f8) pushmark UNOP (0x29e2628) null LISTOP (0x29e2778) scope COP (0x29e27b8) null [195] BINOP (0x29e2818) ncmp [5] UNOP (0xed31b0) null [150] UNOP_AUX (0x29bda20) multideref OP (0xed3298) null [7] UNOP (0x29e2858) null [150] UNOP_AUX (0x29bd9e0) multideref OP (0xed3178) null [7] LISTOP (0x29e2660) sort OP (0x29e26a0) pushmark UNOP (0x29e26d0) keys [8] UNOP (0x29e2708) rv2hv [7] PADOP (0x29e2740) gv GV (0xec4098) *r

        are just 20 OPs. I'm out of my depth here, so we might need a dave_the_m-class Perl internals guru to get to the bottom of this mystery. :)

        Update: I believe the second example above can be shortened from:

        sort{$r{$b}<=>$r{$a}}sort{$a cmp $b}keys %r
        to simply:
        sort{$r{$b}<=>$r{$a}}sort keys %r
        ...also, for correctness, I think we should further add:
        use sort 'stable';
        at the top of the program because we are relying on a stable sort for this trick to work. Further update: Oops, this is not necessary, didn't read far enough in the sort docs, "The default sort has been stable since v5.8.0, and given this consistent behaviour for almost two decades, everyone has come to assume stability".

        Updated: fixed typo, added -e after -MO=Terse in examples above. Added comment about stable sort.

Re^2: Rosetta Code: Long List is Long
by Anonymous Monk on Dec 01, 2022 at 01:53 UTC

    On macOS, exiting the script requires 15 additional seconds. This is a great test case for comparing "my" vs "our".

    my (%by_count, %by_word); # slow cleanup/exiting $ time perl choroba.pl big1.txt big2.txt big3.txt >cpp.tmp start get properties: 21 secs sort + output: 25 secs total: 46 secs real 1m2.072s user 1m0.182s sys 0m1.885s
    our (%by_count, %by_word); # fast cleanup/exiting $ time perl choroba.pl big1.txt big2.txt big3.txt >cpp.tmp start get properties: 21 secs sort + output: 25 secs total: 46 secs real 0m47.062s user 0m45.505s sys 0m1.549s
      What Perl version do you run? The final garbage collecting takes about 6 secs on my Linux machine in both 5.26.1 and blead.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        > What Perl version do you run?

        On macOS, Perl v5.30.2 via perlbrew. The system Perl v5.18.2 is also 15 seconds apart between "my" and "our". Garbage collection is faster on Linux.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11148466]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (3)
As of 2023-02-01 02:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?