
Re^2: Tokenising a 10MB file trashes a 2GB machine

by PetaMem (Priest)
on Jul 16, 2008 at 12:11 UTC (#697951)

in reply to Re: Tokenising a 10MB file trashes a 2GB machine
in thread Tokenising a 10MB file trashes a 2GB machine

Assuming you are running some flavour of Linux: could you please try the following script on your machine and post its output?

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Devel::Size qw(size total_size);
    use Encode;

    my $content = decode('UTF-8', 'tralala ' x 1E6);
    print size($content), "\n";
    print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]), "\n";
    procinfo();

    # Report this process's memory usage as read from /proc (Linux only).
    sub procinfo {
        my @stat;
        my $MiB = 1024 * 1024;
        if (open( STAT , '<:utf8', "/proc/$$/stat")) {
            @stat = split /\s+/ , <STAT>;
            close STAT ;
        }
        else {
            die "procinfo: Unable to open stat file.\n";
        }
        # $stat[22] is vsize (bytes), $stat[23] is rss (pages)
        print sprintf "Vsize: %3.2f MiB (%10d)\n", $stat[22]/$MiB, $stat[22];
        print "RSS : $stat[23] pages\n";
    }

The only difference I see is that the 32-bit architecture takes a bit over half the space that the 64-bit one takes (about 60%). But there is still a factor of 5 between the virtual memory used and the total size of the split list.

  • Perl 5.8.8 on i686 (gcc 4.3.1 compiled, -march=686 -O2)
        # ./
        8000028
        68000056
        Vsize: 322.56 MiB ( 338231296)
        RSS : 79087 pages
  • Perl 5.8.8 on x86_64 (gcc 4.3.1 compiled, -march=core2 -O2)
        # ./
        8000048
        112000096
        Vsize: 537.61 MiB ( 563724288)
        RSS : 130586 pages
  • Perl 5.8.8 on x86_64 (gcc 4.1.2 compiled, -O2)
        $
        8000048
        112000096
        Vsize: 539.42 MiB ( 565620736)
        RSS : 130571 pages

So no matter what, memory usage is always about 5 times higher than it should be (30MB of it is perl overhead, which is present even if the data is just a few bytes). Which makes me very unhappy...
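For anyone who wants to check the "factor of 5" against the numbers posted above, here is a minimal sketch that divides each reported Vsize by the corresponding total_size() figure. The figures are copied from the three runs in this post; the sub name vsize_ratio and the run labels are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Figures copied from the three runs reported in this post:
# [ label, total_size() in bytes, Vsize in bytes ]
my @runs = (
    [ 'i686, gcc 4.3.1',    68_000_056, 338_231_296 ],
    [ 'x86_64, gcc 4.3.1', 112_000_096, 563_724_288 ],
    [ 'x86_64, gcc 4.1.2', 112_000_096, 565_620_736 ],
);

# Ratio of the process's virtual size to the size Devel::Size reports
# for the split list (hypothetical helper, not part of any module).
sub vsize_ratio {
    my ($total_size, $vsize) = @_;
    return $vsize / $total_size;
}

for my $run (@runs) {
    my ($label, $total, $vsize) = @$run;
    printf "%-18s Vsize/total_size = %.2f\n",
        $label, vsize_ratio($total, $vsize);
}
```

Note that Vsize covers the whole process (interpreter, loaded modules, temporaries), so the ratio is only a coarse indicator and not a per-string cost.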

    All Perl:   MT, NLP, NLU

Replies are listed 'Best First'.
Re^3: Tokenising a 10MB file trashes a 2GB machine
by dave_the_m (Prior) on Jul 16, 2008 at 13:19 UTC
    On a 32-bit system, there is an overhead of approximately 32 bytes per string (not including the string itself). Also, if you create a list (e.g. with split) and then assign it to an array, perl may temporarily need two copies of each string (plus extra space for the large temporary stack). After the assignment the temporary copy is freed for perl to reuse, but it is not returned to the OS (so VM usage won't shrink). Given that Devel::Size itself has a large overhead, what you are seeing looks reasonable. Consider the following code:
        my $content = decode('UTF-8', 'tralala ' x 1E6);
        my @a;
        $#a = 10_000_000; # presize array
        for (1..5) {
            print "ITER $_\n";
            push @a, split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
            procinfo();
        }
    which on my system gives the following output:
        ITER 1
        Vsize: 248.18 MiB ( 260235264)
        RSS : 62362 pages
        ITER 2
        Vsize: 317.14 MiB ( 332550144)
        RSS : 80000 pages
        ITER 3
        Vsize: 393.71 MiB ( 412839936)
        RSS : 99598 pages
        ITER 4
        Vsize: 579.46 MiB ( 607612928)
        RSS : 147156 pages
        ITER 5
        Vsize: 625.23 MiB ( 655597568)
        RSS : 158895 pages
    which averages about 94MB growth per iteration, or 47 bytes per string pushed onto @a (each iteration pushes 2 million strings). Allowing 32 bytes of overhead per string (the SV and PV structures) leaves 15 bytes per string, which, allowing for the trailing \0, rounding up to a multiple of 4, malloc overhead and so on, looks reasonable.
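The arithmetic in this reply can be redone from the five Vsize figures in the output. The sketch below is only a rough check: the count of 2,000,000 strings per iteration is an assumption (1E6 copies of 'tralala' plus 1E6 captured single-space separators from split), the 32-byte overhead is the figure quoted in this reply, and the sub name avg_growth is illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Vsize in bytes after each of the five iterations, copied from the output.
my @vsize = (260_235_264, 332_550_144, 412_839_936, 607_612_928, 655_597_568);

# Assumption: each split yields 1E6 words plus 1E6 captured separators.
my $strings_per_iter = 2_000_000;

# Overhead per string on 32-bit (SV + PV structures), as quoted in the reply.
my $sv_overhead = 32;

# Average growth between consecutive measurements (hypothetical helper).
sub avg_growth {
    my @v = @_;
    return ($v[-1] - $v[0]) / (@v - 1);
}

my $growth     = avg_growth(@vsize);
my $per_string = $growth / $strings_per_iter;

printf "average growth  : %.2f MiB per iteration\n", $growth / (1024 * 1024);
printf "bytes per string: %.1f\n", $per_string;
printf "left for string body, padding and malloc overhead: %.1f bytes\n",
    $per_string - $sv_overhead;
```

With the exact byte counts the per-string figure comes out slightly above the 47 bytes quoted, which appears to come from reading 94 MiB as 94 million bytes; the conclusion is the same either way.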


Re^3: Tokenising a 10MB file trashes a 2GB machine
by moritz (Cardinal) on Jul 16, 2008 at 12:22 UTC
    I have Debian GNU/Linux on a boring 32 bit i386 machine.
    perl 5.8.8:
        8000028
        68000056
        Vsize: 322.68 MiB ( 338354176)
        RSS : 79112 pages
    perl 5.10.0:
        8000036
        84000100
        Vsize: 270.80 MiB ( 283951104)
        RSS : 68365 pages
