Assuming you are running some flavour of Linux: could you please try the following script on your machine and tell me its output?
#!/usr/bin/perl
use warnings;
use strict;
use Devel::Size qw(size total_size);
use Encode;

# Build an ~8 MB UTF-8 string, then report its size, the total size
# of the list split() produces from it, and the process memory.
my $content = decode('UTF-8', 'tralala ' x 1E6);
print size($content), "\n";
print total_size([split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content]), "\n";
procinfo();

# Read virtual size and RSS for this process from /proc/$$/stat
# (indices 22 and 23 of the space-split line).
sub procinfo {
    my @stat;
    my $MiB = 1024 * 1024;
    if (open(STAT, '<:utf8', "/proc/$$/stat")) {
        @stat = split /\s+/, <STAT>;
        close STAT;
    }
    else {
        die "procinfo: Unable to open stat file.\n";
    }
    printf "Vsize: %3.2f MiB (%10d)\n", $stat[22] / $MiB, $stat[22];
    print "RSS : $stat[23] pages\n";
}
The only difference I see is that the 32-bit architecture takes about half the space the 64-bit one does. But still, there is a factor of 5 between the virtual memory taken and the total size of the split list.
- Perl 5.8.8 on i686 (compiled with gcc 4.3.1, -march=i686 -O2)
# ./tokenizer.pl
8000028
68000056
Vsize: 322.56 MiB ( 338231296)
RSS : 79087 pages
- Perl 5.8.8 on x86_64 (compiled with gcc 4.3.1, -march=core2 -O2)
# ./tokenizer.pl
8000048
112000096
Vsize: 537.61 MiB ( 563724288)
RSS : 130586 pages
- Perl 5.8.8 on x86_64 (compiled with gcc 4.1.2, -O2)
$ tokenizer.pl
8000048
112000096
Vsize: 539.42 MiB ( 565620736)
RSS : 130571 pages
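Doing the arithmetic on the 64-bit figures: total_size reports 112000096 bytes for roughly 2 million tokens, i.e. about 56 bytes per token, even though most tokens carry at most 8 characters of payload. Here is a minimal sketch to see that per-token cost directly (same split and the same Devel::Size functions as above, just on a scaled-down sample string):

#!/usr/bin/perl
use warnings;
use strict;
use Devel::Size qw(size total_size);
use Encode;

# Same split as above on a small sample: the per-token cost is the
# scalar bookkeeping, not the character data.
my $content = decode('UTF-8', 'tralala ' x 10);
my @tokens  = split m{(\p{Z}|\p{IsSpace}|\p{P})}ms, $content;
printf "%d tokens, %d bytes => %.1f bytes/token\n",
    scalar @tokens, total_size(\@tokens), total_size(\@tokens) / @tokens;
printf "one captured space alone: %d bytes\n", size($tokens[1]);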
So no matter what, memory usage ends up about five times higher than it should be (roughly 30 MB of that is Perl's own overhead, present even when the data is just a few bytes). Which makes me very unhappy...
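One workaround I am considering, as an untested sketch using the same separator classes as above: iterate over the tokens with a \G match instead of letting split materialize all of them at once, so only one token scalar is alive at a time. Note it skips the empty fields split would produce between adjacent separators:

#!/usr/bin/perl
use warnings;
use strict;
use Encode;

my $content = decode('UTF-8', 'tralala ' x 1E6);
my $count   = 0;

# One token per iteration: either a single separator character (what
# split's capture group returns) or a maximal run of non-separators.
while (
    $content =~ m{\G(
                       \p{Z} | \p{IsSpace} | \p{P}
                     | (?: (?! \p{Z} | \p{IsSpace} | \p{P} ) . )+
                   )}gcsx
  )
{
    my $token = $1;    # process $token here instead of storing it
    $count++;
}
print "$count tokens\n";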