Tokenising a 10MB file trashes a 2GB machine
by PetaMem (Priest) on Jul 16, 2008 at 08:48 UTC
PetaMem has asked for the wisdom of the Perl Monks concerning the following question:
It seems I have, once again, stumbled across an example of Perl's "obscene memory consumption habits". Basically, I tried to tokenize a 10MB file in memory, and when that crashed my computer I took a closer look:
Take emails (plain text, no HTML, no attachments), concatenate them into a 10MB file, then do something like
Using Devel::Size to find the culprit gives 10485544 bytes (file size) and 370379304 bytes (result of split). While these two numbers are within expectation, the script takes more than 1.8GB of RAM before it is able to print the second number, which I think is somewhat insane. This is 64-bit Perl 5.8.8 on x86_64.
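The measurement described above can be sketched roughly as follows. This is a minimal sketch, not the original script: the split pattern (plain whitespace), the variable names, and the small synthetic string standing in for the 10MB mail file are all assumptions. Devel::Size is a CPAN module, so the sketch only reports sizes when it is installed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the 10MB mail file: a small synthetic text.
my $text = join "\n", map { "word$_ another line of mail text" } 1 .. 1000;

# Assumed tokenizer: split on whitespace, holding every token in memory at once.
my @tokens = split /\s+/, $text;

# Devel::Size is not a core module; report sizes only if it can be loaded.
if (eval { require Devel::Size; 1 }) {
    printf "string: %d bytes\n", Devel::Size::size($text);
    printf "tokens: %d bytes\n", Devel::Size::total_size(\@tokens);
}

printf "%d tokens\n", scalar @tokens;
```

Note that Devel::Size reports the size of the finished data structures only; it does not show peak memory used by the process while split is running, which is the figure in question here.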
Of course I am aware of String::Tokenizer and other iterative approaches to tokenizing. I would just like to know from someone more knowledgeable about Perl's internals why there is a *hidden* memory consumption by a factor of 5 that I cannot explain. Is it something special with split? Is some wild copying happening?
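For comparison, one iterative approach of the kind mentioned above might look like the sketch below. It uses only core Perl (not String::Tokenizer), reads the input line by line, and matches one token at a time with a global match in scalar context, so the full token list is never built. The sample text, the whitespace-based token pattern, and the counting hash are assumptions for illustration; it does not answer the question of where the extra memory goes with split.

```perl
use strict;
use warnings;

# Hypothetical input; in practice this would be a handle on the mail file.
my $text = "From: a\@example.org\nSubject: hello world\n\nsome body text\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";

my %count;       # token => occurrences (illustrative per-token work)
my $total = 0;

while (my $line = <$fh>) {
    # m//g in scalar context yields one match per iteration,
    # so only the current line and current token are held at a time.
    while ($line =~ /(\S+)/g) {
        $count{$1}++;
        $total++;
    }
}
close $fh;

print "$total tokens, ", scalar(keys %count), " distinct\n";
```

Memory use here is bounded by the longest line plus whatever is kept per token (the hash, in this sketch), rather than by the total number of tokens in the file.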