PerlMonks  

Strings and numbers: losing memory and mind.

by kyle (Abbot)
on Sep 28, 2007 at 04:04 UTC (#641464=perlmeditation)

I had a data file that might as well have been generated this way:

    for( my $i = 0; $i < 1_000_000; $i++ ) {
        print join ' ', map { int rand 256 } 0 .. 50;
        print "\n";
    }

I read my data file this way:

    my @d;
    $#d = 1_000_000; # presize the array!
    my $i = 0;
    while (<>) {
        chomp;
        $d[$i++] = [ split ];
    }

After that, I do a lot of arithmetic with the resulting array. I expected this to take a lot of memory and a long time, but I did not expect my memory usage to increase once the array was populated. Much to my dismay, my program kept growing and growing until I killed it to save its slimmer siblings.

I spent a good hour commenting out parts and printing out pieces before I got around to blaming the arithmetic, and it finally came clear to me. My huge pile of strings (hot off the text file!) was being converted to numbers.
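The conversion can be watched on a single scalar. What follows is a minimal sketch, not code from my program: it assumes the core B module, and sv_class is a made-up helper name. A scalar that starts as a plain string typically reports B::PV, and after being used in arithmetic it reports a combined type (B::PVIV or B::PVNV, depending on the perl), because the numeric value is now cached next to the string.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: name of the B class behind a scalar reference,
# e.g. B::PV (string only), B::IV (integer only), B::PVIV (string + integer).
sub sv_class {
    my ($ref) = @_;
    return ref B::svref_2object($ref);
}

my $s = "123";               # a string, as split would hand it over
my $before = sv_class(\$s);

my $sum = $s + 1;            # numeric use caches the number inside $s
my $after = sv_class(\$s);

my $n = int "123";           # converted up front: no string buffer kept
print "string: $before -> $after\n";
print "int:    ", sv_class(\$n), "\n";
```

Multiply that per-scalar upgrade by 50 million elements and the growth I saw during processing is no longer mysterious.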

So now my read loop has this in it:

$d[$i++] = [ map { int $_ } split ];

As a result, my program uses a lot less memory, and, more importantly, it does not grow.
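To compare the two read loops on a small scale, here is a sketch under a few assumptions: the data is a 100-line in-memory sample rather than my real file, the core B module is used to look at each element, and the names load_rows, count_dual and sum_rows are invented for the illustration. It counts how many elements carry both a string buffer and a cached number before and after a pass of arithmetic.

```perl
use strict;
use warnings;
use B ();

# Read whitespace-separated fields from a filehandle into an array of rows.
# With $numify true, every field is converted with int() as it is read.
sub load_rows {
    my ($fh, $numify) = @_;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split ' ', $line;
        @fields = map { int $_ } @fields if $numify;
        push @rows, \@fields;
    }
    return \@rows;
}

# Count elements that hold both a string buffer and a cached number.
sub count_dual {
    my ($rows) = @_;
    my $dual = 0;
    for my $row (@$rows) {
        for my $i (0 .. $#$row) {
            my $class = ref B::svref_2object(\$row->[$i]);
            $dual++ if $class eq 'B::PVIV' || $class eq 'B::PVNV';
        }
    }
    return $dual;
}

# Stand-in for "a lot of arithmetic": add up every element.
sub sum_rows {
    my ($rows) = @_;
    my $total = 0;
    for my $row (@$rows) {
        $total += $_ for @$row;
    }
    return $total;
}

# Small sample in the same shape as the real data file.
my $data = '';
$data .= join(' ', map { int rand 256 } 0 .. 50) . "\n" for 1 .. 100;

for my $numify (0, 1) {
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $rows   = load_rows($fh, $numify);
    my $before = count_dual($rows);
    my $total  = sum_rows($rows);
    my $after  = count_dual($rows);
    printf "%s: dual-valued scalars before=%d after=%d (sum %d)\n",
        $numify ? 'int()  ' : 'strings', $before, $after, $total;
}
```

With the string loop, the count should jump from zero to the full element count after the summation; with the int() loop it should stay at zero, which is the "does not grow" behaviour described above.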

Re: Strings and numbers: losing memory and mind.
by duff (Vicar) on Sep 28, 2007 at 04:46 UTC

    If you're going to be doing arithmetic with large arrays of numbers, you really want to use PDL

Re: Strings and numbers: losing memory and mind. (SV sizes)
by tye (Cardinal) on Sep 28, 2007 at 05:33 UTC

    The point wasn't clear to me immediately. Then I got it and it reminded me of a point ysth mentioned recently: Using an integer as a string usually makes for a considerably larger (in memory) scalar than using a string as a number. So your fix will cause a much bigger problem if you actually used your numbers as both numbers and strings (except it doesn't appear to for integers). Just FYI.

    A string, "1.2", likely gets stored in 4 bytes (plus the SV overhead) and then caching the numeric value adds another 16 bytes (or so). A number, 1.2, gets stored in 8 bytes (or so) and then caching the string value causes a string buffer to be allocated that is large enough to hold any stringified number and that (rather larger) buffer remains attached to the scalar (holding the stringified version of the numeric value).

    Note that these considerations usually don't matter. I'm even a little curious how much heap fragmentation played a role in your situation (since each SV has to be reallocated when the numeric value is cached, I assume).

    I have never had to resort to such tricks and, when I've needed to reduce memory footprint, I've resorted to techniques that (I believe) actually have a more significant impact. Your tactic strikes me as something that is usually a waste to worry about before actually determining that it matters in the particular situation. A form of premature micro-optimization.

    But I'm also glad to learn of these things, just in case I do run into cases where they point to the easiest way to get enough reduction in memory usage for some practical gain.

    - tye        

      Using an integer as a string usually makes for a considerably larger (in memory) scalar than using a string as a number.

      How does this fit in with the size that Devel::Size reports? If you can trust it, a stringified number and a numified string result in the same size (32 bytes for the value 123, on a 32-bit Perl).

      use Devel::Size qw(size);
      use Devel::Peek;

      sub info {
          print Dump($_[0]);
          print "size = ",size($_[0])," ($_[1])\n\n";
      }

      $num = 123; # or int("123")
      info($num, "integer");

      $num .= "";
      info($num, "integer stringified");

      $str = "123";
      info($str, "string");

      $str += 0;
      info($str, "string numified");

      $str += 45678900;
      info($str, "... with bigger integer");

      $str .= "";
      info($str, "... re-stringified");

      outputs something like

      SV = IV(0x816983c) at 0x8192124
        REFCNT = 1
        FLAGS = (IOK,pIOK)
        IV = 123
      size = 16 (integer)

      SV = PVIV(0x8150b10) at 0x8192124
        REFCNT = 1
        FLAGS = (POK,pPOK)
        IV = 123
        PV = 0x81c9f18 "123"\0
        CUR = 3
        LEN = 4
      size = 32 (integer stringified)

      SV = PV(0x814fb90) at 0x81ca934
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0x81950f0 "123"\0
        CUR = 3
        LEN = 4
      size = 28 (string)

      SV = PVIV(0x8150b20) at 0x81ca934
        REFCNT = 1
        FLAGS = (IOK,pIOK)
        IV = 123
        PV = 0x81950f0 "123"\0
        CUR = 3
        LEN = 4
      size = 32 (string numified)

      SV = PVIV(0x8150b20) at 0x81ca934
        REFCNT = 1
        FLAGS = (IOK,pIOK)
        IV = 45679023
        PV = 0x81950f0 "123"\0
        CUR = 3
        LEN = 4
      size = 32 (... with bigger integer)

      SV = PVIV(0x8150b20) at 0x81ca934
        REFCNT = 1
        FLAGS = (POK,pPOK)
        IV = 45679023
        PV = 0x81950f0 "45679023"\0
        CUR = 8
        LEN = 12
      size = 40 (... re-stringified)

      Devel::Peek shows a comparable resulting structure for "number stringified" and "string numified" (with respect to IV and PV usage). Also, one can observe that the overall size gets larger if you make the number bigger, and then re-stringify the variable...

      Anyhow, does your comment mean that Devel::Size is not reporting the size related to the entire PV buffer allocated for the cached stringified form, but rather its currently used part only (up to and including the \0)? — which would make it a less useful tool for determining real memory usage. Actually, the size that Devel::Size reports seems to be related to the LEN in the Devel::Peek dump (which itself you can observe to increment in steps of 4, if you play around a bit). Just wondering...

        Anyhow, does your comment mean that Devel::Size is not reporting the size related to the entire PV buffer allocated

        Wow, you are actually considering believing some second-hand hearsay over numbers output by a module in black-and-white? (:

        I just restated what ysth said. It made sense to me and I trust ysth but I didn't do any experiments to validate the claims. Perhaps ysth will provide some details. It certainly could be a "problem" only on a different version of Perl than what you tested on, for example. Or it may have been a misinterpretation of some data on ysth's part; after all, it was a rather casual comment and so I may have erred in elevating it to the level of a node, or just misinterpreted it. We'll see what others contribute.

        Thanks for testing it.

        Looking at some source code, using an NV instead of an IV likely makes the difference (which testing shows is true on my version of Perl, allocating 36 bytes for the string "1.1", roughly doubling the size of ($x=1.1).='' over ($y='1.1')+=0; not a huge difference in most situations). The code appears to pre-construct the string then allocate/copy just the required size for an IV or UV but to allocate the buffer in the SV first when converting an NV. And based on ysth's comment, I wouldn't be surprised if the NV case has changed in some development version of Perl.

        - tye        
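The NV-versus-IV comparison above can be repeated on whatever perl is at hand without Devel::Size. This is only a sketch: it assumes the core B module and its LEN accessor for string buffers, and pv_len is a made-up helper name. The numbers it prints depend on the perl version and build, so none are quoted here.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: allocated length (LEN) of a scalar's string buffer,
# or undef when the scalar has no string slot at all.
sub pv_len {
    my ($ref) = @_;
    my $obj = B::svref_2object($ref);
    return $obj->can('LEN') ? $obj->LEN : undef;
}

my $x = 1.1;   $x .= '';    # number first, then stringified
my $y = '1.1'; $y += 0;     # string first, then numified
my $i = 123;   $i .= '';    # integer stringified, for comparison

for my $case ([ 'NV stringified',  \$x ],
              [ 'string numified', \$y ],
              [ 'IV stringified',  \$i ]) {
    my ($label, $ref) = @$case;
    my $len = pv_len($ref);
    printf "%-16s LEN = %s\n", $label, defined $len ? $len : 'n/a';
}
```

LEN covers only the string buffer, not the SV head or body, so it complements rather than replaces the totals Devel::Size reports.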

      Your tactic strikes me as something that is usually a waste to worry about before actually determining that it matters in the particular situation.

      I agree! Perhaps I should have prefaced my meditation by saying that this would not have been a problem I'd have needed to solve if the data set were not so large. As it was, reading 50 million strings took a little less than 5G of memory, and then it started eating up more during processing. Using 50 million ints instead took only about 2G of memory (and, of course, didn't grow). It's the difference between "just fits" and "won't work."

      Tracking this down was such a puzzle for me because I've really never had to worry about it before. Strings and numbers frolic freely together. Perl worries about the details, and I don't.

Re: Strings and numbers: losing memory and mind.
by syphilis (Canon) on Sep 28, 2007 at 08:32 UTC
    Interesting post ... it prompted me to run the following:
        use warnings;
        use Devel::Size qw(size);

        $str = "123";
        $str += 0; # numify $str
        $num = int("123");
        print size($str), " ", size($num), "\n";
    On Win32 that outputs 32 16 (56 24 on a 64-bit build of perl). I guess that's a reliable way of determining the memory consumed by both $str (32 bytes) and $num (16 bytes). Zat right ?

    Cheers,
    Rob
Re: Strings and numbers: losing memory and mind.
by oha (Friar) on Sep 28, 2007 at 13:48 UTC
    you have 1 million undefs before reading, and 1 million strings after reading.

    those strings average 150-200 bytes each, so they are a lot bigger than undef

    Oha

      What you write is mostly true but not very relevant.

      First, a correction: I originally wound up with 50 million strings averaging about 2.57 characters each (note that I split each line as it's read).

      The fact that I'm replacing lots of undef with lots of strings isn't the problem. I observed that process, and when it was done, I had a certain amount of memory used. The problem is that memory usage continued to grow as I operated on (but did not add to) the arrays I'd created.

      The problem is caused by the fact that Perl is converting all those strings to numbers for me on the fly. My solution was to force them to be numbers in the first place (instead of strings). Now, when I operate on the set, it stays the same size.

Node Type: perlmeditation [id://641464]
Front-paged by tye