memory use array vs ref to array

by dkhosla1 (Sexton)
on Sep 08, 2016 at 04:30 UTC ( [id://1171361] )

dkhosla1 has asked for the wisdom of the Perl Monks concerning the following question:

I was observing some performance issues in a 'simple' file-parsing program on Linux. It seemed to be using a very large amount of memory and then getting killed. As I investigated further, I made two observations:

- Reading a 640 MB file into an array causes perl/the OS to allocate 4.5 GB of memory. That seems very high!

- Reading the same file into the same array, but via a reference, makes it jump to over 6 GB.

The first is too high as it is, but the second is even more confounding. Why would just using a reference cause memory use to jump by 33%? I used Memory::Usage to dump the memory use (validated against 'top'). Ideally my 640 MB file would use only 640 MB ... I must be doing something obviously stupid here but can't figure it out.

#!/usr/bin/perl -w
use Memory::Usage;
use vars qw ($muse);

$muse = Memory::Usage->new();

my $name   = "bigfile";    # file size: 640MB
my $READIN = qw (<);
my $EXT;
my @data   = ();
my $d      = \@data;

die "Error opening file: $name" if ( !open $EXT, $READIN, $name );

$muse->record('Begin');

@data = <$EXT>;    # option 1
#@$d  = <$EXT>;    # option 2

close($EXT);

$muse->record('After file read');
$muse->dump();
Test result - option 1: read into an array

$ perl test_read.pl
  time     vsz (   diff)     rss (   diff)  shared ( diff)  code ( diff)     data (   diff)
     0   78300 (  78300)    2120 (   2120)    1412 ( 1412)    12 (   12)     1164 (   1164)  Begin
    12 4538324 (4460024) 4373332 (4371212)    1428 (   16)    12 (    0)  4461188 (4460024)  After file read

Test result - option 2: read into a reference to an array

$ perl test_read.pl
  time     vsz (   diff)     rss (   diff)  shared ( diff)  code ( diff)     data (   diff)
     0   78300 (  78300)    2116 (   2116)    1412 ( 1412)    12 (   12)     1164 (   1164)  Begin
    17 6461328 (6383028) 6296400 (6294284)    1428 (   16)    12 (    0)  6384192 (6383028)  After file read

Re: memory use array vs ref to array
by NetWallah (Canon) on Sep 08, 2016 at 05:59 UTC
    According to this article, each array element consumes a minimum of 24 bytes.

    Your 4.5 Gigs of memory will accommodate a maximum of 185,834 elements.

    How does this compare with the record count in your file?

    More relevant is the question - WHY do you need to read the entire file into memory?

    Typical/efficient parsing handles the file one line at a time.

    IF it is necessary to store the file in memory, would it suffice to store the file in a scalar? (You could do that by setting $/ = undef; prior to reading the file.)
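    A minimal sketch of both approaches (my illustration; 'bigfile' stands in for your file):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One line at a time: memory use stays flat no matter how big the file is.
    open my $fh, '<', 'bigfile' or die "Error opening file: $!";
    while ( my $line = <$fh> ) {
        # parse $line here
    }
    close $fh;

    # Or slurp the whole file into a single scalar: one string buffer,
    # no per-line overhead.
    open $fh, '<', 'bigfile' or die "Error opening file: $!";
    my $contents;
    { local $/; $contents = <$fh>; }
    close $fh;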

            ...it is unhealthy to remain near things that are in the process of blowing up.     man page for WARP, by Larry Wall

      According to this article, each array element consumes a minimum of 24 bytes. Your 4.5 Gigs of memory will accommodate a maximum of 185,834 elements.

      Respectfully, I think you are out by a factor of 1000 (or 1024) here. Should it not be around 185 million elements? (The ~4,460,024 KB allocated, divided by 24 bytes per element, gives about 185.8 million.)

        There are 22,186,287 lines in the file, so about 22M records.
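        For scale (my arithmetic, not from the thread), the numbers reported above work out to roughly 200 bytes of process memory per line:

        use strict;
        use warnings;

        # Back-of-envelope from the figures reported earlier (assumption:
        # the option-1 data segment of 4,461,188 KB is dominated by @data).
        my $bytes = 4_461_188 * 1024;    # data segment after the read
        my $lines = 22_186_287;          # lines in "bigfile"
        printf "%.0f bytes per line\n", $bytes / $lines;    # ~206 bytes/line
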
Re: memory use array vs ref to array
by dasgar (Priest) on Sep 08, 2016 at 06:11 UTC

    Since you seem to want to put the contents of the file into an array and are experiencing memory issues, you may want to check out Tie::File. From its description:

    The file is not loaded into memory, so this will work even for gigantic files.

    Also, you can check out the memory section of its documentation for information on controlling part of the memory usage for Tie::File.

    Tie::File might not be the best solution for you, but it's the first thing that popped into my mind when reading your question.
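    A minimal sketch, assuming the file is named 'bigfile' and using the documented 'memory' option to cap Tie::File's read cache:

    use strict;
    use warnings;
    use Tie::File;

    # The array is backed by the file; records are read on demand,
    # so the whole file is never held in memory at once.
    tie my @lines, 'Tie::File', 'bigfile', memory => 20_000_000
        or die "Cannot tie 'bigfile': $!";

    print scalar @lines, " lines\n";    # record count (scans the file once)
    print $lines[0], "\n";              # first record, fetched from disk

    untie @lines;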

Re: memory use array vs ref to array
by kcott (Archbishop) on Sep 08, 2016 at 22:58 UTC

    G'day dkhosla1,

    Welcome to the Monastery.

    Firstly, I was unable to repeat your exact tests because of issues with Memory::Usage. The module fails to install on many systems (including mine: Mac OS X), something a little more stringent testing from the author would have flagged; see "Bug #83323 for Memory-Usage: Mark certain OS as unsupported" (raised three and a half years ago) for more on this.

    However, I was interested in what you reported, and so ran different tests using Devel::Size. I tested the array much like you:

    my @data = <$fh>;

    I tested the arrayref in two different ways:

    my $data_ref;
    @$data_ref = <$fh>;

    and

    my $data_ref = [ <$fh> ];

    In ~/local/dev/test_data, I have a series of files I use for volume testing. Each consists of records of exactly 100 bytes (99 'X' characters plus a newline). They range in size from 1,000 to 10,000,000,000 bytes. I used the following for testing (a thousand, a million and a billion bytes):

    $ ls -lSr text_?_1
    -rw-r--r--  1 ken  staff        1000  8 Feb  2013 text_K_1
    -rw-r--r--  1 ken  staff     1000000  8 Feb  2013 text_M_1
    -rw-r--r--  1 ken  staff  1000000000  8 Feb  2013 text_G_1

    Here's the test code:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;
    use autodie qw{:all};

    use Devel::Size qw{size total_size};

    {
        open my $fh, '<', $ARGV[0];
        my @data = <$fh>;
        print 'size(\@data): ', size(\@data);
        print 'total_size(\@data): ', total_size(\@data);
    }

    {
        open my $fh, '<', $ARGV[0];
        my $data_ref;
        @$data_ref = <$fh>;
        print 'size($data_ref): ', size($data_ref);
        print 'total_size($data_ref): ', total_size($data_ref);
    }

    {
        open my $fh, '<', $ARGV[0];
        my $data_ref = [ <$fh> ];
        print 'size($data_ref): ', size($data_ref);
        print 'total_size($data_ref): ', total_size($data_ref);
    }

    Here are the test results:

    $ pm_1171361_mem_use_array.pl ~/local/dev/test_data/text_K_1
    size(\@data): 144
    total_size(\@data): 1494
    size($data_ref): 144
    total_size($data_ref): 1494
    size($data_ref): 144
    total_size($data_ref): 1494

    $ pm_1171361_mem_use_array.pl ~/local/dev/test_data/text_M_1
    size(\@data): 80064
    total_size(\@data): 1420366
    size($data_ref): 80064
    total_size($data_ref): 1420366
    size($data_ref): 80064
    total_size($data_ref): 1420366

    $ pm_1171361_mem_use_array.pl ~/local/dev/test_data/text_G_1
    size(\@data): 80000064
    total_size(\@data): 1420322314
    size($data_ref): 80000064
    total_size($data_ref): 1420322314
    size($data_ref): 80000064
    total_size($data_ref): 1420322314

    As you can see, the sizes of the variables are identical regardless of whether arrays or arrayrefs were used.

    While the variables are only a little over 40% greater than the raw data size, this doesn't take into account the memory used by the entire process (which is what you were measuring). The 1 kB and 1 MB tests finished almost instantaneously; the 1 GB tests took about 8 seconds each (measured very roughly by counting in my head) and total available system memory (determined very roughly by inspection) dropped from ~3.5 GB to ~0.5 GB for each run. Although a little smaller, this does appear to be at least of the same order of magnitude as what you report.
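    A rough check of that overhead (my own sketch, not part of the original tests): most of the extra space is per-line scalar overhead, which Devel::Size can show directly for a single 100-byte line:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Devel::Size qw{total_size};

    # One record exactly as in the test files: 99 'X's plus a newline.
    my $line = ( 'X' x 99 ) . "\n";

    # total_size of one such scalar shows the fixed per-line overhead
    # on top of the 100 bytes of data.
    print 'total_size of one 100-byte line: ', total_size( \$line ), "\n";

    # Scaling up: per-line size times 10 million lines, plus ~8 bytes per
    # element for the array's pointer block, should land near the
    # 1,420,322,314 bytes reported above (exact figures vary by perl build).
    print 'estimate for 10_000_000 lines: ',
        10_000_000 * ( total_size( \$line ) + 8 ), "\n";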

    I suggest you take my test code, run it with your "bigfile", and see what results you get. I recommend that you run it at least a few times to check that you're getting consistent results.

    — Ken

      Thanks Ken (and others who responded). I will try your code and get back with the results. For others, to address two questions asked:

      - I am looking at overall process usage because, at the end of the day, that is what matters (the OS imposes limits before it kills the process). However, in this simple test I was not doing any processing.

      - I have to slurp the whole file because in the real code the processing time is high, and we don't want to leave the network file handle open for minutes and hours if possible.

        If you need to slurp a big file and use minimum memory, slurp it as a string:

        open FILE, '<', ....;
        my $bigstring;
        do{ local $/; $bigstring = <FILE>; };

        ## NOTE: not  my $bigstring = do{ local $/; <FILE> };
        ## That consumes double the memory of the above (the do block's
        ## return value is copied into $bigstring).

        You can then easily process the file line-by-line, by opening the big string as a memory file:

        open RAMFILE, '<', \$bigstring;
        while ( <RAMFILE> ) {
            ### process in the normal way.
        }

        However, if your long running processing needs access to the lines as an array, then that could be a very inconvenient form for your task.

        You might be tempted to build an index into the bigstring, something like this:

        my( $p, @index ) = 0;
        $index[ ++$p ] = tell( RAMFILE ) while <RAMFILE>;

        ## Then to randomly access line $n of the file
        my $nthLine = substr( $bigstring, $index[ $n ], $index[ $n+1 ] - $index[ $n ] );

        The problem is that the index will occupy almost as much memory as your original array of lines (each offset in @index is a full Perl scalar, with all the usual per-element overhead), so you're not just back to square one, but worse off.

        However, you can build an index that occupies far less space:

        my( $p, $index ) = ( 0, "\0" );
        vec( $index, ++$p, 32 ) = tell( RAMFILE ) while <RAMFILE>;

        ## then to randomly access line $n of the file
        my $nthLine = substr( $bigstring, vec( $index, $n, 32 ), vec( $index, $n+1, 32 ) - vec( $index, $n, 32 ) );

        Putting it all together:

        #! perl -slw
        use strict;

        my $bigstring; do{ local $/; $bigstring = <> }; close ARGV;
        print length $bigstring;

        open RAMFILE, '<', \$bigstring or die $!;

        my( $p, $index ) = ( 0, chr(0) x ( 4 * 10e7 ) );
        vec( $index, ++$p, 32 ) = tell( RAMFILE ) while <RAMFILE>;

        ## print the 500,000th line of the file
        my $n = 500_000;
        print substr( $bigstring, vec( $index, $n, 32 ), vec( $index, $n+1, 32 ) - vec( $index, $n, 32 ) );

        <STDIN>; ## pause to check memory size
        __END__
        [16:02:50.49] C:\test>dir test.dat
         Volume in drive C is Local Disk
         Volume Serial Number is 8C78-4B42

         Directory of C:\test

        15/09/2016  23:47     1,020,000,000 test.dat
                       1 File(s)  1,020,000,000 bytes
                       0 Dir(s)  379,695,529,984 bytes free

        [16:02:54.35] C:\test>wc -l test.dat
        10000000 test.dat

        [16:03:00.02] C:\test>head test.dat
        !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        ####################################################################################################
        $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
        %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
        ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
        ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
        ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
        ****************************************************************************************************

        [16:03:03.22] C:\test>1171361 test.dat
        1010000000
        5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
        # 1,774MB
        [16:03:22.93] C:\test>

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Hi Ken, follow-up on the test results for the 3 scenarios. I added the OS memory usage in each case (which is the real issue). The last option ( <$fh> ) seems to be a little faster, but memory usage is similar to what Memory::Usage shows. I am going to try the 'slurp string' approach next, as suggested by BrowserUK. Also interesting that using the reference (T2) uses 15% more RAM.
        T1: size(\@data): 177490392  total_size(\@data): 1837604848

          PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
         6569 xxxxxxx  25   0  6355m 6.0g 1592 R 100.0 51.5  0:46.67 perl

        T2: size($data_ref): 177490392  total_size($data_ref): 1837604848

          PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
         6603 xxxxxxx  25   0  7208m 6.9g 1592 R 100.0 58.6  0:51.38 perl

        T3: size($data_ref): 177490392  total_size($data_ref): 1755984649

          PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
         6625 xxxxxxx  25   0  6315m 6.0g 1592 R 100.0 51.2  0:47.73 perl
Re: memory use array vs ref to array
by shmem (Chancellor) on Sep 18, 2016 at 16:45 UTC

    I'm not familiar with Memory::Usage - but the sizes shouldn't differ.
    The containers of arrays are exactly the same, be they arrays declared with our, my, or anonymous arrays assigned to a (global/package scoped/pure lexical) variable.
    The only difference between these arrays is the label/accessor tacked onto the container, and the place where they are annotated (symbol table, lexical pad, variable).
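    A quick way to see this (my sketch, reusing Devel::Size from Ken's tests above):

    use strict;
    use warnings;
    use Devel::Size qw{total_size};

    our @pkg  = ( 'a' .. 'z' );    # package (global) array
    my  @lex  = ( 'a' .. 'z' );    # lexical array
    my  $anon = [ 'a' .. 'z' ];    # anonymous array behind a reference

    # All three containers report the same size; only the label
    # (symbol table entry, pad entry, or reference) differs.
    print total_size( \@pkg ), "\n";
    print total_size( \@lex ), "\n";
    print total_size( $anon ), "\n";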

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
