PerlMonks
Re^2: memory use array vs ref to array

by dkhosla1 (Sexton)
on Sep 17, 2016 at 13:41 UTC [id://1172001]


in reply to Re: memory use array vs ref to array
in thread memory use array vs ref to array

Thanks Ken (and others who responded). I will try your code and get back with the results. For others, to address the two questions asked:

- I am looking at overall process usage because, at the end of the day, that is what matters: the limit at which the OS kills the process. In this simple test, however, I was not doing any processing.
- I have to slurp the whole file because in the real code the processing time is high, and we don't want to leave the network file handle open for minutes or hours if we can avoid it. (A sketch of that slurp-then-close pattern follows.)
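A minimal sketch of the pattern, with a hypothetical network-mounted path standing in for the real file:

    use strict;
    use warnings;

    # Hypothetical path; the point is to hold the handle only while reading.
    open my $fh, '<', '/net/share/big.dat' or die "open: $!";
    my @data = <$fh>;   # slurp every line at once
    close $fh;          # network file handle released immediately

    for my $line ( @data ) {
        # ... hours of slow processing here, with no handle held open ...
    }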

Replies are listed 'Best First'.
Re^3: memory use array vs ref to array
by BrowserUk (Patriarch) on Sep 17, 2016 at 15:04 UTC

    If you need to slurp a big file and use minimum memory, slurp it as a string:

    open FILE, '<', ....;
    my $bigstring;
    do{ local $/; $bigstring = <FILE>; };

    ## NOTE: not  my $bigstring = do{ local $/; <FILE> };
    ## That form consumes double the memory of the above.

    You can then easily process the file line-by-line, by opening the big string as a memory file:

    open RAMFILE, '<', \$bigstring;
    while( <RAMFILE> ) {
        ### process in the normal way.
    }
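    (tell and seek work on such in-memory handles just as they do on real files, which is what the index-building below relies on.)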

    However, if your long running processing needs access to the lines as an array, then that could be a very inconvenient form for your task.

    You might be tempted to build an index into the bigstring something like this:

    my( $p, @index ) = 0;
    $index[ ++$p ] = tell( RAMFILE ) while <RAMFILE>;

    ## Then to randomly access line $n of the file
    my $nthLine = substr( $bigstring, $index[ $n ], $index[ $n+1 ] - $index[ $n ] );
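    (Note that $index[ 0 ] is never assigned; it numifies to 0, which happens to be the offset of the first line, so the lookup also works for $n == 0.)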

    The problem is that the index will occupy almost as much memory as your original array of lines, and you're not just back to square one, but worse off.

    However, you can build an index that occupies far less space:

    my( $p, $index ) = ( 0, "\0" );
    vec( $index, ++$p, 32 ) = tell( RAMFILE ) while <RAMFILE>;

    ## then to randomly access line $n of the file
    my $nthLine = substr( $bigstring, vec( $index, $n, 32 ),
                          vec( $index, $n+1, 32 ) - vec( $index, $n, 32 ) );
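    To put rough numbers on the saving (a quick sketch, assuming Devel::Size from CPAN is available; exact figures vary by perl build):

        use strict;
        use warnings;
        use Devel::Size qw( total_size );

        # A million line offsets, stored both ways.
        my @offsets = map { $_ * 101 } 0 .. 999_999;   # one full SV per entry

        my $packed = '';
        vec( $packed, $_, 32 ) = $offsets[ $_ ] for 0 .. $#offsets;

        printf "array: %d bytes\n", total_size( \@offsets );  # tens of MB
        printf "vec:   %d bytes\n", total_size( \$packed );   # ~4MB: 4 bytes/entry

    On a 64-bit perl each array element costs roughly 32 bytes (an IV-holding SV plus its slot pointer), so the packed form is around 8x smaller.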

    Putting it all together:

    #! perl -slw
    use strict;

    my $bigstring; do{ local $/; $bigstring = <> }; close ARGV;
    print length $bigstring;

    open RAMFILE, '<', \$bigstring or die $!;

    ## Preallocate the vec string (4 bytes per potential line) so it
    ## isn't repeatedly reallocated as the index grows.
    my( $p, $index ) = ( 0, chr(0) x ( 4 * 10e7 ) );
    vec( $index, ++$p, 32 ) = tell( RAMFILE ) while <RAMFILE>;

    ## print the 500,000th line of the file
    my $n = 500_000;   ## NB: not 500,000 -- the comma there would assign just 500
    print substr( $bigstring, vec( $index, $n, 32 ),
                  vec( $index, $n+1, 32 ) - vec( $index, $n, 32 ) );

    <STDIN>; ## pause to check memory size
    __END__
    [16:02:50.49] C:\test>dir test.dat
     Volume in drive C is Local Disk
     Volume Serial Number is 8C78-4B42

     Directory of C:\test

    15/09/2016  23:47     1,020,000,000 test.dat
                   1 File(s)  1,020,000,000 bytes
                   0 Dir(s)  379,695,529,984 bytes free

    [16:02:54.35] C:\test>wc -l test.dat
    10000000 test.dat

    [16:03:00.02] C:\test>head test.dat
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
    ####################################################################################################
    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ****************************************************************************************************

    [16:03:03.22] C:\test>1171361 test.dat
    1010000000
    5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555
    # 1,774MB

    [16:03:22.93] C:\test>

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      You don't need the do if you don't use the value of the last expression.
      my $bigstring;
      {
          local $/;
          $bigstring = <FILE>;
      }

      Update: Moreover, I tried both ways on a 2GB file and didn't notice any difference in memory consumption:

        PID USER     PR NI    VIRT    RES  SHR S  %CPU  %MEM   TIME+ COMMAND
       7164 choroba  20  0 1970364 1.858g 3508 R 48.00 24.04 0:00.48 perl
       7164 choroba  20  0 1970364 1.866g 3508 S 0.990 24.15 0:00.49 perl
       7164 choroba  20  0 1970364 1.866g 3508 S 0.000 24.15 0:00.49 perl
       7164 choroba  20  0 1970364 1.866g 3508 S 0.000 24.15 0:00.49 perl
       7164 choroba  20  0 1970364 1.866g 3508 S 0.000 24.15 0:00.49 perl
       7166 choroba  20  0 1970364 1.866g 3564 S 48.51 24.15 0:00.49 perl
       7166 choroba  20  0 1970364 1.866g 3564 S 0.000 24.15 0:00.49 perl
       7166 choroba  20  0 1970364 1.866g 3564 S 0.000 24.15 0:00.49 perl
       7166 choroba  20  0 1970364 1.866g 3564 S 0.000 24.15 0:00.49 perl
       7166 choroba  20  0 1970364 1.866g 3564 S 0.000 24.15 0:00.49 perl

      Code run:

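      A minimal sketch of the two slurp variants under comparison (file name assumed):

          use strict;
          use warnings;

          # Variant 1: assign inside the block -- one copy of the data.
          open my $fh, '<', 'test.dat' or die $!;
          my $s;
          { local $/; $s = <$fh> }

          # Variant 2: take the value of the do block -- reads into a
          # temporary first, then copies (see the discussion below).
          open my $fh2, '<', 'test.dat' or die $!;
          my $t = do { local $/; <$fh2> };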
        You don't need the do if you don't use the value of the last expression.

        True. But it's habitual.

        Moreover, I tried both ways on a 2GB file and didn't notice any difference in memory consumption:

        I'm not familiar with the ins and outs of memory measurement on *nix; but I can demonstrate the difference on Windows.

        C:\test>p1
        [0]{} Perl> print mem;;
        9,432 K

        []{} Perl> open I, '<', 'test.dat'; my $s; do{ local $/; $s = <I> }; print mem;;
        997,784 K

        []{} Perl> Terminating on signal SIGINT(2)

        C:\test>p1
        [0]{} Perl> print mem;;
        9,440 K

        []{} Perl> open I, '<', 'test.dat'; my $s = do{ local $/; <I> }; print mem;;
        1,986,060 K

        []{} Perl> Terminating on signal SIGINT(2)

        As you can see, in the latter case the memory assigned to the process is double: ( 997784 - 9432 ) * 2 + 9440 = 1986144; almost exactly the 1,986,060 K measured.

        The reason is that in the latter case, the data is read into an internal mortal (temporary) scalar, and then copied from there to the named lexical before the memory attached to the temporary is freed. As the allocation is greater than (from memory) 1MB, such huge allocations are (on Windows at least) made directly from the OS's virtual memory rather than from the process's heap, and are then released directly back to the OS.

        Which is usually a good thing, but it still means you need to have double the memory available for a short while, and if you are close to the limits, that can blow the process.

        Is it possible that large allocations are also freed back to the OS on *nix, and you are measuring after it has been freed?
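        One Linux-only way to check (a sketch, assuming the same test.dat): VmHWM in /proc/self/status records the peak resident set size, so a transient double allocation stays visible even after the memory has been handed back to the OS.

            use strict;
            use warnings;

            # Peak resident set size, in kB (Linux only).
            sub peak_rss_kb {
                open my $st, '<', '/proc/self/status' or die $!;
                while ( <$st> ) { return $1 if /^VmHWM:\s+(\d+)\s+kB/ }
                return;
            }

            open my $fh, '<', 'test.dat' or die $!;
            my $s = do { local $/; <$fh> };      # the copying variant
            print peak_rss_kb(), " kB peak\n";   # should reflect ~2x the file size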


      Will try the slurp bigstringy thing next! Thx for the hint.
Re^3: memory use array vs ref to array
by dkhosla1 (Sexton) on Sep 21, 2016 at 03:59 UTC
    Hi Ken, follow-up on the test results for the 3 scenarios. I added the OS memory usage in each case (which is the real issue). The last option ( <$fh> ) seems to be a little faster, but memory usage is similar to what Memory::Usage shows. I am going to try the 'slurp string' approach suggested by BrowserUk next. Also interesting: using the reference (T2) uses 15% more RAM.
    T1: size(\@data): 177490392   total_size(\@data): 1837604848

      PID USER     PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
     6569 xxxxxxx  25  0 6355m 6.0g 1592 R 100.0 51.5  0:46.67 perl

    T2: size($data_ref): 177490392   total_size($data_ref): 1837604848

      PID USER     PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
     6603 xxxxxxx  25  0 7208m 6.9g 1592 R 100.0 58.6  0:51.38 perl

    T3: size($data_ref): 177490392   total_size($data_ref): 1755984649

      PID USER     PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
     6625 xxxxxxx  25  0 6315m 6.0g 1592 R 100.0 51.2  0:47.73 perl
