Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Re: Out of memory problems

by periapt (Hermit)
on Oct 21, 2004 at 12:36 UTC ( #401138=note: print w/ replies, xml ) Need Help??


in reply to Out of memory problems


I think TedPride has the right idea. Setting $INPUT_RECORD_SEPERATOR to undefined (undef $/) has the effect of causing the while(<IN> statement to slurp the entire three+ gig file at once. From the docs

"Entirely undefining $/ makes the next line input operation slurp in the remainder of the file as one scalar value"
Set $/ to some reasonable size, maybe to satisfy memory limitations. Then you can match blocks of the file in much the way TedPride describes. Something like this

#undef $/; $/ = \2048; # 2K blocks my $blocksz = 2048; open IN, "tmp"; open OUT, ">>$ARGV[1]"; #more efficient when parsing more than one b +lock #Extract the necessary data bits my $block01 = <IN>; my $block02 = ''; while ($block02 = <IN>) { my $block = $block01.$block02; $block =~ s/11110100.{8}(.{1520})11110100.{8}(.{464}).{1056}/$1$2 +/g; # $_ =~ s/11110100.{8}(.{1520})11110100.{8}(.{464}).{1056}/$1$2/g; # $final - pack("B*", $_); #Conver data back to original binary fo +rmat # $final = pack("B*", $block); # this is wrong $final = pack("B*", substr($block,0,$blocksz)); # this should wor +k print OUT "$final"; $final = ''; # this is strictly unnecssary but does keep variable + clean # undef $final; $block01 = substr($block,-$blocksz); # this moves the upper bloc +k down } $final = pack("B*", substr($block,-$blocksz)); # get final block print OUT "$final"; close OUT; close IN;

Update:

Corrected a couple lines in code

PJ
use strict; use warnings; use diagnostics;


Comment on Re: Out of memory problems
Download Code
Re^2: Out of memory problems
by tperdue (Sexton) on Oct 21, 2004 at 16:22 UTC
    I gave the updated code a shot, however I'm still getting extra data added at the block boundaries. It's as if instead of using what's left at the end of a boundary it's adding more that that chunk. Any ideas?? If you think it's better to correspond via email please send me a note at tp@g-c-i.net
      Actually, in looking at the code a second time, the problem is with the $final = pack("B*", $block); statement. It should read $final = pack("B*", substr($block,0,BLOCKSZ); Sorry about that. Please see amended code above. (I used a variable $blocksz in the code in place of BLOCKSZ in this discussion)

      $block = $block01.$block02 creates a single variable, $block, of size BLOCKSZ * 2 (4096 in my code). The substitution works across the read boundary of 2048 between blocks 01 and 02 for this one instance The substitution will fail if the pattern crosses the upper boundery of $block02 since the pattern is incomplete. Thus, after writing out $block01, you move $block02 in to $block01 so that the next pattern substition will catch any pattern that crosses that boundary. Actually, come to think of it, you should be assigning the upper BLOCKSZ of $block to $block02 ie. $block01 = substr($block,-BLOCKSZ).

      As for speed, you could increase the size of your blocks maybe to 32768 or 65536 or larger if you have the memory.

      You're using some pretty big sequences in the substitution regex, I wonder if that isn't your biggest bottleneck. Is it possible to break up your pattern into parts? You might pick up some speed there using several smaller substitutions rather than one big one. I'm not a regex guru (sort of a novice really) but it seems that there is the potential for a lot of backtracking in your regex and that has got to take time. Maybe one of the more experienced monks speak to that.

      The rest of the algorithm should be fairly quick. I would recommend that you move the file open operation open OUT, ">>tmp"; (and the related close op) out of your first loop. That will cut some overhead opening and closing a file. Pack and Unpack are pretty efficient so you probably can't squeeze any more out of thos ops. I'm not sure if this matters any but you don't have to undef $array each time in the first loop. There is a little overhead involved in reinitializing $array each time.
      Setting $array = '' will accomplish the same thing without forcing the loop to recreate $array each time through. Every little bit adds up particulary when a loop repeats tens of thousands of times.

      I'll have to try benchmarking this sometime. Maybe after work ... Update:
      Running a simple benchmark on the undef vs nullifying produced this (786500 is approx the number of reads necessary to absorb a file of ~3Gb in 4K chunks). The second option runs about 17% faster on the first test. And the second compare testing the open and close op ran over 900% faster even on a short run of 3 CPU seconds
      use strict; use warnings; use diagnostics; use Benchmark qw(cmpthese); cmpthese(-60,{a=>sub{for (0..786500){my $array = '1'; undef $array;}}, b=>sub{for (0..786500){my $array = '1'; $array = ''; }}}) +; cmpthese(0,{a=>sub{for (0..10){my $array = '1'; open OUT, ">>tmp"; print OUT "$array"; undef $array; close OUT;}}, b=>sub{open OUT, ">>tmp"; for (0..10){my $array = '1'; print OUT "$array"; undef $array;}}}); + __END__ Benchmark: running a, b, each for at least 60 CPU seconds... a: 62 wallclock secs (60.50 usr + 0.00 sys = 60.50 CPU) @ 1 +.69/s (n=102) b: 64 wallclock secs (62.31 usr + 0.00 sys = 62.31 CPU) @ 1 +.97/s (n=123) Rate a b a 1.69/s -- -15% b 1.97/s 17% -- Benchmark: running a, b, each for at least 3 CPU seconds... a: 11 wallclock secs ( 0.03 usr + 3.75 sys = 3.78 CPU) @ 2 +.12/s (n=8) b: 8 wallclock secs ( 0.00 usr + 3.14 sys = 3.14 CPU) @ 23 +.24/s (n=73) Rate a b a 2.12/s -- -91% b 23.2/s 998% --

      PJ
      use strict; use warnings; use diagnostics;
        Thanks for the help. I'll give this a go first thing in the AM. Unfortunately I can't break up the search/replace sequence. I wish I could. It would probably make my work alot easier. I'll let you know how it goes. Thanks again.
        I gave this a shot this morning. Data is still being added somehow. It looks like at the memory boundaries data is being added. It's as if instead of adding what's left at the end of the previous boundary it's adding an entire chunk. Any ideas??

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://401138]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (10)
As of 2014-12-22 14:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (119 votes), past polls