
Re^3: Out of memory problems

by periapt (Hermit)
on Oct 21, 2004 at 18:51 UTC ( #401270=note )

in reply to Re^2: Out of memory problems
in thread Out of memory problems

Actually, on looking at the code a second time, the problem is with the $final = pack("B*", $block); statement. It should read $final = pack("B*", substr($block,0,BLOCKSZ)); Sorry about that. Please see the amended code above. (In the code I used a variable $blocksz in place of the BLOCKSZ used in this discussion.)

$block = $block01.$block02 creates a single variable, $block, of size BLOCKSZ * 2 (4096 in my code). The substitution works across the read boundary at 2048 between blocks 01 and 02 for this one instance. The substitution will fail if the pattern crosses the upper boundary of $block02, since the pattern is incomplete at that point. Thus, after writing out $block01, you move $block02 into $block01 so that the next pattern substitution will catch any pattern that crosses that boundary. Actually, come to think of it, you should be assigning the upper BLOCKSZ of $block to $block01, i.e. $block01 = substr($block,-BLOCKSZ).
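The boundary handling can be sketched in a slightly different form from the code in this thread. This is only a sketch under stated assumptions: the sub name strip_markers_in_blocks and the short stand-in pattern (a 4-character marker followed by a 4-character payload) are hypothetical, the input is already a string of '0' and '1' characters, and the pattern always matches a fixed number of characters. Instead of s///g on two concatenated blocks, it scans with \G and holds back only the tail that could still be the start of a match cut off by the block boundary.

```perl
use strict;
use warnings;

# Hypothetical helper, not the code from the thread: strip a fixed-length
# marker and keep its payload, processing $bits in blocks of $blocksz chars.
sub strip_markers_in_blocks {
    my ($bits, $blocksz) = @_;
    my $pat_len = 8;    # the stand-in pattern below always matches 8 chars
    my ($buf, $out) = ('', '');
    for (my $off = 0; $off < length $bits; $off += $blocksz) {
        $buf .= substr($bits, $off, $blocksz);    # append the next block
        pos($buf) = 0;
        # Emit the text before each complete match, then the kept payload.
        while ($buf =~ /\G(.*?)1111(.{4})/gcs) {
            $out .= $1 . $2;
        }
        my $rest = substr($buf, pos($buf));       # unmatched remainder
        # A match cut off by the block boundary must start within the last
        # $pat_len - 1 chars, so hold those back and emit everything else.
        my $hold = length($rest) < $pat_len - 1 ? length($rest) : $pat_len - 1;
        $out .= substr($rest, 0, length($rest) - $hold);
        $buf  = substr($rest, length($rest) - $hold);
    }
    return $out . $buf;    # flush the final tail at end of input
}

print strip_markers_in_blocks("0000111110100000", 6), "\n";
```

Because only raw, not yet substituted input is carried over, text produced by a replacement is never rescanned, which matches what a single s///g over the whole string would do for a fixed-length pattern.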

As for speed, you could increase the size of your blocks maybe to 32768 or 65536 or larger if you have the memory.

You're using some pretty big sequences in the substitution regex, and I wonder if that isn't your biggest bottleneck. Is it possible to break up your pattern into parts? You might pick up some speed by using several smaller substitutions rather than one big one. I'm not a regex guru (sort of a novice, really), but it seems that there is the potential for a lot of backtracking in your regex, and that has got to take time. Maybe one of the more experienced monks can speak to that.

The rest of the algorithm should be fairly quick. I would recommend that you move the file open operation open OUT, ">>tmp"; (and the related close op) out of your first loop. That will cut the overhead of opening and closing the file on every pass. Pack and unpack are pretty efficient, so you probably can't squeeze any more out of those ops. I'm not sure if this matters much, but you don't have to undef $array each time in the first loop. There is a little overhead involved in reinitializing $array each time.
Setting $array = '' will accomplish the same thing without forcing the loop to recreate $array each time through. Every little bit adds up, particularly when a loop repeats tens of thousands of times.
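A rough sketch of the loop with both suggestions applied, assuming a layout that is not from the original code: the sub name bits_to_file, its parameters, and the use of lexical filehandles with read() are all illustrative choices. The output file is opened once before the loop, and $array is cleared by assignment rather than undef.

```perl
use strict;
use warnings;

# Hypothetical helper: write the bit-string form of $in_path to $out_path,
# reading $blocksz bytes at a time.
sub bits_to_file {
    my ($in_path, $out_path, $blocksz) = @_;
    open my $in,  '<', $in_path  or die "open $in_path: $!";
    open my $out, '>', $out_path or die "open $out_path: $!";   # opened once, outside the loop
    binmode $in;
    my ($chunk, $array) = ('', '');
    while (read($in, $chunk, $blocksz)) {
        $array = unpack("B*", $chunk);    # bytes -> string of '0'/'1'
        print {$out} $array;
        $array = '';                      # clear the value instead of undef
    }
    close $in;
    close $out or die "close $out_path: $!";
    return;
}
```

Called as bits_to_file($ARGV[0], "tmp", 32768), this would also pick up the larger block size suggested above.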

I'll have to try benchmarking this sometime. Maybe after work ...

Update:
Running a simple benchmark of undef versus assigning an empty string produced the results below (786500 is approximately the number of reads needed to absorb a file of ~3GB in 4K chunks). In the first test, the empty-string option (b) ran about 17% faster. In the second test, which compares opening and closing the file inside the loop against opening it once outside the loop, the open-once version ran over 900% faster, even on a short run of 3 CPU seconds.
    use strict;
    use warnings;
    use diagnostics;
    use Benchmark qw(cmpthese);

    cmpthese(-60, {
        a => sub { for (0..786500) { my $array = '1'; undef $array; } },
        b => sub { for (0..786500) { my $array = '1'; $array = ''; } },
    });

    cmpthese(0, {
        a => sub { for (0..10) { my $array = '1'; open OUT, ">>tmp"; print OUT "$array"; undef $array; close OUT; } },
        b => sub { open OUT, ">>tmp"; for (0..10) { my $array = '1'; print OUT "$array"; undef $array; } },
    });

    __END__
    Benchmark: running a, b, each for at least 60 CPU seconds...
             a: 62 wallclock secs (60.50 usr +  0.00 sys = 60.50 CPU) @  1.69/s (n=102)
             b: 64 wallclock secs (62.31 usr +  0.00 sys = 62.31 CPU) @  1.97/s (n=123)
        Rate    a    b
    a 1.69/s   -- -15%
    b 1.97/s  17%   --
    Benchmark: running a, b, each for at least 3 CPU seconds...
             a: 11 wallclock secs ( 0.03 usr +  3.75 sys =  3.78 CPU) @  2.12/s (n=8)
             b:  8 wallclock secs ( 0.00 usr +  3.14 sys =  3.14 CPU) @ 23.24/s (n=73)
        Rate    a    b
    a 2.12/s   -- -91%
    b 23.2/s 998%   --

use strict; use warnings; use diagnostics;

Replies are listed 'Best First'.
Re^4: Out of memory problems
by tperdue (Sexton) on Oct 22, 2004 at 11:19 UTC
    I gave this a shot this morning. Data is still being added somehow. It looks like at the memory boundaries data is being added. It's as if instead of adding what's left at the end of the previous boundary it's adding an entire chunk. Any ideas??

      Looking more closely at your regex, it seems that you are replacing a sequence of 3072 characters with a sequence of 1984 characters. Thus, if there is one replacement in $block, the statement $final = pack("B*", substr($block,0,$blocksz)); will include 1088 unchecked characters from $block02. That would explain why it appears to be adding what's left at the end of the boundary. You may have to keep track of the number of substitutions performed and then calculate how many characters you need to include in the pack statement. Maybe something like ...
      my $nrrepl = $block =~ s/11110100.{8}(.{1520})11110100.{8}(.{464}).{1056}/$1$2/g;
      my $outblocksz = $blocksz - ($nrrepl * 1088);
      $final = pack("B*", substr($block,0,$outblocksz));    # this should work
      You might then have to make sure that $outblocksz is a multiple of 8. It probably will be, given the patterns you are working with.
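      The arithmetic behind those numbers can be checked with a short sketch. The sub name out_block_size and the example block size of 4096 are illustrative assumptions; the lengths are read straight off the regex above.

```perl
use strict;
use warnings;

# Lengths taken from the pattern 11110100.{8}(.{1520})11110100.{8}(.{464}).{1056}
my $matched = 8 + 8 + 1520 + 8 + 8 + 464 + 1056;   # characters consumed per match
my $kept    = 1520 + 464;                          # characters kept ($1$2)
my $removed = $matched - $kept;                    # characters dropped per match

# Hypothetical helper: output size after $nrrepl replacements in one block.
sub out_block_size {
    my ($blocksz, $nrrepl) = @_;
    return $blocksz - $nrrepl * $removed;
}

print "matched=$matched kept=$kept removed=$removed\n";
print "removed is a multiple of 8\n" if $removed % 8 == 0;
print out_block_size(4096, 1), "\n";
```

      Since 1088 is itself a multiple of 8, the result stays a multiple of 8 whenever $blocksz is one and $nrrepl is a whole number.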

      There are a couple of implicit assumptions in the code that we might examine. Is the data you are working with byte aligned and of even size? That is, is the data made up of 32-bit integers, or does it vary, say, a 4-byte integer followed by a 7-byte string, etc.? Since you are packing with 'B*', you could be introducing additional bits at the literal byte (8-bit) boundary. If the data is evenly spaced, you could set BLOCKSZ to the size of your regex; that might keep everything aligned properly.
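      The padding effect is easy to see in isolation. In this small sketch (the sub name roundtrip_bits is made up for illustration), a bit string whose length is not a multiple of 8 comes back from pack/unpack with zero bits added on the right, because pack "B*" fills out the last byte.

```perl
use strict;
use warnings;

# Hypothetical helper: pack a bit string into bytes and unpack it again.
sub roundtrip_bits {
    my ($bits) = @_;
    return unpack("B*", pack("B*", $bits));
}

print roundtrip_bits("11110100"), "\n";   # 8 bits: comes back unchanged
print roundtrip_bits("111101"), "\n";     # 6 bits: comes back as 11110100, two zero bits added
```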

      Another possibility is that when you change a sequence across the boundary between blocks 01 and 02, you introduce a sequence in block02. Your sequence is rather long and involved though so I rather assumed that wouldn't happen but I guess you should consider this as a fringe case.

      use strict; use warnings; use diagnostics;
        I did try this modification to the code, with little luck. I'm getting the correct amount of data out, but only the first 4 chunks are correct. I did notice, after printing out $outblocksz, that a few values were not multiples of 8, which shouldn't happen. There is no corruption in the data, as I've run a much slower piece of code on a smaller sample and it produced the correct data. I'll post that piece of code Monday since it's on my machine at work. Any ideas until then??
        open IN, "$ARGV[0]";
        binmode IN;
        @file = <IN>;
        close IN;
        foreach $tmp1 (@file) {
            $array = unpack("B*", $tmp1);
            $final .= $array;
            undef $array;
        }
        undef @file;
        undef $tmp1;
        $final =~ s/11110100.{8}(.{1520})11110100.{8}(.{464}).{1056}/$1$2/g; #EXTRACT USABLE DATA
        print "Finished extracting.\n\n";
        $finalbinary = pack("B*", $final); #CONVERT BACK TO BINARY
        print OUT "$finalbinary";
        close OUT;
Re^4: Out of memory problems
by tperdue (Sexton) on Oct 21, 2004 at 21:32 UTC
    Thanks for the help. I'll give this a go first thing in the AM. Unfortunately I can't break up the search/replace sequence. I wish I could. It would probably make my work a lot easier. I'll let you know how it goes. Thanks again.
