PerlMonks  

Regexes eating too much RAM

by Articuno (Beadle)
on Mar 03, 2006 at 22:49 UTC ( #534384=perlquestion )
Articuno has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

As I was asking in the CB, I have a Perl script that reads a 25 MB file into $_ (after undef'ing $/) and runs a lot of regex matches against it (like a state machine).

Now I know (thanks, diotalevi) that the regex matches are cloning my whole string... so I'd like to ask you all for suggestions on lowering the memory consumption :-)

Below is my script:

(disclaimer - this is my full code; if needed, I'll update the node to cut out unneeded parts)

(ps - the code reads all *.ms files in the directory, but currently I'm testing with only one file)

(ps2 - the examples and text were in Portuguese, but I've translated them here)

(ps3 - the stoplist is just a (very tiny) list of so-called "stopwords" - words like "de", "do", "da" (of), etc...)

# Unify name tags
# <Larry> da <Silva> <Wall> X <John> S. <Doe>
# becomes
# <Larry da Silva Wall> X <John S. Doe>

use strict;
use warnings;
use diagnostics;
use Storable;

my $debug = 0;
$" = '';

if (@ARGV < 1) {
    print <<INFO;
Usage: perl $0 [dir]
Where [dir] is the directory where the .ms files are located
INFO
    exit 1;
}

my ($dir) = @ARGV;
$dir ||= './';

open SW, "/matchsimile/stoplist";
my @stopwords = <SW>;
close SW;
@stopwords = map { split /\s/ } @stopwords;
my %stopword;
@stopword{@stopwords} = undef;

opendir DIR, $dir;
my @files = map { $dir.'/'.$_ } grep { /\.ms$/ } sort readdir(DIR);
closedir DIR;

undef $/;
my $caps  = qr/[A-Z]/;
my $texto = qr/[A-Za-zÁÀÂÃÇÉÊÍÓÔÕÚÜáàâãçéêíóôõúü\s]+/;
my @buffer;

for my $file (@files) {
    @buffer = ();
    open IN, "< $file" or die "'$file' couldn't be opened";
    $_ = <IN>;
    close IN;
    open OUT, "> $file.new";
    my $aux = select(OUT); $| = 1; select($aux);
    print STDERR "Processing $file\n";
    my $state = 'TEXT';
    my $total_size = length($_);
    s/(<[^>]*)<([^<>]*>)/$1$2/g;   # <<foo> <bar> baz <quux>> becomes <foo bar baz quux>
    study $_;
    /^/gc;                         # set pos() to the start of the string
    my $tick = time;
    until (pos($_) == $total_size) {
        print STDERR sprintf "\r%10d bytes", $total_size - pos($_)
            if time > $tick && ($tick = time);
        if ($debug) {
            print STDERR "[@buffer]";
            print STDERR snippet();
            print STDERR "\n";
        }
        if ($state eq 'TEXT') {
            if (/\G(\s*<$caps[^<>]*>\s*)/gc) { $state = 'NAME'; push @buffer, $1; next; }
            if (/\G<([^<>]*)>/gc || /\G([^<>\s]*\s*)/gc) { print OUT $1; next; }
            die "STRANGE FOO " . snippet();
        }
        if ($state eq 'NAME') {
            if (/\G(<$caps[^<>]*>\s*)/gc) { push @buffer, $1; next }
            if (/\G<([^<>]*)>/gc) { flush_name(); $state = 'TEXT'; print OUT $1; next; }
            if (/\G(\s*([A-Z]\s*\.|(?!\s)$texto(?<!\s))\s*)/gc) {
                my ($token, $subtoken) = ($1, $2);
                if ($subtoken =~ /\b(?:x|e|ou)\b/i) {
                    flush_name(); $state = 'TEXT'; print OUT $token; next;
                }
                # fold accented characters to plain ASCII, then lowercase
                $subtoken =~ tr/ÁÀÂÃÇÉÊÍÓÔÕÚÜáàâãçéêíóôõúü/AAAACEEIOOOUUaaaaceeiooouu/;
                $subtoken =~ tr/A-Z/a-z/;
                if (length($subtoken) <= 2 || exists $stopword{$subtoken}) {
                    push @buffer, $token; next;
                }
                flush_name(); $state = 'TEXT'; print OUT $token; next;
            }
            if (/\G([^<>\s]+\s*)/gc) { flush_name(); $state = 'TEXT'; print OUT $1; next; }
            die "STRANGE FOO " . snippet();
        }
        die "STRANGE FOO " . snippet();
    }
    close OUT;
}
print STDERR "\n";
exit 0;

sub snippet {
    my $text = "{" . substr($_, pos($_), 42) . "}";
    $text =~ s/[\r\n]/|/g;
    return $text;
}

sub flush_name {
    my $count = 0;
    for my $token (@buffer) {
        if (index($token, '<') >= 0) {
            if (++$count == 2) {
                my $buffer = "@buffer";
                $buffer =~ s/>([^<>]*)</$1$2/g if 0;  # (unused) kept for reference
                $buffer =~ s/>([^<>]*)</$1/g;
                print OUT $buffer;
                @buffer = ();
                return;
            }
        }
    }
    for my $tk (@buffer) {
        $tk =~ s/[<>]//g;
        print OUT $tk;
    }
    @buffer = ();
    return;
}
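The heart of the script is the pos()-anchored scanner: each /\G.../gc alternative consumes one token where the previous match ended, and a failed attempt (thanks to /c) leaves pos() in place so the next alternative can try. A minimal, self-contained sketch of that idiom, with invented token types for illustration:

```perl
use strict;
use warnings;

# Miniature of the \G/gc scanning loop above: classify a string into
# <tag> and plain-text tokens without ever splitting the input apart.
sub scan_tokens {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    until (pos($text) == length($text)) {
        if ($text =~ /\G<([^<>]*)>/gc) {   # a bracketed tag
            push @tokens, [ tag => $1 ];
            next;
        }
        if ($text =~ /\G([^<]+)/gc) {      # a run of ordinary text
            push @tokens, [ text => $1 ];
            next;
        }
        die "stuck at " . pos($text);      # e.g. a stray unclosed '<'
    }
    return @tokens;
}
```

Because every alternative is anchored with \G and uses /gc, the scanner either advances pos() or falls through to the next case, so the until loop is guaranteed to make progress or die.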
-- 6x9=42

Re: Regexes eating too much RAM
by kvale (Monsignor) on Mar 04, 2006 at 04:33 UTC
    The main way to save memory is to process your files a bit at a time. Is there any structure to the file - paragraphs, blank lines, etc. - that you can use as a self-contained unit of translation? Maybe there are one or a couple of names per line? Then read a line, translate it, write it, repeat.
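    That line-at-a-time recipe keeps memory proportional to the longest line rather than to the whole 25 MB file. A minimal sketch, assuming the per-line translation can reuse the script's nested-bracket cleanup (file names here are placeholders):

```perl
use strict;
use warnings;

# Translate a file one line at a time instead of slurping it whole,
# so memory use tracks the longest line, not the file size.
sub translate_file {
    my ($src, $dst) = @_;
    open my $in,  '<', $src or die "can't read $src: $!";
    open my $out, '>', $dst or die "can't write $dst: $!";
    while (my $line = <$in>) {
        # per-line version of the script's nested-bracket cleanup
        $line =~ s/(<[^>]*)<([^<>]*>)/$1$2/g;
        print $out $line;
    }
    close $in;
    close $out or die "close failed: $!";
}
```

    This only works if no name tag spans a line boundary; if names can wrap, a paragraph or record separator would be the unit instead.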

    -Mark

Re: Regexes eating too much RAM
by diotalevi (Canon) on Mar 04, 2006 at 06:08 UTC

    In general, the recipe is to eliminate all capture groups that operate on your large string. Beyond that, you can try to cut things off as you process them. Perl keeps a marker for where a string begins, so if you're conscientious, you can convince Perl to just advance that pointer.

    This is wasteful. When it matches, Perl makes a copy of $_ to an internal buffer so that $1 can refer back to it. Eliminate the capturing parentheses and use substr() with @- and @+ to refer back to what $1 would have contained. The documentation for @- is a good reference for you right now.

    if (/\G<([^<>]*)>/gc) { flush_name(); $state = 'TEXT'; print OUT $1; next; }

    This is efficient. You'll notice how I used 4-arg substr to directly remove the already-processed part from the front of the string.

    if (/\G<[^<>]*>/gc) { flush_name(); $state = 'TEXT'; print OUT substr $_, $-[0] + 1, $+[0] - $-[0] - 2; substr $_, 0, $+[0], ''; next; }
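    A tiny standalone illustration of the same capture-free recipe (string contents invented), recovering the tag interior from the @-/@+ offsets:

```perl
use strict;
use warnings;

# Capture-free tag extraction: no parentheses, so Perl need not copy
# the (potentially huge) target string to back $1. The tag interior is
# recovered via the @-/@+ match offsets, and 4-arg substr chops the
# consumed prefix off the front of the string.
my $s = "<hello> rest of the buffer";
my $inner = '';
if ($s =~ /\G<[^<>]*>/gc) {
    # interior = the match minus the surrounding '<' and '>'
    $inner = substr $s, $-[0] + 1, $+[0] - $-[0] - 2;
    # discard everything consumed so far
    substr $s, 0, $+[0], '';
}
print "$inner\n";    # hello
```

    One caveat: modifying the string resets its pos(), so a loop that trims the front this way can no longer compare pos($_) against the original total size; it would have to test the remaining length($_) instead.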

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Regexes eating too much RAM
by duff (Vicar) on Mar 06, 2006 at 16:21 UTC

    Perl has this nifty routine called study that might be useful to you. I don't really know, though, as every time I thought it would be useful, it wasn't.

    Just thought I'd mention it anyway, in case you found it useful,

Node Type: perlquestion [id://534384]
Approved by Corion