Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Multi-line Regex Performance

by pboin (Deacon)
on Nov 01, 2005 at 15:35 UTC ( #504605=perlquestion: print w/replies, xml ) Need Help??

pboin has asked for the wisdom of the Perl Monks concerning the following question:

I've got a rather large (327M) file of multiple line records to process. I'm picking up each record, pulling the key out, and then looking into a hash to see if it's a 'keeper' or not.

Records look like this:

##REPORT 01 A ARCUR1GMU# 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 03 FROM FVENDAP1 04 CODE: 10 RECORDING & REPORTING GMU / NON-WHQ ONLY RE +PORTS 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02 ##REPORT 01 A ARCUR10000 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02

I've got a regex that technically works, but the performance is very poor (over 1 sec. per record). I wondered about slurping the whole file in, but I'm not swaping at all, so that 327M is all in memory -- shouldn't be a problem, right?. My best hunch is that the regex is smarter than I am, and it is doing some heavy-duty backtracking that I'm not understanding.

I think the regex is starting to capture at double-hashes. Then, continue non-greedy matching anything (including newline), until a positive-lookahead of either a) more double-hashes, denoting a new record or b) EOF.

My code looks like this:

#!/usr/bin/perl use strict; # input file of ##REPORT cards my $str = do {local $/ = undef; <STDIN>}; # keepers my %keeplist; open (RIDS, '<', 'rids_that_have_versions.txt') or die; while (<RIDS>) { chomp; $keeplist{$_} = 1;} my @cards; while ( $str =~ /(##.*?)(?=(##|\Z))/gs ) { my $rid = substr($1, 19, 10); print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n"; }

Update:

This is definitely a non-linear problem...
LinesSeconds
20k7 s.
40k25 s.
80k102 s.

Another Update:

Prompted by some insightful responses, I decided to minimize my dependence on regex in this case. I decided to buffer the file manually, and got the 80k line test time down to just over 1.2 seconds.

while (<STDIN>) { if ( /^##/ ) { my $rid = substr($buffer, 19, 10); $rid =~ s/\s*$//g; print $buffer if $keeplist{$rid}; $buffer = ''; } $buffer .= $_; }

YAU:

The performance problem w/ the regex is directly addressed and explained quite well by TheDamian in his must-read book Perl Best Practices. I almost want to keep the secret 'cause I feel so strongly that every perl programer should have this book. But... The extenstive tracking and back-tracking from .* seems to be the problem. Read about it in 'Unconstrained Repetitions' on p. 250.

Thanks for taking a look...

Replies are listed 'Best First'.
Re: Multi-line Regex Performance
by BrowserUk (Pope) on Nov 01, 2005 at 15:52 UTC

    You could avoid the regex engine entirely, and also the risk of swapping should your file(s) get bigger, by setting $/ = "##"; and reading the file one multi-line record at a time:

    #! perl -slw use strict; while( $_ = do{ local $/ = "##REPORT"; <DATA> } ) { next if $. == 1; ##skip first empty record. print substr $_, 11, 10; } __DATA__ ##REPORT 01 A ARCUR1GMU# 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 03 FROM FVENDAP1 04 CODE: 10 RECORDING & REPORTING GMU / NON-WHQ ONLY RE +PORTS 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02 ##REPORT 01 A ARCUR10000 00112106 F N ARCUR1 ARCU +R1 P 02 AR CURRENCY CONVERSION CONTROL - (BY PTR) + N 07 00112704 00200103 11 R 3 RD.RDSMODEL C 00 +08 0001 12 A 0007 000 000 00 +0 000 13 VDLTP02 14 S N VDLTB02

    This PBP-rated -4 code example produces

    P:\test>junk2 ARCUR1GMU# ARCUR10000

    And should run considerably more quickly.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Multi-line Regex Performance
by japhy (Canon) on Nov 01, 2005 at 15:45 UTC
    Your regex is awfully slow. That's what it comes down to. You're inching forward character-by-character until you come to '##' or the end of the string. The regex could be made amazingly faster by just making this one change:
    $str =~ /(##(?:.*\n)*)(?=##|\Z)/g
    That reads "lines" at a time until it gets to '##' or \Z. Another way you could write it (to be very fast) is:
    $str =~ /(##(?:[^#]+|#[^#])*/g
    That doesn't even need the look-ahead.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Multi-line Regex Performance
by japhy (Canon) on Nov 01, 2005 at 16:00 UTC
    Here's another way to do the task which doesn't require reading the entire file into memory.
    #!/usr/bin/perl use strict; use warnings; my %keeplist; open my $rids, '<', 'rids_that_have_versions.txt' or die $!; while (<$rids>) { chomp; $keeplist{$_} = 1;} close $rids; while (<STDIN>) { next unless /^##/; # only deal with '##' lines my $rid = substr($_, 19, 10); print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n"; }

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: Multi-line Regex Performance
by ikegami (Pope) on Nov 01, 2005 at 15:48 UTC

    $str =~ /\G(##.*?)(?=(##|\Z))/gs
    might be faster, but it's probably the same.

    $str =~ /\G(##(?:(?!$##).)*)/gs
    is something to try.

    Taking advantage of the fact that '##' is at the begining of a line might help greatly.

    But you don't even need a regexp. Use index, which is lightyears faster:

    my $sep = '##'; my $sep_len = length($sep); my $last_pos = index($str, $sep); for (;;) { my $pos = index($str, '##', $last_pos + $sep_len); my $rec; if ($pos < 0) { $rec = substr($str, $last_pos); last if not length $rec; } else { $rec = substr($str, $last_pos, $pos-$last_pos); $last_pos = $pos; } my $rid = substr($rec, 19, 10); print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n"; }
    You can even avoid copying stuff into $rec:
    my $sep = '##'; my $sep_len = length($sep); my $last_pos = index($str, $sep); for (;;) { my $pos = index($str, '##', $last_pos + $sep_len); my $rec_pos = $last_pos; if ($pos < 0) { last if $last_pos == length($str); $last_pos = length($str); } else { $last_pos = $pos; } my $rid = substr($str, $rec_pos+19, 10); print "$rid: " . ($keeplist{$rid} ? 'y' : 'n') . "\n"; }

    Update: Added code. Untested.

Re: Multi-line Regex Performance
by sauoq (Abbot) on Nov 01, 2005 at 15:45 UTC

    Can you tell us what is supposed to be the hash key in this input? It's impossible to tell by reading the code without knowing if there are tabs in your whitespace. And, are these fixed length records? (Or even a fixed number of lines?)

    If there are no hashes in the records except the doubled ones, your regex could probably be as simple as /(##[^#]*)/g. (And it would be a lot faster.)

    -sauoq
    "My two cents aren't worth a dime.";
    

      The key for checking whether one of the multi-line records is a keeper or not is at position 19 for a length of ten on the first line (segment '01' in our parlance.)

      There could be hash characters in some of the fields -- many of them are freeform and take addresses, comments, etc. There will not be any in the first position though, other than the ones that denote new records. Thanks sauoq.

        The key for checking whether one of the multi-line records is a keeper or not is at position 19 for a length of ten on the first line

        This doesn't really tell us anything that the substr() hadn't already told us. The thing is, we can't reliably count whitespace in the data you provided.

        But anyway, since what you want is always on that first line, it's probably a lot easier (and more efficient than using a regular expression) to just read the data line by line ignoring lines that don't match /^##/ and doing what you want with the ones that do. This would have the added benefit of not keeping 300+MB in RAM.

        while (<>) { next unless /^##/; my $key = substr $_, 19, 10; do_stuff_with($key); }

        -sauoq
        "My two cents aren't worth a dime.";
        

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://504605]
Approved by tinita
Front-paged by perlfan
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (3)
As of 2021-07-23 23:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?