Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

perl performance vs egrep

by dba (Monk)
on Jan 23, 2005 at 13:38 UTC ( #424382=perlquestion: print w/replies, xml ) Need Help??
dba has asked for the wisdom of the Perl Monks concerning the following question:

Iam doing a basic pattern search of about 20GB text file. (split into 10 text files each of 2GB). I use egrep (since I have multiple patterns, using | option). It is slow. Before I try to do the same in perl to do benchmarking, I request the wisdom of esteemed monks, if you have any insight into solving similar issues. Thanks, dba
I do use 10 parallel egrep to search 2 GB each. But for a single process it takes 35 minutes. Non-perl question: Is there a way for me to tell perl or egrep to use more memory than it normally uses. I have a lot of memory on the server.
The text files are not compressed. This is my sample perl code to benchmark timings:
use strict; open(OFILE,">output.txt") or die "Unable to open output file"; open(IFILE,"<input.txt") or die "Unable to open input file"; while ( <IFILE>) { print OFILE unless /^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1| +^SZXX1|^WXXX1|^YZXX1/ } close(IFILE); close(OFILE);

Any suggestions to tune this?
Thank you monks for your great suggestions. I did some benchmarking. Instead of working with 2GB file, I split the text file to the first 1 million rows (approximately 300MB) and used it to benchmark.
I also modified the regex to  /^(CP|KL|KM|ME|PA|PM|SL|SZ|WX|YZ)XX1/
1. for cat to just copy, it took 5 seconds
2. for perl it took 14 seconds
3. for egrep it took 22 seconds
(Both perl and egrep used the same regex as above. I actually ran the tests twice to avoid any discrepencies)
The change in regex made a world of difference in both egrep and perl.
perl performed better than egrep to my surprise.
Thanks again for all your great suggestions. I will try demerphq's solution and update with the results.

Replies are listed 'Best First'.
Re: perl performance vs egrep
by holli (Monsignor) on Jan 23, 2005 at 15:44 UTC
    Your regex can be optimized a little. This:
    should be faster than the one you provided.

    holli, regexed monk
      base on my "programming perl" 2nd edition P538, things like

      if /one-hump/ || /two/; is likely to be faster than: if /one-hump|two/;
      so you may speed it up with this sorta regex.

        Sure that is somewhat common advice. But it doesnt apply here. Consider the regex holli posted:


        This allows Perls regex algorithm to exploit a number of optimizations that aren't available in your variant. First, the regex has a common constant string 'XX1' so the regex engine knows it cant match unless it finds that first (and once found it knows where to start looking to see if its a match). Second the algorithm is anchored at the start which means the search is constrained to one spot. It won't traverse the options more than once per line.

        The trick you showed actually allows the constant string match to work, but is still probably a touch slower than using two index calls.


Re: perl performance vs egrep
by Aristotle (Chancellor) on Jan 23, 2005 at 15:49 UTC

    egrep will be faster than Perl here. ambrus is right on as well, this is an issue where I/O will limit you, not CPU.

    In any case, if you want to use Perl, always check the regex compiler's decisions using the re pragma when speed matters.

    In this case you can accelerate the regex by pulling the anchor out of the alternation. Pay attention to how the compiler only says “minlen 5” for the first version (it inferred that a string with less than 5 characters cannot possibly match), while with the second with pulled out anchor it found “anchored(BOL) minlen 5” (the regex is also anchored at the beginning of the string). The regex engine will actually apply the first pattern at every character position of a line (except the last four, where it knows there are not enough characters to satisfy the pattern), while it will only bother to apply the second regex at the beginning of the line because it knows it cannot match anywhere else.

    Sometimes just a very subtle change in a pattern can trigger or defeat such optimizations, so be sure to check when you need to eke out every last ounce of speed. And then benchmark it when an optimization or lack thereof is not a sure win or loss; sometimes it is not obvious how certain decisions on the compiler's part will affect performance on different kinds of data sets.

    With egrep using a DFA-based matcher on the other hand there is absolutely no difference between different ways of writing the same pattern.

    $ perl -Mre=debug -e'/^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1 +|^SZXX1|^WXXX1|^YZXX1/' Compiling REx `^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1|^SZXX1 +|^WXXX1|^YZXX1' size 51 Got 412 bytes for offset annotations. 1: BRANCH(6) 2: BOL(3) 3: EXACT <CPXX1>(51) 6: BRANCH(11) 7: BOL(8) 8: EXACT <KLXX1>(51) 11: BRANCH(16) 12: BOL(13) 13: EXACT <KMXX1>(51) 16: BRANCH(21) 17: BOL(18) 18: EXACT <MEXX1>(51) 21: BRANCH(26) 22: BOL(23) 23: EXACT <PAXX1>(51) 26: BRANCH(31) 27: BOL(28) 28: EXACT <PMXX1>(51) 31: BRANCH(36) 32: BOL(33) 33: EXACT <SLXX1>(51) 36: BRANCH(41) 37: BOL(38) 38: EXACT <SZXX1>(51) 41: BRANCH(46) 42: BOL(43) 43: EXACT <WXXX1>(51) 46: BRANCH(51) 47: BOL(48) 48: EXACT <YZXX1>(51) 51: END(0) minlen 5 Offsets: [51] 0[0] 1[1] 2[5] 0[0] 0[0] 7[1] 8[1] 9[5] 0[0] 0[0] 14[1] 15[1] 16[5 +] 0[0] 0[0] 21[1] 22[1] 23[5] 0[0] 0[0] 28[1] 29[1] 30[5] 0[0] 0[0] 3 +5[1] 36[1] 37[5] 0[0] 0[0] 42[1] 43[1] 44[5] 0[0] 0[0] 49[1] 50[1] 51 +[5] 0[0] 0[0] 56[1] 57[1] 58[5] 0[0] 0[0] 63[1] 64[1] 65[5] 0[0] 0[0] + 70[0] Freeing REx: `"^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1|^SZXX1 +|^WX"......' $ perl -Mre=debug -e'/^(?:CPXX1|KLXX1|KMXX1|MEXX1|PAXX1|PMXX1|SLXX1|SZ +XX1|WXXX1|YZXX1)/' Compiling REx `^(?:CPXX1|KLXX1|KMXX1|MEXX1|PAXX1|PMXX1|SLXX1|SZXX1|WXX +X1|YZXX1)' size 43 Got 348 bytes for offset annotations. first at 2 1: BOL(2) 2: BRANCH(6) 3: EXACT <CPXX1>(43) 6: BRANCH(10) 7: EXACT <KLXX1>(43) 10: BRANCH(14) 11: EXACT <KMXX1>(43) 14: BRANCH(18) 15: EXACT <MEXX1>(43) 18: BRANCH(22) 19: EXACT <PAXX1>(43) 22: BRANCH(26) 23: EXACT <PMXX1>(43) 26: BRANCH(30) 27: EXACT <SLXX1>(43) 30: BRANCH(34) 31: EXACT <SZXX1>(43) 34: BRANCH(38) 35: EXACT <WXXX1>(43) 38: BRANCH(42) 39: EXACT <YZXX1>(43) 42: TAIL(43) 43: END(0) anchored(BOL) minlen 5 Offsets: [43] 1[1] 4[1] 5[5] 0[0] 0[0] 10[1] 11[5] 0[0] 0[0] 16[1] 17[5] 0[0] 0[ +0] 22[1] 23[5] 0[0] 0[0] 28[1] 29[5] 0[0] 0[0] 34[1] 35[5] 0[0] 0[0] +40[1] 41[5] 0[0] 0[0] 46[1] 47[5] 0[0] 0[0] 52[1] 53[5] 0[0] 0[0] 58[ +1] 59[5] 0[0] 0[0] 63[0] 65[0] Freeing REx: `"^(?:CPXX1|KLXX1|KMXX1|MEXX1|PAXX1|PMXX1|SLXX1|SZXX1|WXX +X1|Y"......'

    Makeshifts last the longest.

      This version looks to give the optimiser even more hints anchored `XX1' at 2 (checking anchored) anchored(BOL). Now if only we had some test data to try the Sys::Mmap optimisation and demerphq's++ nice inversion of the problem at 424532

      nph>perl -Mre=debug -e'/^(CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1/' Freeing REx: `","' Compiling REx `^(CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1' size 62 Got 500 bytes for offset annotations. first at 2 1: BOL(2) 2: OPEN1(4) 4: BRANCH(7) 5: EXACT <CP>(58) 7: BRANCH(21) 8: EXACT <K>(10) 10: ANYOF[LM](58) 21: BRANCH(24) 22: EXACT <ME>(58) 24: BRANCH(38) 25: EXACT <P>(27) 27: ANYOF[AM](58) 38: BRANCH(52) 39: EXACT <S>(41) 41: ANYOF[LZ](58) 52: BRANCH(55) 53: EXACT <WX>(58) 55: BRANCH(58) 56: EXACT <YZ>(58) 58: CLOSE1(60) 60: EXACT <XX1>(62) 62: END(0) anchored `XX1' at 2 (checking anchored) anchored(BOL) minlen 5 Offsets: [62] 1[1] 2[1] 0[0] 2[1] 3[2] 0[0] 5[1] 6[1] 0[0] 7[4] 0[0] 0[0] 0[ +0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 11[1] 12[2] 0[0] 14[1] 15[1] 0[ +0] 16[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 20[1] 21[1 +] 0[0] 22[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 26[1] +27[2] 0[0] 29[1] 30[2] 0[0] 32[1] 0[0] 33[3] 0[0] 36[0] Freeing REx: `"^(CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1"'


      as the OP says using egrep is a valid solution he must not need to capture any parts of the match so lets make this non-capturing and revist the debug, this time with added interactivity.
      nph>perl -Mre=debug -ne'/^(?:CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1/;print" +>"' Freeing REx: `","' Compiling REx `^(?:CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1' size 59 Got 476 bytes for offset annotations. first at 2 1: BOL(2) 2: BRANCH(5) 3: EXACT <CP>(57) 5: BRANCH(19) 6: EXACT <K>(8) 8: ANYOF[LM](57) 19: BRANCH(22) 20: EXACT <ME>(57) 22: BRANCH(36) 23: EXACT <P>(25) 25: ANYOF[AM](57) 36: BRANCH(50) 37: EXACT <S>(39) 39: ANYOF[LZ](57) 50: BRANCH(53) 51: EXACT <WX>(57) 53: BRANCH(56) 54: EXACT <YZ>(57) 56: TAIL(57) 57: EXACT <XX1>(59) 59: END(0) anchored `XX1' at 2 (checking anchored) anchored(BOL) minlen 5 Offsets: [59] 1[1] 4[1] 5[2] 0[0] 7[1] 8[1] 0[0] 9[4] 0[0] 0[0] 0[0] 0[0] 0[ +0] 0[0] 0[0] 0[0] 0[0] 0[0] 13[1] 14[2] 0[0] 16[1] 17[1] 0[0] 18[4] 0 +[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 22[1] 23[1] 0[0] 24[ +4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 28[1] 29[2] 0[0] + 31[1] 32[2] 0[0] 33[0] 35[3] 0[0] 38[0] >SLXC1 Guessing start of match, REx `^(?:CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1' a +gainst `SLXC1 '... String not equal... Match rejected by optimizer >

      If you knew the frequency of each of the pre-fixes you would optimise by putting the most common first, or perhaps if there were a vast number of lines with something you certainly did not want (e.g. RWXX1) then you may even add a [^R] in there right at the start.


      Pereant, qui ante nos nostra dixerunt!
Re: perl performance vs egrep
by ambrus (Abbot) on Jan 23, 2005 at 14:25 UTC

    I belive that the bottleneck in such a search is probably disk access. When I grep through between 10M and 100M of data, it seems that it's always much faster if I rerun the search, as the files are cached in memory at that time. Egrep is very fast, as it builds a dfa, so it's probably faster than perl; however, as you probably can't hold all 20G of data in the memory, the difference will be probably negligable.

    Update: as a consequence, if you want to make grepping fast, you'll have to make disk access faster; make sure dma is set up correctly, put the 10 files on different disks if you can, copy the data to hard disk if they are on cd or dvd or nfs now. Also, do multiple searces at once if you can. (You're not zgrepping compressed files, are you? That would slow things down.)

    Update 2: as for your update: to tell how much memory they are using, use top or ps if this is on a unix box.

Re: perl performance vs egrep
by grinder (Bishop) on Jan 23, 2005 at 16:29 UTC
    Any suggestions to tune this?

    Cool. I can offer a suggestion that uses a module I wrote. Regexp::Assemble is designed to optimise expressions like this. Consider the following:

    use Regexp::Assemble; my $r = Regexp::Assemble->new; while( <DATA> ) { chomp; $r->add( $_ ); } print $r->as_string; # produces ^(?:K[LM]|P[AM]|S[LZ]|CP|ME|WX|YZ)XX1 __DATA_ ^CPXX1 ^KLXX1 ^KMXX1 ^MEXX1 ^PAXX1 ^PMXX1 ^SLXX1 ^SZXX1 ^WXXX1 ^YZXX1

    You can also do this as a one liner:

    print Regexp::Assemble ->new( chomp=> 1 ) ->add( <DATA> ) ->as_string;

    The other thing you can do is take the assembled expression, remove the ?: to turn it into a POSIX-compatible expression, and feed that to egrep. (see Aristotle's words of wisdom below).

    update: hmm, I see that this produces the same thing that other posters have done by hand. Bear in mind that R::A is more at home when dealing with hundreds or thousands of discrete expressions: that is where it starts to shine. With 10 expressions it's like a car in first gear. Still, if your real-world environment is using more than you show here, you will see a definite improvement.

    Another thing it's good at is when your expressions aren't so, um, regular. Consider what happens when PMXX1 is changed to PMXX2. (Hint: ^(?:(?:K[LM]|S[LZ]|CP|ME|WX|YZ)XX1|P(?:AXX1|MXX2))). That's not quite as easy to work out by hand.

    - another intruder with the mooring in the heart of the Perl

      The other thing you can do is take the assembled expression, remove the ?: to turn it into a POSIX-compatible expression, and feed that to egrep.

      Except that makes zero difference because egrep uses a DFA engine (as opposed to Perl's NFA.) To a DFA engine it doesn't matter which of any number of equivalent regexen you use. So long as they all match the exact same things, all of them will be translated to the exact same state machine.

      Makeshifts last the longest.

      I just thought id mention that assuming you are talking about regexes of the form /^(LIST|OF|LITERALS)/ (ie no regex special characters involved and left anchored at the start of the string) then once you get over a small handful of words (last time i checked it was around 50 or so) you can actually outperform perls regex engine with a pure perl trie. Perl really doesnt handle this type of pattern very well currently, and as a pet project im working on creating two new regops, TRIE and DFA which will basically do all of this type of optimization automatically. (Which will have the disadvantage that it will make regexs like your optimized one actually perform worse than a straight list of options.) Incidentally you will probably find that if you omit the class logic the regex will run faster. As someone else mentioned already, classes disable a lot of optimizations that the engine can do.


Re: perl performance vs egrep
by exussum0 (Vicar) on Jan 23, 2005 at 13:55 UTC
    Let's step back and discuss what you'd like to happen. You are looking for some arbitrary text in 10 files. When you do a basic search for a file using grep, some data is slurped into memory from disc (usually) and then the text is searched for in memory.

    If you can split this into many jobs, even using 10 egrep processes, you may be able to do better. In any modern OS, when one process is using a disc, the cpu is kinda idle for that moment. Another process could grab the cpu for that time and "do stuff".

    Have you tried running searches in parallel? I doubt, though I wouldn't rule it out, that egrep doesn't do multiprocess/threaded searching.

    Give me strength for today.. I will not talk it away..
    Just for a moment.. It will burn through the clouds.. and shine down on me.

      I wouldn't think threaded searching would help. The operating system knows that many programs read file sequentially, thus, if you read some part of a file from disk, the OS will probably read further if the disk is free, so that the program can access the rest of it faster. (You may even be able to override this behaiviour with the posix_fadvise function.)

        My first thought about making the searching multi-threaded is that the disk then has to read from multiple files (for each thread). This probably means the read head on the hard drive will be required to move around the surface of the disk more than just reading each file sequentially.

        It is hard to predict which approach will give the best read performance. I think it's a reasonable to assume that your OS and filesystem try to keep the files stored sequentially on the disk so I would expect searching each file in sequence is probably faster.

        Maybe you might want to time running 'wc' on all of the files in sequence vs all of the files at different concurrencies to see what works best for reading the data.

        The whole point is your disk is probably much slower than your CPU so it will probably be a much bigger bottleneck especially if you go and start moving the read head around a lot.
        Buffering will help, no doubt, but at some point, the process will have to block to read in a couple of bytes or a large chunk of data. Also in the situation with dual CPUs, having both chug away could be adventageous.

        Give me strength for today.. I will not talk it away..
        Just for a moment.. It will burn through the clouds.. and shine down on me.

Re: perl performance vs egrep
by davido (Archbishop) on Jan 23, 2005 at 16:22 UTC

    How about if you do a two-stage process of elimination: (please see the 'Update' first).

    while( <IFILE> ) { next unless m/^[CKMPSWY]/; print OFILE unless /^(?:CP|K[LM]|ME|P[AM]|S[LZ]|WX|YZ)XX1/; }

    This sort of "optimization" is highly sensitive to the type of data, however, if lines starting with [CKMPSWY] are sparse, it will reject non-matches much faster.

    Update: Uggh, my logic is backwards in that snippet. The point is that if you can reject non-matches earlier, with fewer cycles, you save a little time.

    I also wanted to mention the "print late" philosophy. If you're IO bound, you might be better off storing a few dozen lines in a scalar and printing just once every few-dozen iterations instead of on every iteration. This will maximize the effectiveness of the OS's buffering.


      If you're IO bound, you might be better off storing a few dozen lines in a scalar and printing just once every few-dozen iterations instead of on every iteration.

      Im confused by this comment. The only reason I can see such a strategy making a difference is that it will reduce the number of print calls. I'd be surprised if user caching is more effective than the caching that perl itself does.


        print is just plain expensive. Here's a really quickly hacked together benchmark: $ perl -we'open my $zap, ">", "/dev/null"; use Benchmark "cmpthese"; cmpthese -10, { every => sub { for (1..100) { print $zap $_ } }, tens => sub { my $x; for (1..100) { $x .= $_; print $zap substr($x,0,1000,"") unless $x % 10 } } }'
Re: perl performance vs egrep
by sgifford (Prior) on Jan 23, 2005 at 21:22 UTC
    I wrote up a node on using mmap with some grep-like code which ran about twice as fast as processing it line-at-a-time. That might be worth a try; mmap provides very fast file I/O.
Re: perl performance vs egrep
by borisz (Canon) on Jan 23, 2005 at 15:55 UTC
    How log do you spend on a single cat infile >outfile?
    Beside that, I think without testing, this is faster as your re:
    print OFILE unless /^(?:K[LM]|P[AM]|S[LZ]|CP|ME|WX|YZ)XX1/;
      Beside that, I think without testing, this is faster as your re:
      Without testing as well, I doubt that. The Perl regexp engine has a pretty good optimizer, and it does quite well with fixed strings. But one of the easiest ways to bypass the optimizer is the use of character classes.
Re: perl performance vs egrep (why regex in this example?)
by demerphq (Chancellor) on Jan 24, 2005 at 08:47 UTC

    In this situation i'd investigate whether regexes are even required. I bet the following is faster:

    my %skip=map {$_=>1} qw(CPXX1 KLXX1 KMXX1 MEXX1 PAXX1 PMXX1 SLXX1 SZXX +1 WXXX1 YZXX1); while ( <IFILE>) { print OFILE unless $skip{substr($_,0,5)}; }

Re: perl performance vs egrep
by TedPride (Priest) on Jan 23, 2005 at 22:46 UTC
    It might be helpful if you could post a link to a sample data file to do some testing on. We don't know:

    (a) the average length of the lines
    (b) whether a line is likely to contain a match or not
    (c) whether common substrings (XX1) are likely to occur in a line not containing a match to any of the full strings

    For instance, if matches are uncommon, you could read the file in large chunks (x bytes, then read to next line boundary), perform regex on the chunks, then use index and rindex from each match position to select the output boundaries for the lines in between the match lines. This would probably be far more efficient than reading and matching line by line. If substrings aren't like to occur, you could change your match algorithm to first check each line for XX1 using index, then perform the more complicated (preferably optimized) regex match. Efficiency in the line by line method might also be improved by buffering output and printing in chunks - though depending on how Perl manages output, this might just duplicate internal mechanisms. I'd have to run some tests.

Re: perl performance vs egrep
by YetAnotherDave (Beadle) on Jan 23, 2005 at 17:01 UTC
    Would it give a speed improvement to prepare the regex in advance with qr//?

    my $re = qr/^CPXX1|^KLXX1|^KMXX1|^MEXX1|^PAXX1|^PMXX1|^SLXX1|^SZXX1|^WXXX1|^YZXX1/;
    print OFILE unless /$re/;

    I think other monks have already suggested using a buffer to reduce the number of write, this would probably give a fairly major improvement.

      Would it give a speed improvement to prepare the regex in advance with qr//?
      No, it wouldn´t. At least not as long the regex is, like here, the only one within the loop. The qr//-operator is useful when you have the need of matching the same string to multiple regexes/patterns, like in
      @p = ( qr/^abc/, qr/abc$/, ); for $f ( @foo ) { for $p ( @p ) { do_stuff() if $f =~ $p; } }
      Using qr// doesn´t even help in code like
      while ( $foo =~ /^abc/ && $foo =~ /abc$/ ) { do_stuff(); }
      because in such a case internal perl optimizations come into play.

      holli, regexed monk

      Probably not. The perl compiler should be able to see that the entire regexp is completely, 100% static, and no variables in it whatsoever. Thus, it should compile the regexp once (ideally on first call, but I suspect it's done during compilation) regardless. All you're doing is copying that compiled regexp around a bit. Should have practically no effect.

Re: perl performance vs egrep
by sitz (Initiate) on Jan 24, 2005 at 03:06 UTC
    ...and, of course, since your regular expression remains constant throughout your runtime, you can append an 'o' to whatever regex you wind up using...
    print OFILE unless /^$some_regex/o;
    ...which should speed things up a bit.

      That does nothing in this case since there are no variables in the regex; and without understanding what it does, you can easily cause bugs by using it when there are variables. Don't use /o, use qr//.

      Makeshifts last the longest.

        Be damned; been misreading perlop all this time. I sit corrected, and so forth; thanks. =)

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://424382]
Approved by Courage
Front-paged by Old_Gray_Bear
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (3)
As of 2018-02-25 10:24 GMT
Find Nodes?
    Voting Booth?
    When it is dark outside I am happiest to see ...

    Results (312 votes). Check out past polls.