PerlMonks  

Matching lines in 2+ GB logfiles.

by dbmathis (Scribe)
on May 01, 2008 at 15:04 UTC ( #683943=perlquestion )

dbmathis has asked for the wisdom of the Perl Monks concerning the following question:

Hi Everyone,

I have been searching through this site for tips on matching lines in huge logfiles, and I came across the following node. The script in that node works great and it's almost exactly what I need, but it only returns the text that I am searching for. When I modify it to fit my needs it slows down.

Ref
http://www.perlmonks.org/?node_id=128925

#!/usr/bin/perl -w
#
# Proof-of-concept for using minimal memory to search huge
# files, using a sliding window, matching within the window,
# and using /gc and pos() to restart the search at the
# correct spot whenever we slide the window.
#
# Doesn't correctly handle potential matches that overlap;
# the first fragment that matches wins.
#
use strict;

use constant BLOCKSIZE => (8 * 1024);

&search("bighuge.log", sub { print $_[0], "\n" }, "<img[^>]*>");

sub search {
    my ($file, $callback, @fragments) = @_;
    local *F;
    open(F, "<", $file) or die "$file: $!";
    binmode(F);

    # prime the window with two blocks (if possible)
    my $nbytes = read(F, my $window, 2 * BLOCKSIZE);
    my $re = "(" . join("|", @fragments) . ")";

    while ( $nbytes > 0 ) {
        # match as many times as we can within the
        # window, remembering the position of the
        # final match (if any).
        while ( $window =~ m/$re/oigcs ) {
            &$callback($1);
        }
        my $pos = pos($window);

        # grab the next block
        $nbytes = read(F, my $block, BLOCKSIZE);
        last if $nbytes == 0;

        # slide the window by discarding the initial
        # block and appending the next.  then reset
        # the starting position for matching.
        substr($window, 0, BLOCKSIZE) = '';
        $window .= $block;
        $pos -= BLOCKSIZE;
        pos($window) = $pos > 0 ? $pos : 0;
    }
    close(F);
}

For example, the regex search doesn't search by line; it searches across the entire block and then prints out the matches.

I was searching for e-mail addresses in a 2 GB maillog file, and when the script finds an e-mail address it just spits out the address itself.

So I modified:

while ( $window =~ m/$re/oigcs ) { &$callback($1); }

To look like this to capture the line (which is what I need):

while ( $window =~ m/\w{3}\s{1,2}\d{1,2}.*$re.*\n/oigc ) { &$callback($1); }

And things slowed considerably. It went from 30 seconds to several minutes. How should I modify the code above so that it prints the line in which the match was found, without slowing down the search?

Here is a sample of the lines in the file:

Feb 24 04:03:47 server sendmail[]: khdkahsdad876sad8: to=<sample@collegeclub.com>, delay=1+13:12:11, xdelay=00:00:00, mailer=esmtp, pri=25672345, relay=collegeclub.com., dsn=4.0.0, stat=Deferred: Connection timed out with collegeclub.com.
Feb 24 04:03:47 server sendmail[31356]: madhksadkh5574: to=<sample@iit.edu>, delay=1+13:20:32, xdelay=00:00:00, mailer=esmtp, pri=26574dffd, relay=sample.iit.edu. [006.47.143.000], dsn=4.3.1, stat=Deferred: 452 sample 4.2.1 Mailbox temporarily disabled: sample@iit.edu


After all this is over, all that will really have mattered is how we treated each other.

Replies are listed 'Best First'.
Re: Matching lines in 2+ GB logfiles.
by mscharrer (Hermit) on May 01, 2008 at 16:02 UTC
    The reason for the slow execution is most likely the two .* in the regex, which result in a very high number of checks inside the regex engine. This is difficult to explain unless you know what backtracking is and how it works.
    For now just try this:
    while ( $window =~ m/\w{3}\s{1,2}\d{1,2}([^\n]+)\n/oigc && $1 =~ /$re/ ) {
        &$callback($1);
    }

    Precompiling $re using qr{} is recommended, or use the /o modifier.

      I could not get this to work. It would not match anything.

Re: Matching lines in 2+ GB logfiles.
by linuxer (Curate) on May 01, 2008 at 15:28 UTC

    Just my first thought; so instead of

    while ( $window =~ m/\w{3}\s{1,2}\d{1,2}.*$re.*\n/oigc ) {
    you could try
    while ( $window =~ m/\w\w\w\s\s?\d\d?.*$re.*\n/iogc ) {

    \w\w\w should run faster than \w{3}; the same goes for \d\d? instead of \d{1,2}

    Edit: and same with \s\s? vs. \s{1,2}. The direction should be clear.

    Edit2: Maybe precompiling the regex with the qr// operator might give another speedup.
    By the way, I can't remember that /c modifier; what is it for?

      The /c modifier is always used together with the /g modifier and allows continued search after a failed /g match. Normally pos() is reset after a failed match.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Matching lines in 2+ GB logfiles.
by NetWallah (Canon) on May 01, 2008 at 16:07 UTC
    Here is a modified version (with my test parameters - please reset them to match your current ones).

    This version adds a SECOND read for the log file, line-at-a-time, trading I/O for CPU, but should still be pretty fast.

    It prints out the line number, and a bunch of other diagnostic/unnecessary info for where the match occurred.

    #!/usr/bin/perl -w
    #
    # Proof-of-concept for using minimal memory to search huge
    # files, using a sliding window, matching within the window,
    # and using /gc and pos() to restart the search at the
    # correct spot whenever we slide the window.
    #
    # Doesn't correctly handle potential matches that overlap;
    # the first fragment that matches wins.
    #
    use strict;

    use constant BLOCKSIZE => 20; ##(8 * 1024);

    my @findoffset;
    my $file = "ascii-code.htm";
    search( $file,  #"bighuge.log",
            sub { print $_[0], " at offset $_[1]\n"; push @findoffset, $_[1]; },
            # "<img[^>]*>");
            "javasc");

    # Re-read file as lines
    $_ = 0 for my ($line, $offset, $prev, $idx);
    open(my $F, "<", $file) or die "$file: $!";
    while (<$F>) {
        $line++;
        my $len = length($_);
        next unless (($offset += $len) >= $findoffset[$idx]);
        print "$line,$offset,$findoffset[$idx],$len:\t$_";
        $idx++;
        last if $idx > $#findoffset;
    }
    close($F);

    #------------------------------------------
    sub search {
        my ($file, $callback, @fragments) = @_;
        my $byteoffset = 0;
        open(my $F, "<", $file) or die "$file: $!";
        binmode($F);

        # prime the window with two blocks (if possible)
        my $nbytes = read($F, my $window, 2 * BLOCKSIZE);
        my $re = "(" . join("|", @fragments) . ")";

        while ( $nbytes > 0 ) {
            # match as many times as we can within the
            # window, remembering the position of the
            # final match (if any).
            while ( $window =~ m/$re/oigcs ) {
                $callback->($1, $byteoffset);
            }
            my $pos = pos($window);

            # grab the next block
            $byteoffset += $nbytes;
            $nbytes = read($F, my $block, BLOCKSIZE);
            last if $nbytes == 0;

            # slide the window by discarding the initial
            # block and appending the next.  then reset
            # the starting position for matching.
            substr($window, 0, BLOCKSIZE) = '';
            $window .= $block;
            $pos -= BLOCKSIZE;
            pos($window) = $pos > 0 ? $pos : 0;
        }
        close($F);
    }
    Update 1: Note - there may be subtle issues (I hate to say bugs) under boundary conditions where multiple matches occur on the same line. Special-case code needs to be added to handle these, if this condition is expected.

    Update 2: Thinking about this some more leads me to believe this is not the right way to go about it. It would be a lot more efficient to track newlines on the First read, and buffer/capture/print the lines containing the text right at the spot.

    In other words, in addition to passing the Matching $1, the search sub should callback with the line of text, in context. There may be an issue requiring more sliding window buffering, in case the "line" is split across buffers.

         "How many times do I have to tell you again and again .. not to be repetitive?"

      This worked but was not any faster than egrep. I may just be stuck waiting 30 minutes for egrep to grep these huge files.

Re: Matching lines in 2+ GB logfiles.
by samtregar (Abbot) on May 01, 2008 at 16:51 UTC
    On modern hardware 2GB+ isn't really very big. Have you tried just reading it line-by-line with <F>? I don't know what your performance requirements are but most log-parsing jobs aren't terribly performance sensitive.

    You might find that you don't have to tune your regex much once you switch to reading line-by-line. That's because each line will be much smaller than 8K, so the penalty for backtracking on a .* will be consequently much smaller.

    -sam

      I am basically looking for something faster than grep. I am being forced to grep these huge maillogs that are around 2.5 GB each and I have 6 of these to search through.

        I am basically looking for something faster than grep
        You're unlikely to find anything much faster than grep - it's a program written in C and optimised to scan through a text file printing out lines that match. You may also be running into I/O limits. For example, you could try
        time wc -l bigfile
        which will effectively give you a lower bound (just read the contents of the file and find the \n's). If the grep isn't a whole lot slower than that, then there's probably no way to speed it up.

        Dave.

        If you generally know what you are looking for ahead of time, one method is to keep a process always running that tails a log file. This process can then send everything it finds to another file, which can be searched instead.

        If you need to beat grep, you can, but you have to do things that grep can't. This includes knowing how the files are laid out on disk (esp RAID), and how many CPUs you can take advantage of (i.e. lower transparency to raise performance). You can write a multithreaded (or multiprocess) script that will read through the file at specific offsets in parallel. This may require lots of tweaking though (e.g. performance depends on how the filesystem prefetches data, and what the optimum read size is for your RAID). FWIW, you may want to look around for a multithreaded grep.

        Perl is more flexible and powerful than grep but definitely not faster. Also AFAIK grep (or was it egrep?) uses a finite automaton, not a backtracking engine like perl, so it is much faster but much less flexible, i.e. it doesn't support back-references, etc.

        Try to optimise your regex to speed things up. In perl you can use use re 'debug'; to show how many steps your regex performs.
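        For example, a quick way to see the trace (the test string here is a made-up fragment of a syslog timestamp); the compiled program and every match step go to STDERR:

```shell
# 'use re "debug"' dumps the compiled regex program plus each match attempt
# to STDERR, which makes backtracking visible and countable
dbg=$(perl -e 'use re "debug"; "Feb 24 line" =~ /\w{3}\s{1,2}\d{1,2}/;' 2>&1)
printf '%s\n' "$dbg" | head -3      # show just the start of the trace
```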

Re: Matching lines in 2+ GB logfiles.
by educated_foo (Vicar) on May 01, 2008 at 16:40 UTC
    Regarding the regex, I would suggest using ^ and $ along with the /m modifier instead of matching for "\n". On a tangential note, this kind of thing is much simpler if you use Sys::Mmap, like in the wide finder benchmark.
Re: Matching lines in 2+ GB logfiles.
by Anonymous Monk on May 02, 2008 at 01:05 UTC

    Has anyone here who is claiming that perl can't outrun grep actually run the script that I posted here, which dws wrote? This dws guy is on to something. I was finally able to modify it to work like grep and it's 14 times faster than grep. I am working with a 484 MB maillog.

    This could be more elegant, but this is my rookie solution:

    while ( $window =~ m/([a-zA-Z]{3}\s{1,2}\d{1,2}.*\n)/oigc ) {
        $line = $1;
        if ( $1 =~ /$re/ ) {
            &$callback($line);
        }
    }
    ls -ltrh /var/log/syslog-ng/server2/ | grep maillog.2
    -rw-r----- 1 root logs 484M Mar 11 11:13 maillog.2
    -rw-r----- 1 root logs 230M Apr  1 04:10 maillog.2.gz
    [dmathis@aus02syslog ~]$ date; ./jujuspeed; date
    Thu May  1 19:27:57 CDT 2008
    Feb 28 09:53:49 exmx2 sendmail[XXXXX]: 8791: to=<hidden@hotmail.com>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=X3604, relay=mx1.hotmail.com. [X5.5X.2X5.X], dsn=2.0.0, stat=Sent ( <X4X0399.120421402XXXX.JavaMail.root@hidden.com> Queued mail for delivery)
    Thu May  1 19:28:10 CDT 2008
    Time taken: 13 Seconds
    [dmathis@aus02syslog ~]$ date; egrep -i 'hidden@hotmail.com' /var/log/syslog-ng/server2/maillog.2; date
    Thu May  1 19:28:48 CDT 2008
    Feb 28 09:53:49 exmx2 sendmail[XXXXX]: 8791: to=<hidden@hotmail.com>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=X3604, relay=mx1.hotmail.com. [X5.5X.2X5.X], dsn=2.0.0, stat=Sent ( <X4X0399.120421402XXXX.JavaMail.root@hidden.com> Queued mail for delivery)
    Thu May  1 19:31:57 CDT 2008
    Time Taken: 189 Seconds

    Thanks for all of the help on here. I have learned a lot :)

      while ( $window =~ m/([a-zA-Z]{3}\s{1,2}\d{1,2}.*\n)/oigc ) {
          $line = $1;
          if ( $1 =~ /$re/ ) {
              &$callback($line);
          }
      }
      This is very close to what mscharrer suggested before.
        Indeed!


Node Type: perlquestion [id://683943]
Approved by marto