Re: Pimp My RegEx
by dbwiz (Curate) on May 31, 2005 at 18:43 UTC
I would have said the same thing that dragonchild pointed
out, i.e. that since you are matching the beginning of a line,
you can apply your regexp line by line.
However, I'll tell you something more that will save
you 30 to 50% of execution time:
remove "use English;" and change $INPUT_RECORD_SEPARATOR
to $/. This avoids the malicious influence
of the so-called evil variables ($`, $&, $').
I have actually benchmarked your script with a few hundred
lines of logs, and this change saves about 40% of execution time on
my laptop.
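For reference, a minimal sketch of the change dbwiz describes, using the punctuation variable $/ directly instead of pulling in English (the sample data and the counting pattern below are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Instead of:
#   use English;
#   local $INPUT_RECORD_SEPARATOR = undef;
# use the punctuation variable directly. Plain "use English" (without
# -no_match_vars) mentions $`, $&, and $', which makes Perl populate
# them on every pattern match in the whole program.

my $text = do {
    local $/;          # undef $/ == slurp mode
    <DATA>;
};

# Toy example: count lines that start with a date-like prefix.
my $count = () = $text =~ /^2005-/mg;
print "$count\n";

__DATA__
2005-05-31 18:43:00:001 first entry
  continuation line
2005-05-31 18:44:00:002 second entry
```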
I'll take your advice and pull out the "use English" line. I thought the "use English" syntax would be the kind thing to do for the benefit of my coworkers. They cringe when they see a complex regex, and a "$/" will have them running to the Camel book!
Please elaborate on the "evil variables". I'm not really familiar with them or their implications.
Sometimes, a comment will work just as well, without the overhead. Consider the following:
local $/; # INPUT_RECORD_SEPARATOR
As another option, in the documentation for 'English', there's a suggestion to use:
use English qw( -no_match_vars ) ;
to prevent the problems mentioned.
As for the 'evil variables', the following note is in 'perldoc perlre':

WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression "(?: ... )" instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.
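A small illustration of the non-capturing trick the quote mentions; the date format is borrowed from the thread, and the sample string is made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "2005-05-31 18:43:00:123 some log message";

# Capturing parens make Perl save the submatch into $1:
if ($line =~ /^(\d{4}-\d\d-\d\d)/) {
    print "captured date: $1\n";
}

# When the parens are only there for grouping (alternation,
# quantifiers), (?: ... ) groups without paying the capture cost:
if ($line =~ /^(?:\d{4}-\d\d-\d\d) (?:\d\d:){3}\d\d\d/) {
    print "matched timestamp\n";
}
```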
local $/ = $my_input_sep; # Custom input separator (instead of \n)
Yoda would agree with Perl design: there is no try{}
Re: Pimp My RegEx
by demerphq (Chancellor) on May 31, 2005 at 18:51 UTC
Your regex is really horribly inefficient. Think about what it has to do to match that pattern. A much simpler way to do the same thing is:
#!/usr/local/bin/perl
use warnings;
use strict;
use English;
use Data::Dumper;
use Time::HiRes 'time';
my $logfile;
my $count = 0; # Initialize counter
my $start = time(); # Start the timer
my $fullrec;
while ( <DATA> ) {
    if (/^\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/) {
        $count++;
        #if ($fullrec) {
        #    process($fullrec);
        #}
        $fullrec = $_;
    } else {
        $fullrec .= $_;
    }
}
#if (defined($fullrec)) {
#    process($fullrec);
#}
my $end = time();               # Stop the timer
my $elapsed = $end - $start;    # How long did that take?
my $average = $elapsed/$count;  # Average processing time
printf "Parsed %d log file entries in %.4f seconds, averaging %.4f\n",
    $count, $elapsed, $average;
exit;
__DATA__
<SNIPPED DATA>
I.e., you know that a line starting with a date begins a record, which also implies that it ends the previous record (if one exists). Thus you simply need to check each line to see if it begins a record, and then do something with the previous record you have built up. This also means the pattern is anchored and only needs to compare Lr (the length of the pattern) characters per line, instead of roughly Lf (the length of the file) times Lr comparisons, which is what your code would do (with the lookahead assertion).
So your pattern does something like 55000000*23 character lookups; even worse, most of those are character-class lookups, which are inefficient to begin with. With the line-by-line approach you are dealing with 366000*23 lookups. That's a LOT less. (Actually these are upper bounds, but I think the point is made.)
Update:Fixed as per ikegami's reply.
---
$world=~s/war/peace/g
while ( <DATA> ) {
    if (/^\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/) {
        $count++;
        #if (defined($fullrec)) {
        #    process($fullrec);
        #}
        $fullrec = $_;
    } else {
        $fullrec .= $_;
    }
}
#if (defined($fullrec)) {
#    process($fullrec);
#}
or look at my solution to eliminate this redundancy.
Thanks for sharing your wisdom! I can't believe how quickly you guys came up with replies. The PerlMonks community is a great resource!
I wanted to use file slurp mode so that I could MOVE old log file entries from the current log into a separate monthly log file (and then tar/gzip 'em).
I thought that by using slurp mode, I could essentially edit the file in place without having to create a temporary file. The file system these logs live in is often constrained for space.
Thanks again!
I thought that by using the slurp mode, I could essentially edit the file in place without having to create a temporary file.
No, that is unsound. You can corrupt your log files.
Consider:
- Slurp the log file.
- Apply regex to file contents.
- Write the log file with new file contents.
Now, if the write fails due to "disk full" (a distinct possibility in your environment), or because you lose power at that instant, or for any other reason, you have just corrupted your log file.
Generally, you should write the modified file to a temporary file, and rename the temporary file over the original only after checking that the temporary file was created without error.
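A minimal, self-contained sketch of that temp-file-and-rename pattern; the file name and the filter rule are hypothetical, and File::Temp is told to create the temp file in the same directory so the final rename stays on one filesystem:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $log = "demo.log";   # hypothetical log name for this demo

# Seed a sample log so the sketch is self-contained.
open my $seed, ">", $log or die "Can't create $log: $!";
print {$seed} "2005-05-31 keep this\nOLD drop this\n2005-06-01 keep this too\n";
close $seed;

# Write the filtered output to a temp file in the SAME directory,
# so the final rename cannot cross filesystems.
my ($tmp_fh, $tmp_name) = tempfile("$log.XXXXXX", DIR => ".");

open my $in, "<", $log or die "Can't read $log: $!";
while (my $line = <$in>) {
    next if $line =~ /^OLD /;    # hypothetical filter: drop archived lines
    print {$tmp_fh} $line or die "Write to $tmp_name failed: $!";
}
close $in;
# close() reports delayed write errors such as "disk full".
close $tmp_fh or die "Close of $tmp_name failed: $!";

# Only now, with the temp file known-good, replace the original.
rename $tmp_name, $log or die "Rename failed: $!";

# Show the result.
open my $chk, "<", $log or die $!;
print while <$chk>;
close $chk;
```

If close() or any write fails, the original log is still untouched, which is exactly the property the slurp-and-overwrite approach lacks.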
I could essentially edit the file in place without having to create a temporary file.
Generally, this isn't possible. You can only overwrite the existing parts of the log, not splice them out of the file. The normal way to do what you want is to read your input records as shown by me and several of the other monks, process them, output them to a new file, and then delete the old one. You can't really remove from the beginning of a file while something is writing to the end.
Anyway, glad to help. :-)
---
$world=~s/war/peace/g
Re: Pimp My RegEx
by dragonchild (Archbishop) on May 31, 2005 at 18:35 UTC
You don't need to slurp. Either the line matches the regex /^$date_regex/ or it doesn't. (I'm assuming there's no chance the middle of a record will match that regex.)
perl -ne '$count++ if /^REGEX/; END { print "$count\n" }' FILENAME
And, yes, 55MB can blow out 3GB+ of RAM. You should expect a 10:1 to 20:1 increase in memory usage over the on-disk size once the data is slurped. This is Perl ...
- In general, if you think something isn't in Perl, try it out, because it usually is. :-)
- "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
I was unclear - you wouldn't need the whole regex anymore. The problem has been restated, so the new regex should be more restrictive. At the very minimum, it would not be capturing across $/.
- In general, if you think something isn't in Perl, try it out, because it usually is. :-)
- "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
Re: Pimp My RegEx
by ikegami (Patriarch) on May 31, 2005 at 18:55 UTC
The following uses a simpler regexp, and it doesn't slurp the file into memory. It's probably faster:
#!/usr/local/bin/perl
use warnings;
use strict;
my $record_start_regexp = qr/^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/;
sub do_something {
    print("[\n");
    print($_[0]);
    print("]\n");
}

{
    local $_;
    my $data;

    # Skip anything before the first record-start line.
    for (;;) {
        $_ = <DATA>;
        last if not defined $_;
        last if /$record_start_regexp/;
    }

    # Accumulate lines into $data; flush at each record start (or EOF).
    while (defined($_)) {
        $data .= $_;
        $_ = <DATA>;
        if (!defined($_) || /$record_start_regexp/) {
            do_something($data);
            $data = '';
        }
    }
}
Re: Pimp My RegEx
by Roy Johnson (Monsignor) on May 31, 2005 at 18:47 UTC
I get better performance with this:
my $parse_log_entry = qr/^$dateRegex/ms;
...
my $prevpos;
while ( $logfile =~ m/$parse_log_entry/g ) {
    $count++;    # count number of parsed records
    if (defined $prevpos) {
        # print "Do something with $prevpos .. $-[0]\n";
        # print substr($logfile, $prevpos, $-[0]-$prevpos), "\n";
    }
    $prevpos = $-[0];
}
You would use the substr expression above to access the record you're currently capturing, though you don't indicate that what you're capturing is actually used at all.
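One thing to watch with the pos-based approach: when the while loop exits, everything from the final match position to the end of the string is the last record, and the loop body never processes it. A self-contained sketch (the two-record sample log is made up) that grabs that tail after the loop:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up two-record log in the thread's date format.
my $dateRegex = qr/\d{4}-\d\d-\d\d \d\d:\d\d:\d\d:\d\d\d/;
my $logfile = "2005-05-31 18:43:00:001 first entry\n"
            . "  continuation of first\n"
            . "2005-05-31 18:44:00:002 second entry\n";

my $count = 0;
my $prevpos;
my @records;
while ($logfile =~ m/^$dateRegex/mg) {
    $count++;
    if (defined $prevpos) {
        # record runs from the previous match start to this match start
        push @records, substr($logfile, $prevpos, $-[0] - $prevpos);
    }
    $prevpos = $-[0];
}
# The loop never visits the final record, so take the tail here.
push @records, substr($logfile, $prevpos) if defined $prevpos;

print scalar(@records), "\n";   # 2 records
```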
Caution: Contents may have been coded under pressure.
You would use the substr expression above to access what you're currently capturing, though you don't indicate that what you're capturing is actually used at all.
I made a grossly simplified version of my script to share with the forum. Posting the script in its entirety would have just bored folks to tears! ...but rest assured, I am doing something with the captured text. I'm not sure your solution will work for log entries that span multiple lines.
Re: Pimp My RegEx
by tlm (Prior) on May 31, 2005 at 18:35 UTC
my $parse_log_entry = qr/^($dateRegex.*?)(?=(?:^$dateRegex|\z))/ms;
? If so, then I would process the file line by line, and count only those lines that begin with your date pattern.
Or, if you need to do more than just counting, then something along the lines of
my $current;
while ( <DATA> ) {
    if ( /^$dateRegex/ ) {
        process( $current ) if defined $current;
        $current = $_;
    }
    else {
        $current .= $_;
    }
}
process( $current ) if defined $current;
Update: Added the more extended second alternative.
Update 2: Fixed bug (last line of code was missing).
For the sake of brevity, I cut out my "payload" and just counted lines. I counted lines just to benchmark the regex times.
My rationale was something like this:
"find a line that starts with a date, capture all of the text (including new lines) until I can lookahead and find another date."
Yes, but as originally written your regex would miss the last log entry, because there is no following date line to satisfy the positive lookahead assertion.
Regarding the payload, please see the update to my original reply.
Update: Added further comments on the error in the original regex.
Re: Pimp My RegEx
by heathen (Acolyte) on May 31, 2005 at 21:11 UTC
Once again, thanks for the suggestions. My last benchmark showed 179,963 log file entries parsed in 7.17 seconds. Not bad! ...now time to go back and add the file I/O operations.
Re: Pimp My RegEx
by TedPride (Priest) on May 31, 2005 at 19:22 UTC
If disk space is so low that you can't fit a second copy of the file, you need more disk space. Barring that, you could try creating a temporary RAM disk, copying the file into it, reading the file line by line, then writing back an interpreted copy to the original location. Reading from a RAM disk is significantly faster than reading from the hard disk, and the file will only require its actual size in memory instead of its Perl size.
(Be sure to create a compressed backup copy of the file first, however - there's always a tiny chance your machine will crash or something will go wrong, and the copy of the file in the RAM disk will go poof.)
Re: Pimp My RegEx
by jcoxen (Deacon) on Jun 01, 2005 at 14:25 UTC
#!/usr/bin/perl
use warnings;
use strict;

my $fn = $ARGV[0];
open INFILE, "<", $fn
    or die "Could not open file $fn: $!";
open OUTFILE, ">", "$fn.filtered"
    or die "Could not open file $fn.filtered: $!";

while (<INFILE>) {
    my @line = split /\t/;
    $line[0] =~ s/(.{1,6}).*/$1/;    # truncate field 0 to 6 chars
    $line[3] =~ s/(.{1,6}).*/$1/;    # truncate field 3 to 6 chars
    $line[6] = $line[6] . "\n";
    $_ = join("\t", @line[0..6]);
    print OUTFILE;
}

close INFILE;
close OUTFILE;
BTW, this code processed an 8-million-line/385MB file in about 33 minutes the other day on my old, anemic Sun Ultra 5 workstation. No brownouts, no sweat. System load never got above 1.05.
Jack