korlaz has asked for the wisdom of the Perl Monks concerning the following question:
Hi all,
I recently came across a speed problem while reading big files (around 1 million lines). The whole process takes about 100 seconds, and I'd like to know if I can speed it up.
I'm using the code below where :
- mergeLogs is used to retrieve only the relevant lines (76 sec)
- filterLog is used to filter lines (23 sec)
- openLogFile returns a handle on the file (0.2 sec)
- The map/sort/map combo is used to sort the lines by time
Is there any way to speed up the process?
sub mergeLogs()
{
my $day = shift;
my @files = @_;
my @lines;
foreach my $file (@files)
{
my $fh = &openLogFile($file);
if (! defined $fh)
{
warn "$0: ignoring file $file\n";
next;
}
warn "-> processing $file\n" if $opts{'verbose'} > 0;
push @lines, map { &filterLog($_) } grep { /Running|Dump|FromCB|Update/o } <$fh>;
close $fh;
}
@lines = map { $_->[0] }
sort { $a->[1] cmp $b->[1] }
map { [ $_, /(\d{2}:\d{2}:\d{2}\.\d{3})/o ] } @lines;
return \@lines;
}
sub filterLog()
{
my $line = shift;
$line =~ s/ {2,}/ /g;
$line =~ s/^((?:\S+ ){3}).+?\[?I\]?:/$1/;
$line =~ s/ ?: / /g;
$line =~ s/ ACK (\w) / $1 ACK /;
return unless ! exists $opts{'day'} || /^$opts{'day'}/o;
return unless ! exists $opts{'user'} || /[\(\[]\s*(?:$opts{'user'})/o;
if (exists $opts{'start-time'} || exists $opts{'stop-time'})
{
if ($line =~ /(\d{2}:\d{2}:\d{2}\.\d{3})/o)
{
return unless ! exists $opts{'start-time'} || $opts{'start-time'} lt $1;
return unless ! exists $opts{'stop-time'} || $opts{'stop-time'} gt $1;
}
}
warn $line if $opts{'verbose'} > 3;
return $line;
}
Re: How to improve speed of reading big files
by Illuminatus (Curate) on Sep 17, 2009 at 15:44 UTC
Reading files a line at a time costs you something. Even though filehandle processing is buffered, you are still paying the price for re-filling the buffer at irregular intervals (based on when a line spans a buffer read). If you switch to IO::Handle and use the read method to suck in the entire file first, it will probably be faster. You will, of course, need enough RAM to hold the whole file while you process it. If this is still not fast enough, IO::Handle usually includes the setvbuf method, which allows you to increase the read buffer size. Increasing this to something like 256K may also improve this portion.
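A minimal sketch of that slurp-first approach (the sub name is made up and it is untested; whether it beats perl's own line buffering depends on the data, and setvbuf is only available if your perl was built with it):
sub slurpLogFile
{
    my $file = shift;
    open( my $fh, '<', $file ) or return;
    my $size = -s $fh;                # file size in bytes
    my $content;
    read( $fh, $content, $size );     # one big buffered read of the whole file
    close $fh;
    return split /^/m, $content;      # break back into lines for the filters
}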
Re: How to improve speed of reading big files
by BrowserUk (Patriarch) on Sep 17, 2009 at 17:55 UTC
There is one fairly obvious possibility (though because it is obvious doesn't mean it'll improve your performance much):
Remove one entire pass of the data by combining the map & grep:
push @lines, grep { /Running|Dump|FromCB|Update/o && filterLog($_) } <$fh>;
(And get rid of that & unless you know what it is doing!).
Beyond that, most of the possibilities lie inside your filter sub, but before dicking around with that, and in the absence of any easy way to test the changes, you are going to have to explain a couple of things.
- unless ! exists $opts{'day'} || /^$opts{'day'}/o;
That might be equivalent to if exists $opts{...} and $line !~ /^$opts{...}?
But as you are not specifying $line, it is implicitly using $_, which may be the same thing; but with your use of & on the call, and my vague memory that it does something special (which I never use, so cannot remember), I'm unsure quite what your double-negative "unless not ... or ..." is actually doing.
- Same goes for the two similar expressions further down:
return unless ! exists $opts{'start-time'} || $opts{'start-time'} lt $1;
return unless ! exists $opts{'stop-time'} || $opts{'stop-time'} gt $1;
I think those lines are equivalent to
return if exists $opts{'start-time'} and $opts{'start-time'} ge $1;
return if exists $opts{'stop-time'} and $opts{'stop-time'} le $1;
But without the ability to test it, I'm not 100% certain.
- And while I'm at it, why are you using empty prototypes on subs which take arguments?
sub mergeLogs() {
sub filterLog() {
Do you know what those empty parens mean? Or what they do?
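For illustration only (not the OP's code): an empty prototype promises the compiler that the sub takes no arguments, but the & call form bypasses prototype checking entirely, which is why the code still runs.
sub demo() { return 42 }   # () prototype: "this sub takes no arguments"
# demo(1);    # compile-time error: Too many arguments for main::demo
# &demo(1);   # & bypasses the prototype, so this is silently accepted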
Assuming I've understood those compound conditionals, you might gain a little by recoding your filter sub to use $_ exclusively (and yes, it will send the critics into apoplexy :):
sub filterLog() {
s/ {2,}/ /g;
s/^((?:\S+ ){3}).+?\[?I\]?:/$1/;
s/ ?: / /g;
s/ ACK (\w) / $1 ACK /;
return if exists $opts{'day'} and not /^$opts{'day'}/o;
return if exists $opts{'user'} and not /[\(\[]\s*(?:$opts{'user'})/o;
if (exists $opts{'start-time'} || exists $opts{'stop-time'}) {
if (/((\d{2}):(\d{2}):(\d{2})\.(\d{3}))/o) {
return if exists $opts{'start-time'} and $opts{'start-time'} lt $1;
return if exists $opts{'stop-time'} and $opts{'stop-time'} gt $1;
}
}
warn $_ if $opts{'verbose'} > 3;
return pack 'da*', ( $2*60 + $3 )*60 + "$4.$5", $_;
}
Update: Actually, if your original sort using cmp works, then there is no need to pack the date; you can just prepend it in its string form and the GRT would still work. You'd have to adjust the substr accordingly.
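For instance (a sketch, untested): if the filter returned "$1$_" with the raw HH:MM:SS.mmm text prepended, the sort stage would shrink to:
@lines = map { substr $_, 12 } sort @lines;   # 12 == length of 'HH:MM:SS.mmm'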
And look closely at that last (return) line and the date regex. I'm prepending the time (as a packed double) to the line. This would necessitate another change in the grep to assign the return value back to $_:
grep { /Running|Dump|FromCB|Update/o && ( $_ = filterLog($_) ) } <$fh>;
The benefit of this (assuming it works; this is all speculative--you'll have to test it), is that you can now avoid parsing the time twice, and the 'zillions of anonymous arrays' syndrome of the ST sort, and use a GRT instead:
@lines = map{
substr $_, 8
} sort @lines;
You don't say how much of the time is spent sorting, but that should reduce it.
My final speculation is that the bunch of modifications you make to the log lines at the start of the filter sub:
s/ {2,}/ /g;
s/^((?:\S+ ){3}).+?\[?I\]?:/$1/;
s/ ?: / /g;
s/ ACK (\w) / $1 ACK /;
in the absence of any actual data, don't appear to affect the filtration regexes. If that is true, and those manipulations are required for a later stage of the processing, then move them to the end of the sub so that you aren't making multiple passes of every line including all those you will subsequently reject.
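A minimal sketch of that reordering (same substitutions and option keys as above; the time-window tests are omitted for brevity, and it is untested):
sub filterLog {
    # cheap rejection tests first...
    return if exists $opts{'day'}  and not /^$opts{'day'}/o;
    return if exists $opts{'user'} and not /[\(\[]\s*(?:$opts{'user'})/o;
    # ...expensive rewrites only on the lines we are going to keep
    s/ {2,}/ /g;
    s/^((?:\S+ ){3}).+?\[?I\]?:/$1/;
    s/ ?: / /g;
    s/ ACK (\w) / $1 ACK /;
    return $_;
}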
Small suggestion: I would get rid of both the map{} and the grep{}. Both of these use anon lists, and at the end of the day some @anon_list gets pushed onto @lines. A simple loop that doesn't use these constructs should be faster.
while (<$fh>)
{
if (/Running|Dump|FromCB|Update/o)
{
push ( @lines, filterLog($_) );
}
}
Update:
push @lines, grep { /Running|Dump|FromCB|Update/o && filterLog($_) } <$fh>;
Is a good idea, but I don't think it works: @lines needs to get the output of filterLog(), and grep is just a simple "filter", so this would need to be a map{}; otherwise you always push the raw line from <$fh> whenever the grep condition is true.
Agreed on both counts. The grep would have to become map not vice versa, and doing away with both is better yet.
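E.g. (a sketch, using the OP's original filterLog, which returns either the filtered line or an empty list):
push @lines, map { /Running|Dump|FromCB|Update/o ? filterLog($_) : () } <$fh>;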
At that point, I'd probably unwrap the filter sub back into the while loop so the regexes can operate directly on $_ "legitimately".
And at that point, you start to wonder how many times each set of files is reprocessed with different sets of user/date/start-time/stop-times?
Thanks for your reply, it helped a lot.
I gained 10 sec on the filterLog function and 10 more on the mergeLogs (without counting the filterLog part).
Here is the final version of the code:
sub mergeLogs
{
my ($day, @files) = @_;
my @lines;
foreach my $file (@files)
{
my $fh = openLogFile($file);
if (! defined $fh)
{
warn "$0: ignoring file $file\n";
next;
}
warn "-> processing $file\n" if $opts{'verbose'} > 0;
push @lines, grep { /Running|Dump|FromCB|Update/o && &filterLog } <$fh>;
close $fh;
}
use Benchmark 'cmpthese';
cmpthese( -1,
{
'ST' => sub { ( sortLinesST(\@lines) )[ 0 ] },
'GRT' => sub { ( sortLinesGRT(\@lines) )[ 0 ] },
'GRT2' => sub { ( sortLinesGRT2(\@lines))[ 0 ] },
'GRT3' => sub { ( sortLinesGRT3(\@lines) )[ 0 ] },
}
);
exit;
}
sub filterLog
{
return 0 if exists $opts{'day'} && ! /^$opts{'day'}/o;
if (! exists $opts{'server'})
{
if (/\* Running on (\w+) -/)
{
$opts{'server'} = lc $1;
return 0;
}
}
if (exists $opts{'start-time'} || exists $opts{'stop-time'})
{
if (/(\d{2}:\d{2}:\d{2}\.\d{3})/o)
{
return 0 if exists $opts{'start-time'} && $1 lt $opts{'start-time'};
return 0 if exists $opts{'stop-time'} && $1 gt $opts{'stop-time'};
}
}
return 0 if exists $opts{'user'} && ! /[\(\[]\s*(?:$opts{'user'})/o;
s/ {2,}/ /go;
s/ ?: / /go;
s/^((?:\S+ ){3}).+?\[?I\]?:/$1/o;
s/ ACK (\w) / $1 ACK /o;
warn $_ if $opts{'verbose'} > 3;
return 1;
}
sub sortLinesST
{
my $href = shift;
return [ map { $_->[0] } sort { $a->[1] cmp $b->[1] } map { [ $_, /(\d{2}:\d{2}:\d{2}\.\d{3})/o ] } @$href ];
}
sub sortLinesGRT
{
my $href = shift;
return [ map { substr($_, 12) } sort map { /(\d{2}:\d{2}:\d{2}\.\d{3})/o; $1 . $_ } @$href ];
}
sub sortLinesGRT2
{
my $href = shift;
return [ map { substr($_, 4) } sort map { /(\d{2}):(\d{2}):(\d{2})\.(\d{3})/o; pack ('NA*', ( $1*60 + $2 )*60 + "$3.$4" ) . $_ } @$href ];
}
sub sortLinesGRT3
{
my $href = shift;
return [ @$href[ map { unpack "N", substr($_, -4) } sort map { $href->[$_] =~ /(\d{2}:\d{2}:\d{2}\.\d{3})/o; $1 . pack ("N", $_) } 0..$#$href ] ];
}
I ran some benchmarks to find the appropriate sort technique, here are the results:
Rate GRT2 ST GRT3 GRT
GRT2 30.2/s -- -19% -42% -45%
ST 37.1/s 23% -- -28% -32%
GRT3 51.9/s 72% 40% -- -5%
GRT 54.6/s 81% 47% 5% --
I would have thought that the GRT3 sub, with its index sorting, would be the fastest. But apparently not.
However, I do not understand why the GRT2 sub is slower than the ST sub?
I would have thought that the GRT3 sub, with its index sorting, would be the fastest. But apparently not.
There is a famous quote (that I cannot actually remember :) which says something to the effect that programmers' intuition about where the bottlenecks occur in their code is usually wrong. I've found that this is never more true than with Perl.
As sort is essentially only shuffling pointers and aliases around, whether those point to short string or long ones makes little difference.
When the full record is appended to the sort key and the first N chars are identical (as they often will be if all your records start with the date 'yyyymmdd' and you are selecting by date), each comparison has to scan a bit further through the string before deciding the relationship; with a unique binary index appended to the key it goes at most 4 bytes beyond it, which makes the index sort a little faster. However, you then have to do a full array-slicing operation to reorder the records, which is comparatively more expensive than scanning a few extra bytes on each comparison.
However, I do not understand why the GRT2 sub is slower than the ST sub?
It's all down to the number of Perl primitives, I guess. One capture versus four in the regex, plus the math, is enough to tip the balance in favour of the ST for lowish numbers of records (~6000 in my quick runs of your benchmark).
However, your GRT2 implementation is flawed in several ways:
pack ('NA*', ( $1*60 + $2 )*60 + "$3.$4" ) . $_
- You're concatenating the full record to the end of the pack'd key, so the 'A*' part of the template is redundant.
That should either be pack ('NA*', ( $1*60 + $2 )*60 + "$3.$4", $_ )
or pack ('N', ( $1*60 + $2 )*60 + "$3.$4" ) . $_
- More importantly, by using 'N' rather than 'd' as I did, you are effectively truncating the seconds to an integer, which means times that are within the same second will not be ordered correctly.
However, I understand why you probably did that. The IEEE 754 format is in-part specifically designed to allow the binary representation to sort correctly. However, that does not take into account that the Intel native representation byte-swaps it. So, in order to sort packed binary doubles correctly, you have to undo the byte-swapping.
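A tiny illustration of that point (this assumes a little-endian machine and non-negative values, which is what makes the reversed IEEE 754 bytes sort correctly as strings):
my @nums   = ( 3.5, 1.25, 2.75 );
my @sorted = map  { unpack 'd', scalar reverse $_ }
             sort
             map  { scalar reverse pack 'd', $_ } @nums;
# @sorted is now ( 1.25, 2.75, 3.5 )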
- And that also highlights a second stupidity of mine that you've carried through.
Why capture the seconds and milliseconds as separate bits and then stick them back together, when you can capture them as a single piece!
GRT2 is better coded as:
sub sortLinesGRT2
{
my $href = shift;
return [
map {
unpack 'x[d]a*', $_
} sort map {
/(\d{2}):(\d{2}):(\d{2}\.\d{3})/o;
scalar( reverse pack ('d', ( $1*60 + $2 )*60 + $3 ) ) . $_
} @$href
];
}
This corrects the errors and sorts correctly. Surprisingly, it also runs a bit quicker than your original, overtaking the ST at around 1500 records rather than 6000. But it is still way slower than the straightforward GRT implementation:
C:\test>for /l %i in (1,1,5) do @795905 -L=1e%i
Rate GRT2 GRT3 ST GRT
GRT2 15665/s -- -26% -26% -45%
GRT3 21060/s 34% -- 0% -26%
ST 21060/s 34% 0% -- -26%
GRT 28550/s 82% 36% 36% --
Rate GRT2 ST GRT3 GRT
GRT2 1687/s -- -14% -29% -48%
ST 1969/s 17% -- -18% -39%
GRT3 2390/s 42% 21% -- -26%
GRT 3251/s 93% 65% 36% --
Rate GRT2 ST GRT3 GRT
GRT2 166/s -- -1% -28% -47%
ST 169/s 2% -- -27% -46%
GRT3 232/s 40% 38% -- -26%
GRT 312/s 88% 85% 34% --
Rate ST GRT2 GRT3 GRT
ST 13.4/s -- -16% -39% -53%
GRT2 16.0/s 20% -- -27% -44%
GRT3 22.0/s 64% 37% -- -23%
GRT 28.6/s 113% 78% 30% --
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
Rate ST GRT2 GRT3 GRT
ST 0.986/s -- -28% -45% -56%
GRT2 1.38/s 40% -- -23% -39%
GRT3 1.78/s 80% 29% -- -21%
GRT 2.26/s 129% 64% 27% --
Update: Finally, you are currently parsing the time twice: once in the filter sub and again for the GRT. If you prepended the sort key within the filter sub, you could avoid another complete pass of records.
If you have the time and inclination, try this and see how you fare. You might have to make a few tweaks; it compiles clean but is otherwise untested beyond a mental run-through, and my mind is notoriously unreliable :)
sub mergeLogs {
my ($day, @files) = @_;
my @lines;
foreach my $file (@files) {
my $fh = openLogFile($file);
warn( "$0: ignoring file $file\n" ), next unless defined $fh;
warn "-> processing $file\n" if $opts{'verbose'} > 0;
while( <$fh> ) {
next unless /Running|Dump|FromCB|Update/o;
next if exists $opts{'day'} && ! /^$opts{'day'}/o;
next if exists $opts{'user'} && ! /[\(\[]\s*(?:$opts{'user'})/o;
$opts{'server'} = lc $1, next
if !exists $opts{'server'} && /\* Running on (\w+) -/;
my ($time) = /(\d{2}:\d{2}:\d{2}\.\d{3})/o;  ## list context, so $time gets the capture
next if exists $opts{'start-time'} && $time lt $opts{'start-time'};
next if exists $opts{'stop-time'} && $time gt $opts{'stop-time'};
s/ {2,}/ /go;
s/ ?: / /go;
s/^((?:\S+ ){3}).+?\[?I\]?:/$1/o;
s/ ACK (\w) / $1 ACK /o;
warn $_ if $opts{'verbose'} > 3;
## prepend the time key now save reparsing later
push @lines, $time . $_;
}
close $fh;
}
## Sort in-place to save memory shuffling
## (if it decides to play ball today)
@lines = sort @lines;
substr $_, 0, 12, '' for @lines; ## Trim the prepended keys
return \@lines;
}
Re: How to improve speed of reading big files
by bv (Friar) on Sep 17, 2009 at 16:12 UTC
You seem to be using pretty structured data (maybe the wrong word here, sorry), yet only one of your regexps is anchored. If you know the position of certain strings (especially /Running|Dump|FromCB|Update/o), you could gain some speedup by anchoring, even if you have to resort to assigning to pos.
Another thing would be to precompile or use the /o modifier on your substitutions, since &filterLog gets called so often.
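A sketch of precompiling the match patterns with qr// (variable names are made up, not from the OP's code):
my $keep_re = qr/Running|Dump|FromCB|Update/;
my $user_re = exists $opts{'user'} ? qr/[\(\[]\s*(?:$opts{'user'})/ : undef;
while ( my $line = <$fh> )
{
    next unless $line =~ $keep_re;
    next if defined $user_re and $line !~ $user_re;
    # ... the substitutions and the rest of the filtering ...
}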
Could you explain the "pos" trick to me?
| [reply] |
From the pos documentation:
pos directly accesses the location used by the regexp engine to store the offset, so assigning to pos will change that offset, and so will also influence the \G zero-width assertion in regular expressions.
So if we know that we have fixed-width data and we only want to match "foo", "ding", or "widget" starting at the 10th character of the string, we can assign to pos($string) to set the "last matched position" to something other than 0, and then use the \G anchor (technically a zero-width assertion, like ^ and $) like so:
while(<DATA>)
{
pos() = 10;
print if /\G(?:foo|ding|widget)/o;
}
__DATA__
3.14 PI foo
6.28 2PI ding
42 LUE answer
7 LUK superstition
87.32 UNK widget
This code prints lines 1, 2, and 5
Re: How to improve speed of reading big files
by graff (Chancellor) on Sep 18, 2009 at 04:03 UTC
First off, figure out how fast you could possibly go, given the amount of data you have:
sub checkReadTime # call this with just your list of files
{
my $linecount = 0;
my $starttime = time;
for my $file ( @_ ) {
my $fh = &openLogFile($file) or next;
while ( <$fh> ) {
$linecount++;
}
close $fh;
}
my $endtime = time;
warn sprintf( "read %d lines from %d files in %d sec\n",
$linecount, scalar @_, $endtime - $starttime );
}
The difference between the duration reported there and the duration of your current app is the upper bound on how much better you might be able to do.
Apart from that, if you have a serious problem with how long it's taking, maybe you should be doing more with the standard compiled unix tools that do things like sorting. For instance, you could put your filtering step into a separate script, pipe its output to "sort", and pipe the output from sort into whatever script is doing the rest of the work on the sorted lines.
If your app isn't geared to a pipeline operation like this:
filterLogFiles file.list | sort | your_main_app
then just put the first part of that into an open statement inside your main app:
open( my $logsort, "-|", "filterLogFiles @files | sort" ) or die $!;
while ( <$logsort> ) {
...
}
Of course, you can include option args for your filterLogFiles process so that it can skip the lines that you don't need depending on dates or whatever, and have it output the data in a form that will be easiest to sort (and easy for your main app to digest).
I can second the idea of using standard Unix tools. In particular, awk(1) can deliver huge performance improvements over Perl. We had a text-file processing app which went from 20 minutes to 20 seconds when we recoded it in awk. YMMV of course.
This is unbelievable! Most likely your awk guys didn't know how to write Perl very well. They coded awk more efficiently because that is what they knew how to do.
I like graff's benchmark idea. I'd like to point out something that will hopefully happen: if you run the benchmark a second time, it may speed up!
I don't know what OS the OP has or how big total file size is, but there is a level of caching that the OS does that Perl doesn't even see. As an example, I have one app that works with a dataset of 700 text files. The user can use a search window to find various things in this dataset. The first search takes about 7 seconds. The second search is so fast that it is barely perceptible to the user (>10x performance). The reason why things get faster is that my machine has lots of memory that the OS can use for file cache. The OS knows that these files haven't been modified and there is very little disk access going on. I am running on WinXP which isn't exactly famous for performance or "smarts".
The things that can affect I/O performance are legion, and I have very little information to go on in the OP's case. Anyway, in this particular app I quit optimizing, because by user query #8 they have already forgotten that query #1 took a lot longer!
Re: How to improve speed of reading big files
by leocharre (Priest) on Sep 17, 2009 at 16:38 UTC
- Hardware?
What is the data stored on? A local hard drive? Somewhere on the network?
If it's a local hard drive, look up the make and model. Look up the specs. What speed does it read/write at?
Maybe upgrade that hard drive to something better.
What about the processor? What speed is it? Is it 32-bit or 64-bit? Etc.
- Profiling?
How about checking # top while it runs?
How about profiling the script?
- Changing the code around a bit?
What if you first read the entire contents of the file into an array, in one step? At that point the data will be in memory; then process the stuff. It may have an impact on the speed. Who knows, it's worth a try.
( Illuminatus suggested this )
My semi-educated guess is that anything short of a hardware change won't shave off more than 20 secs.
- Hardware: local drive SCSI (can't change it)
- Profiling: the timings are written in parentheses (in the original post)
- Changing the code around a bit: the file is already read as an array (grep function on the handle):
grep { /Running|Dump|FromCB|Update/o } <$fh>
Are you suggesting that the timings in your original post (in seconds) are your profiling output? Are you sure you have read Devel::DProf?
Do you understand that you can change your code around to get really nice detail about what's taking time, etc.?
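For example (the script name is made up):
perl -d:DProf mergelogs.pl logfile1 logfile2   # writes tmon.out
dprofpp                                        # prints per-subroutine timings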
Re: How to improve speed of reading big files
by salva (Canon) on Sep 18, 2009 at 01:16 UTC
If the bottleneck is on the sorting step, try using Sort::Key:
use Sort::Key qw(nkeysort_inplace);
nkeysort_inplace { /(\d{2}):(\d{2}):(\d{2})\.(\d{3})/ and "$1$2$3.$4" } @lines;
Re: How to improve speed of reading big files
by kikuchiyo (Hermit) on Sep 18, 2009 at 21:39 UTC
I had a similar problem earlier where I had to speed up the text file parsing part of my application.
Without really going through your code, here are a few tips that may or may not come in handy.
- I recommend Devel::NYTProf for profiling. It gives you a very nice color-coded, line-by-line report on your code that you can use to figure out exactly where the bottlenecks are (see the example after this list).
- I found that substr is the fastest way to parse data out of lines that consist of fixed width fields only.
- tr/// is faster than s///. (I don't know whether this is applicable here, though.) Anyway, the fewer regexps, the better.
- In the inner conditional of your filterLog function, do you really need to test the existence of your $opts{} variables?
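Typical NYTProf usage, as promised in the first point (the script name is made up):
perl -d:NYTProf mergelogs.pl logfile1 logfile2
nytprofhtml    # writes an HTML report under ./nytprof/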