Matts has asked for the
wisdom of the Perl Monks concerning the following question:
This is a hardcore performance question, hope you like it.
I'm the author of XML::SAX::PurePerl, a pure perl XML parser. It's close enough to complete that I've decided lately to examine it's performance issues. It's a very slow parser, especially compared to C based parsers (expat, libxml2), mostly because Perl itself is very slow at this task. So I've decided to post a breakdown of where I'm at, what I've done and hopefully get some ideas for further improvements.
First of all, this parser is character based. This means it parses based on seeing a single character in a stream at a time, and backtracks when it needs to. This makes things slow in Perl, because ironically while Perl deals with strings very quickly, it sucks at dealing with characters. I'm not going to change this architectural decision because it would require a complete re-write of the parser (which is already 3000 lines of code in total).
Anyway, down to the nitty gritty. I'm going to simply post what dprofpp tells me are the most time consuming functions, and then post those functions so you can take a look.
dprofpp output from dprofpp -O 6:
Total Elapsed Time = 9.712727 Seconds
User+System Time = 9.672727 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
26.2 2.538 4.018 86883 0.0000 0.0000 XML::SAX::PurePerl::Reade
+r::nextchar
24.5 2.379 4.518 60628 0.0000 0.0001 XML::SAX::PurePerl::Reade
+r::match_not
23.3 2.259 1.999 86884 0.0000 0.0000 XML::SAX::PurePerl::Reade
+r::Stream::next
11.3 1.099 5.442 4826 0.0002 0.0011 XML::SAX::PurePerl::Reade
+r::consume_not
9.72 0.940 1.415 53003 0.0000 0.0000 XML::SAX::PurePerl::Reade
+r::match_char
9.60 0.929 1.438 19261 0.0000 0.0001 XML::SAX::PurePerl::Reade
+r::match_re
Note that the percentages are a little confusing, since some of those call each other, so there's probably some overlap - I don't know how to get dprofpp to make that a non-issue (i.e. only time spent *actually* in the subroutine, not time spent calling other subs. I don't even think its possible to calculate that).
Anyway, here's the code for those six. Go for your life - shoot me if I've done something stupid. Don't post without running Benchmark.pm on things though - I've already spent a few days on this!
# get the next utf-8 character from the input stream
# and encode as UTF8 if possible/necessary
sub nextchar {
my $self = shift;
$self->next;
return unless defined $self->[CURRENT];
if ($self->[CURRENT] eq "\x0D") {
$self->next;
return unless defined($self->[CURRENT]);
if ($self->[CURRENT] ne "\x0A") {
$self->buffer("\x0A");
}
}
return unless $self->[ENCODING];
my $n = ord($self->[CURRENT]);
# warn(sprintf("ch: 0x%x ($self->[CURRENT])\n", $n));
if (($] < 5.007002) && ($n > 0x7F)) {
# utf8 surrogate
my $current = $self->[CURRENT];
if ($n >= 0xFC) {
# read 5 chars
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
}
elsif ($n >= 0xF8) {
# read 4 chars
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
}
elsif ($n >= 0xF0) {
# read 3 chars
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
}
elsif ($n >= 0xE0) {
# read 2 chars
$self->next; $current .= $self->[CURRENT];
$self->next; $current .= $self->[CURRENT];
}
elsif ($n >= 0xC0) {
# read 1 char
$self->next; $current .= $self->[CURRENT];
}
else {
throw XML::SAX::Exception::Parse(
Message => sprintf("Invalid character 0x%x", $n),
ColumnNumber => $self->column,
LineNumber => $self->line,
PublicId => $self->public_id,
SystemId => $self->system_id,
);
}
if ($] >= 5.006001) {
$self->[CURRENT] = pack("U0A*", $current);
}
else {
$self->[CURRENT] = $current;
}
}
}
######################################################
# match anything *not* in the list of characters given
sub match_not {
my $self = shift;
my $current = $self->[CURRENT];
return 0 unless defined $current;
for my $m (@_) {
if ($current eq $m) {
$self->[MATCHED] = '';
return 0;
}
}
$self->[MATCHED] = $current;
$self->nextchar;
return 1;
}
######################################################
# get the next byte or character (character on 5.7.2+
# or byte on 5.5 or 5.6) from a fh, or from the buffer
# if one exists.
sub next {
my $self = shift;
# check for chars in buffer first.
if (length($self->[BUFFER])) {
return $self->[CURRENT] = substr($self->[BUFFER], 0, 1, ''); #
+ last param truncates buffer
}
if (length($self->[INTERNAL_BUFFER])) {
BUFFERED_READ:
$self->[CURRENT] = substr($self->[INTERNAL_BUFFER], 0, 1, '');
if ($self->[CURRENT] eq "\x0A") {
$self->[LINE]++;
$self->[COLUMN] = 1;
}
else { $self->[COLUMN]++ }
return;
}
my $bytesread = read($self->[FH], $self->[INTERNAL_BUFFER], $self-
+>[BUFFER_SIZE]);
if ($bytesread) {
# yes, goto. If you have a faster way feel free!
goto BUFFERED_READ;
}
elsif (defined($bytesread)) {
$self->[EOF]++;
return $self->[CURRENT] = undef;
}
throw XML::SAX::Exception::Parse(
Message => "Error reading from filehandle: $!",
);
}
##########################################################
# consume everything not matching chars listed in @_
sub consume_not {
my $self = shift;
my $consumed = '';
while(!$self->[EOF] && $self->match_not(@_)) {
$consumed .= $self->[MATCHED];
}
return length($self->[CONSUMED] = $consumed);
}
########################################################
# Does the current token match a single character?
sub match_char {
my $self = shift;
if (defined($self->[CURRENT]) && $self->[CURRENT] eq $_[0]) {
$self->[MATCHED] = $_[0];
$self->nextchar;
return 1;
}
$self->[MATCHED] = '';
return 0;
}
#########################################################
# Does the current token match a single regexp?
sub match_re {
my $self = shift;
if ($self->[CURRENT] =~ $_[0]) {
$self->[MATCHED] = $self->[CURRENT];
$self->nextchar;
return 1;
}
$self->[MATCHED] = '';
return 0;
}
The original code is on CPAN, but not with these performance tweaks. If you feel you need the live code (i.e. the code you see above along with the rest of it) to look at it's available from anon-cvs via: cvs -z3 -d :pserver:anonymous@axkit.org:/home/cvs login/co XML-SAX
Edit Petruchio Wed Feb 6 08:16:30 UTC 2002 - Added READMORE Tag
Re (tilly) 1: XML::SAX::PurePerl Performance by tilly (Archbishop) on Feb 05, 2002 at 12:40 UTC |
If you are aiming for performance, my first question would
be whether you need next and nextchar to be overridable.
Write them as function calls and you will get a
significant speedup. Since the common path when you call
next should be that the next character is in the buffer,
you can get a further speedup by moving the buffer check
into nextchar and skipping the call entirely. Adding this
logic should be a speedup even if the logic is left
alone in the original function because of the relative cost
of an if and a function call.
As for the internals of the next method, I haven't run
benchmark, but the basic approaches I would consider are
the substr approach, splitting the string into an
internal array you shift off, and using a /(.)/gs pattern
match. (Warning on the last. Use of $' etc anywhere will
slow that down. This may be a good reason to avoid no
matter what benchmarking says.) I assume you have tried
all of them? (Probably but it doesn't hurt to check.)
An incidental note. Your goto can be removed from next
with perceivable performance change. Make the if condition
be empty, put an else around throwing the exception, and
then move the BUFFERED_READ section after the decision
logic. This should also be marginally faster because Perl
doesn't have to spent time figuring out where the goto
goes. (That shouldn't be the common path though, so the
change should be marginal.)
I am betting that 5.005 performance is not a priority of
yours. But if you are allowing the logic to skip going to
next anyways, you can replace a lot of the 5.005 logic with
something like the following (untested):
my $len = $n >= 0xFC ? 5
: $n >= 0xF8 ? 4
: $n >= 0xF0 ? 3
: $n >= 0xE0 ? 2
: $n >= 0xC0 ? 1
: throw XML::SAX::Exception::Parse(
Message => sprintf("Invalid character 0x%x", $n),
ColumnNumber => $self->column,
LineNumber => $self->line,
PublicId => $self->public_id,
SystemId => $self->system_id,
);
if ($len <= length($self->[BUFFER])) {
$current .= substr($self->[BUFFER], 0, $len, '');
$self->[CURRENT] = substr($self->[BUFFER], -1);
}
else {
$len -= length($self->[BUFFER]);
$current .= $self->[BUFFER];
$self->[BUFFER] = '';
while (-1 < --$len) {
next($self);
$current .= $self->[CURRENT];
}
}
Again, it is more complex, but the fact you avoid a whole
series of function calls should be a significant speedup. | [reply] [d/l] |
|
If you are aiming for performance, my first question would be whether you need next and nextchar to be overridable. Write them as function calls and you will get a significant speedup.
Actually that's more of a myth than a truism. I did try it, but the speedup wasn't significant, though it varies from perl to perl.
However I think there may be some value in the buffer check being moved to nextchar. I was originally thinking it needed to be in next() because sometimes next() is called on its own (for the encoding detection routines which need a byte-by-byte view), but that read()s in character by character anyway, so it might be a reasonable optimisation.
You're right, I have tried all of the various "give me a character" methods, and substr() comes out on top. Which is a pain in the ass really - it's one point where Perl loses out to python where you can do string[0] to get the first character, just like you can in C.
I'll leave the goto as is for now. It's not as bad as people make out - it's only bad when it's used for all flow control, and I think it's intention is quite clear here. Plus given a 1024 buffer, it's only part of the path 70 times in the parse of this 70K XML file.
I do like that last optimisation though - that coupled with moving the buffer test into nextchar() might make a big difference (though maybe not as I don't think the particular test file in question has any UTF-8 characters in it). I'll try it and come back and let you know.
| [reply] |
|
YMMV, but when I tested it on my machine the popular myth was correct, a method call was massively slower than a function call. (Of course if you do anything interesting in your function...)
Speaking of functions, I am kind of wondering what the purpose of next is. It seems from the code I see in it that it was intended to keep track of things like line numbers. But I don't see the rest of the code that would be needed to do that. (Sign of a change in design?) If that is the case, then what next is really providing is buffering.
But isn't buffering exactly what read is supposed to do for you? OK, its speed is highly platform (and compilation option) dependent. But it seems to me that either you are better off using read, or else next should remove the additional buffering by using sysread.
| [reply] |
|
|
Re: XML::SAX::PurePerl Performance by dash2 (Hermit) on Feb 05, 2002 at 13:00 UTC |
As a sidenote, the -E option for dprofpp (which is on by default) excludes child subroutines from the figures for each subroutine. So I don't think you need to worry about that, unless maybe you have an old dprofpp.
dave hj~ | [reply] |
|
I thought so too, until I added up the percentages.
| [reply] |
Re: XML::SAX::PurePerl Performance by theorbtwo (Prior) on Feb 05, 2002 at 13:11 UTC |
Hola, Matts, all.
This probably isn't as big an improvement as tilly's above, but you can avoid a branch if you check $] at compiletime using a BEGIN block and play a little symref magic to have two different nextchar routines.
Also, though using lexical variables is generaly a win because of maintainablity, if you're trying to get the last ounce of performance, try declaring your 'my' variables at the widest scope you can (file scope), and seeing how much that helps. IIRC, creating a lexical is expensive.
Try running a line-level profiler, or even running a profiler on perl instead of your program.
TACCTGTTTGAGTGTAACAATCATTCGCTCGGTGTATCCATCTTTG
ACACAATGAATCTTTGACTCGAACAATCGTTCGGTCGCTCCGACGC
| [reply] |
Re: XML::SAX::PurePerl Performance by Zaxo (Archbishop) on Feb 05, 2002 at 16:39 UTC |
For comparison, here are the pre-optimization figures for perl-5.6.1, XML::SAX::PurePerl-0.08 from CPAN:
$ perl -d:DProf t/16large.t
1..3
ok 1
ok 2
parsed 80085 bytes in 13 seconds
ok 3
$ dprofpp -O 6
Total Elapsed Time = 11.05672 Seconds
User+System Time = 10.67672 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
36.9 3.940 3.892 115004 0.0000 0.0000 XML::SAX::PurePerl::Reade
+r::match_nonext
14.8 1.590 1.547 86892 0.0000 0.0000 XML::SAX::PurePerl::Reade
+r::Stream::next
10.8 1.160 8.197 114995 0.0000 0.0001 XML::SAX::PurePerl::Reade
+r::match
9.55 1.020 7.152 115004 0.0000 0.0001 XML::SAX::PurePerl::Reade
+r::match_nocheck
9.27 0.990 2.450 86891 0.0000 0.0000 XML::SAX::PurePerl::Reade
+r::nextchar
9.05 0.966 8.389 17607 0.0001 0.0005 XML::SAX::PurePerl::Reade
+r::consume
Same machine, same perl, cvs version of fifteen minutes ago: $ perl -d:DProf t/16large.t
1..3
ok 1
ok 2
parsed 80085 bytes in 16 seconds
ok 3
$ dprofpp -O 6
Total Elapsed Time = 13.08672 Seconds
User+System Time = 11.91672 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
37.5 4.480 4.442 115004 0.0000 0.0000 XML::SAX::PurePerl::Reade
+r::match_nonext
13.2 1.580 1.537 86892 0.0000 0.0000 XML::SAX::PurePerl::Reade
+r::Stream::next
10.6 1.270 9.037 114995 0.0000 0.0001 XML::SAX::PurePerl::Reade
+r::match
10.2 1.220 7.882 115004 0.0000 0.0001 XML::SAX::PurePerl::Reade
+r::match_nocheck
9.02 1.075 9.239 17607 0.0001 0.0005 XML::SAX::PurePerl::Reade
+r::consume
8.48 1.010 2.460 86891 0.0000 0.0000 XML::SAX::PurePerl::Reade
+r::nextchar
It looks like a pessimization for for this machine. Both build directories were on the same spindle and partition.
After Compline, Zaxo
| [reply] [d/l] [select] |
Re: XML::SAX::PurePerl Performance by chromatic (Archbishop) on Feb 05, 2002 at 21:26 UTC |
Looking at all those next() calls makes me wonder if you should add an optional $length parameter to get the next several bytes/characters. Collapsing three to five method dispatches and their concatenations seems like a potential improvement.
I also wonder if it would help to emulate walking through a string with a char pointer as in C -- instead of substr, keep around a variable holding the position in the buffer.
If you think either of these could help, I'd be happy to grab the CVS and play around a bit. | [reply] |
Re: XML::SAX::PurePerl Performance by Juerd (Abbot) on Feb 08, 2002 at 21:22 UTC |
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;
print "== Single subroutine argument ==\n";
sub b1_shift { my $x = shift; $x . ''; }
sub b1_list { my ($x) = @_; $x . ''; }
sub b1_direct { $_[0] . ''; }
cmpthese( -5, {
'shift' => sub { b1_shift ('X' x rand 5000) },
'list assignment' => sub { b1_list ('X' x rand 5000) },
'direct' => sub { b1_direct('X' x rand 5000) }
});
print "== Two subroutine arguments ==\n";
sub b2_shift { my $x = shift; my $y = shift; $x . $y; }
sub b2_list { my ($x, $y) = @_; $x . $y; }
sub b2_direct { $_[0] . $_[1]; }
cmpthese( -5, {
'shift' => sub { b2_shift ('X' x rand 5000, 'Y' x rand 5
+000) },
'list assignment' => sub { b2_list ('X' x rand 5000, 'Y' x rand 5
+000) },
'direct' => sub { b2_direct('X' x rand 5000, 'Y' x rand 5
+000) }
});
__END__
== Single subroutine argument ==
Benchmark: running direct, list assignment, shift, each for at least 5
+ CPU seconds...
direct: 5 wallclock secs ( 5.10 usr + 0.02 sys = 5.12 CPU) @ 11
+0247.85/s (n=564469)
list assignment: 5 wallclock secs ( 5.14 usr + 0.03 sys = 5.17 CPU)
+ @ 89861.90/s (n=464586)
shift: 6 wallclock secs ( 5.72 usr + 0.03 sys = 5.75 CPU) @ 70
+120.00/s (n=403190)
Rate shift list assignment dire
+ct
shift 70120/s -- -22% -3
+6%
list assignment 89862/s 28% -- -1
+8%
direct 110248/s 57% 23%
+--
== Two subroutine arguments ==
Benchmark: running direct, list assignment, shift, each for at least 5
+ CPU seconds...
direct: 5 wallclock secs ( 5.46 usr + 0.00 sys = 5.46 CPU) @ 57
+434.25/s (n=313591)
list assignment: 6 wallclock secs ( 5.30 usr + 0.02 sys = 5.32 CPU)
+ @ 46535.53/s (n=247569)
shift: 6 wallclock secs ( 5.00 usr + 0.01 sys = 5.01 CPU) @ 42
+526.95/s (n=213060)
Rate shift list assignment direc
+t
shift 42527/s -- -9% -26
+%
list assignment 46536/s 9% -- -19
+%
direct 57434/s 35% 23% -
+-
2;0 juerd@ouranos:~$ perl -e'undef christmas'
Segmentation fault
2;139 juerd@ouranos:~$
| [reply] [d/l] [select] |
|
|