Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

XML::SAX::PurePerl Performance

by Matts (Deacon)
on Feb 05, 2002 at 11:47 UTC ( #143416=perlquestion: print w/ replies, xml ) Need Help??
Matts has asked for the wisdom of the Perl Monks concerning the following question:

This is a hardcore performance question, hope you like it.

I'm the author of XML::SAX::PurePerl, a pure perl XML parser. It's close enough to complete that I've decided lately to examine it's performance issues. It's a very slow parser, especially compared to C based parsers (expat, libxml2), mostly because Perl itself is very slow at this task. So I've decided to post a breakdown of where I'm at, what I've done and hopefully get some ideas for further improvements.

First of all, this parser is character based. This means it parses based on seeing a single character in a stream at a time, and backtracks when it needs to. This makes things slow in Perl, because ironically while Perl deals with strings very quickly, it sucks at dealing with characters. I'm not going to change this architectural decision because it would require a complete re-write of the parser (which is already 3000 lines of code in total).

Anyway, down to the nitty gritty. I'm going to simply post what dprofpp tells me are the most time consuming functions, and then post those functions so you can take a look.

dprofpp output from dprofpp -O 6:

Total Elapsed Time = 9.712727 Seconds User+System Time = 9.672727 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call Csec/c Name 26.2 2.538 4.018 86883 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::nextchar 24.5 2.379 4.518 60628 0.0000 0.0001 XML::SAX::PurePerl::Reade +r::match_not 23.3 2.259 1.999 86884 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::Stream::next 11.3 1.099 5.442 4826 0.0002 0.0011 XML::SAX::PurePerl::Reade +r::consume_not 9.72 0.940 1.415 53003 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::match_char 9.60 0.929 1.438 19261 0.0000 0.0001 XML::SAX::PurePerl::Reade +r::match_re
Note that the percentages are a little confusing, since some of those call each other, so there's probably some overlap - I don't know how to get dprofpp to make that a non-issue (i.e. only time spent *actually* in the subroutine, not time spent calling other subs. I don't even think its possible to calculate that).

Anyway, here's the code for those six. Go for your life - shoot me if I've done something stupid. Don't post without running Benchmark.pm on things though - I've already spent a few days on this!

# get the next utf-8 character from the input stream # and encode as UTF8 if possible/necessary sub nextchar { my $self = shift; $self->next; return unless defined $self->[CURRENT]; if ($self->[CURRENT] eq "\x0D") { $self->next; return unless defined($self->[CURRENT]); if ($self->[CURRENT] ne "\x0A") { $self->buffer("\x0A"); } } return unless $self->[ENCODING]; my $n = ord($self->[CURRENT]); # warn(sprintf("ch: 0x%x ($self->[CURRENT])\n", $n)); if (($] < 5.007002) && ($n > 0x7F)) { # utf8 surrogate my $current = $self->[CURRENT]; if ($n >= 0xFC) { # read 5 chars $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; } elsif ($n >= 0xF8) { # read 4 chars $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; } elsif ($n >= 0xF0) { # read 3 chars $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; } elsif ($n >= 0xE0) { # read 2 chars $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; } elsif ($n >= 0xC0) { # read 1 char $self->next; $current .= $self->[CURRENT]; } else { throw XML::SAX::Exception::Parse( Message => sprintf("Invalid character 0x%x", $n), ColumnNumber => $self->column, LineNumber => $self->line, PublicId => $self->public_id, SystemId => $self->system_id, ); } if ($] >= 5.006001) { $self->[CURRENT] = pack("U0A*", $current); } else { $self->[CURRENT] = $current; } } } ###################################################### # match anything *not* in the list of characters given sub match_not { my $self = shift; my $current = $self->[CURRENT]; return 0 unless defined $current; for my $m (@_) { if ($current eq $m) { $self->[MATCHED] = ''; return 0; } } $self->[MATCHED] = $current; $self->nextchar; return 1; } ###################################################### # get the next byte or character (character on 5.7.2+ # or byte on 5.5 or 5.6) from a fh, or from the buffer # if one exists. sub next { my $self = shift; # check for chars in buffer first. if (length($self->[BUFFER])) { return $self->[CURRENT] = substr($self->[BUFFER], 0, 1, ''); # + last param truncates buffer } if (length($self->[INTERNAL_BUFFER])) { BUFFERED_READ: $self->[CURRENT] = substr($self->[INTERNAL_BUFFER], 0, 1, ''); if ($self->[CURRENT] eq "\x0A") { $self->[LINE]++; $self->[COLUMN] = 1; } else { $self->[COLUMN]++ } return; } my $bytesread = read($self->[FH], $self->[INTERNAL_BUFFER], $self- +>[BUFFER_SIZE]); if ($bytesread) { # yes, goto. If you have a faster way feel free! goto BUFFERED_READ; } elsif (defined($bytesread)) { $self->[EOF]++; return $self->[CURRENT] = undef; } throw XML::SAX::Exception::Parse( Message => "Error reading from filehandle: $!", ); } ########################################################## # consume everything not matching chars listed in @_ sub consume_not { my $self = shift; my $consumed = ''; while(!$self->[EOF] && $self->match_not(@_)) { $consumed .= $self->[MATCHED]; } return length($self->[CONSUMED] = $consumed); } ######################################################## # Does the current token match a single character? sub match_char { my $self = shift; if (defined($self->[CURRENT]) && $self->[CURRENT] eq $_[0]) { $self->[MATCHED] = $_[0]; $self->nextchar; return 1; } $self->[MATCHED] = ''; return 0; } ######################################################### # Does the current token match a single regexp? sub match_re { my $self = shift; if ($self->[CURRENT] =~ $_[0]) { $self->[MATCHED] = $self->[CURRENT]; $self->nextchar; return 1; } $self->[MATCHED] = ''; return 0; }
The original code is on CPAN, but not with these performance tweaks. If you feel you need the live code (i.e. the code you see above along with the rest of it) to look at it's available from anon-cvs via: cvs -z3 -d :pserver:anonymous@axkit.org:/home/cvs login/co XML-SAX

Edit Petruchio Wed Feb 6 08:16:30 UTC 2002 - Added READMORE Tag

Comment on XML::SAX::PurePerl Performance
Select or Download Code
Re (tilly) 1: XML::SAX::PurePerl Performance
by tilly (Archbishop) on Feb 05, 2002 at 12:40 UTC
    If you are aiming for performance, my first question would be whether you need next and nextchar to be overridable. Write them as function calls and you will get a significant speedup. Since the common path when you call next should be that the next character is in the buffer, you can get a further speedup by moving the buffer check into nextchar and skipping the call entirely. Adding this logic should be a speedup even if the logic is left alone in the original function because of the relative cost of an if and a function call.

    As for the internals of the next method, I haven't run benchmark, but the basic approaches I would consider are the substr approach, splitting the string into an internal array you shift off, and using a /(.)/gs pattern match. (Warning on the last. Use of $' etc anywhere will slow that down. This may be a good reason to avoid no matter what benchmarking says.) I assume you have tried all of them? (Probably but it doesn't hurt to check.)

    An incidental note. Your goto can be removed from next with perceivable performance change. Make the if condition be empty, put an else around throwing the exception, and then move the BUFFERED_READ section after the decision logic. This should also be marginally faster because Perl doesn't have to spent time figuring out where the goto goes. (That shouldn't be the common path though, so the change should be marginal.)

    I am betting that 5.005 performance is not a priority of yours. But if you are allowing the logic to skip going to next anyways, you can replace a lot of the 5.005 logic with something like the following (untested):

    my $len = $n >= 0xFC ? 5 : $n >= 0xF8 ? 4 : $n >= 0xF0 ? 3 : $n >= 0xE0 ? 2 : $n >= 0xC0 ? 1 : throw XML::SAX::Exception::Parse( Message => sprintf("Invalid character 0x%x", $n), ColumnNumber => $self->column, LineNumber => $self->line, PublicId => $self->public_id, SystemId => $self->system_id, ); if ($len <= length($self->[BUFFER])) { $current .= substr($self->[BUFFER], 0, $len, ''); $self->[CURRENT] = substr($self->[BUFFER], -1); } else { $len -= length($self->[BUFFER]); $current .= $self->[BUFFER]; $self->[BUFFER] = ''; while (-1 < --$len) { next($self); $current .= $self->[CURRENT]; } }
    Again, it is more complex, but the fact you avoid a whole series of function calls should be a significant speedup.
      If you are aiming for performance, my first question would be whether you need next and nextchar to be overridable. Write them as function calls and you will get a significant speedup.

      Actually that's more of a myth than a truism. I did try it, but the speedup wasn't significant, though it varies from perl to perl.

      However I think there may be some value in the buffer check being moved to nextchar. I was originally thinking it needed to be in next() because sometimes next() is called on its own (for the encoding detection routines which need a byte-by-byte view), but that read()s in character by character anyway, so it might be a reasonable optimisation.

      You're right, I have tried all of the various "give me a character" methods, and substr() comes out on top. Which is a pain in the ass really - it's one point where Perl loses out to python where you can do string[0] to get the first character, just like you can in C.

      I'll leave the goto as is for now. It's not as bad as people make out - it's only bad when it's used for all flow control, and I think it's intention is quite clear here. Plus given a 1024 buffer, it's only part of the path 70 times in the parse of this 70K XML file.

      I do like that last optimisation though - that coupled with moving the buffer test into nextchar() might make a big difference (though maybe not as I don't think the particular test file in question has any UTF-8 characters in it). I'll try it and come back and let you know.

        YMMV, but when I tested it on my machine the popular myth was correct, a method call was massively slower than a function call. (Of course if you do anything interesting in your function...)

        Speaking of functions, I am kind of wondering what the purpose of next is. It seems from the code I see in it that it was intended to keep track of things like line numbers. But I don't see the rest of the code that would be needed to do that. (Sign of a change in design?) If that is the case, then what next is really providing is buffering.

        But isn't buffering exactly what read is supposed to do for you? OK, its speed is highly platform (and compilation option) dependent. But it seems to me that either you are better off using read, or else next should remove the additional buffering by using sysread.

Re: XML::SAX::PurePerl Performance
by dash2 (Hermit) on Feb 05, 2002 at 13:00 UTC
    As a sidenote, the -E option for dprofpp (which is on by default) excludes child subroutines from the figures for each subroutine. So I don't think you need to worry about that, unless maybe you have an old dprofpp.

    dave hj~

      I thought so too, until I added up the percentages.
Re: XML::SAX::PurePerl Performance
by theorbtwo (Prior) on Feb 05, 2002 at 13:11 UTC

    Hola, Matts, all.

    This probably isn't as big an improvement as tilly's above, but you can avoid a branch if you check $] at compiletime using a BEGIN block and play a little symref magic to have two different nextchar routines.

    Also, though using lexical variables is generaly a win because of maintainablity, if you're trying to get the last ounce of performance, try declaring your 'my' variables at the widest scope you can (file scope), and seeing how much that helps. IIRC, creating a lexical is expensive.

    Try running a line-level profiler, or even running a profiler on perl instead of your program.

    TACCTGTTTGAGTGTAACAATCATTCGCTCGGTGTATCCATCTTTG ACACAATGAATCTTTGACTCGAACAATCGTTCGGTCGCTCCGACGC
Re: XML::SAX::PurePerl Performance
by Zaxo (Archbishop) on Feb 05, 2002 at 16:39 UTC

    For comparison, here are the pre-optimization figures for perl-5.6.1, XML::SAX::PurePerl-0.08 from CPAN:

    $ perl -d:DProf t/16large.t 1..3 ok 1 ok 2 parsed 80085 bytes in 13 seconds ok 3 $ dprofpp -O 6 Total Elapsed Time = 11.05672 Seconds User+System Time = 10.67672 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call Csec/c Name 36.9 3.940 3.892 115004 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::match_nonext 14.8 1.590 1.547 86892 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::Stream::next 10.8 1.160 8.197 114995 0.0000 0.0001 XML::SAX::PurePerl::Reade +r::match 9.55 1.020 7.152 115004 0.0000 0.0001 XML::SAX::PurePerl::Reade +r::match_nocheck 9.27 0.990 2.450 86891 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::nextchar 9.05 0.966 8.389 17607 0.0001 0.0005 XML::SAX::PurePerl::Reade +r::consume

    Same machine, same perl, cvs version of fifteen minutes ago:

    $ perl -d:DProf t/16large.t 1..3 ok 1 ok 2 parsed 80085 bytes in 16 seconds ok 3 $ dprofpp -O 6 Total Elapsed Time = 13.08672 Seconds User+System Time = 11.91672 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call Csec/c Name 37.5 4.480 4.442 115004 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::match_nonext 13.2 1.580 1.537 86892 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::Stream::next 10.6 1.270 9.037 114995 0.0000 0.0001 XML::SAX::PurePerl::Reade +r::match 10.2 1.220 7.882 115004 0.0000 0.0001 XML::SAX::PurePerl::Reade +r::match_nocheck 9.02 1.075 9.239 17607 0.0001 0.0005 XML::SAX::PurePerl::Reade +r::consume 8.48 1.010 2.460 86891 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::nextchar
    It looks like a pessimization for for this machine. Both build directories were on the same spindle and partition.

    After Compline,
    Zaxo

Re: XML::SAX::PurePerl Performance
by chromatic (Archbishop) on Feb 05, 2002 at 21:26 UTC
    Looking at all those next() calls makes me wonder if you should add an optional $length parameter to get the next several bytes/characters. Collapsing three to five method dispatches and their concatenations seems like a potential improvement.

    I also wonder if it would help to emulate walking through a string with a char pointer as in C -- instead of substr, keep around a variable holding the position in the buffer.

    If you think either of these could help, I'd be happy to grab the CVS and play around a bit.

Re: XML::SAX::PurePerl Performance
by Juerd (Abbot) on Feb 08, 2002 at 21:22 UTC
    #!/usr/bin/perl use strict; use warnings; use Benchmark qw/cmpthese/; print "== Single subroutine argument ==\n"; sub b1_shift { my $x = shift; $x . ''; } sub b1_list { my ($x) = @_; $x . ''; } sub b1_direct { $_[0] . ''; } cmpthese( -5, { 'shift' => sub { b1_shift ('X' x rand 5000) }, 'list assignment' => sub { b1_list ('X' x rand 5000) }, 'direct' => sub { b1_direct('X' x rand 5000) } }); print "== Two subroutine arguments ==\n"; sub b2_shift { my $x = shift; my $y = shift; $x . $y; } sub b2_list { my ($x, $y) = @_; $x . $y; } sub b2_direct { $_[0] . $_[1]; } cmpthese( -5, { 'shift' => sub { b2_shift ('X' x rand 5000, 'Y' x rand 5 +000) }, 'list assignment' => sub { b2_list ('X' x rand 5000, 'Y' x rand 5 +000) }, 'direct' => sub { b2_direct('X' x rand 5000, 'Y' x rand 5 +000) } }); __END__ == Single subroutine argument == Benchmark: running direct, list assignment, shift, each for at least 5 + CPU seconds... direct: 5 wallclock secs ( 5.10 usr + 0.02 sys = 5.12 CPU) @ 11 +0247.85/s (n=564469) list assignment: 5 wallclock secs ( 5.14 usr + 0.03 sys = 5.17 CPU) + @ 89861.90/s (n=464586) shift: 6 wallclock secs ( 5.72 usr + 0.03 sys = 5.75 CPU) @ 70 +120.00/s (n=403190) Rate shift list assignment dire +ct shift 70120/s -- -22% -3 +6% list assignment 89862/s 28% -- -1 +8% direct 110248/s 57% 23% +-- == Two subroutine arguments == Benchmark: running direct, list assignment, shift, each for at least 5 + CPU seconds... direct: 5 wallclock secs ( 5.46 usr + 0.00 sys = 5.46 CPU) @ 57 +434.25/s (n=313591) list assignment: 6 wallclock secs ( 5.30 usr + 0.02 sys = 5.32 CPU) + @ 46535.53/s (n=247569) shift: 6 wallclock secs ( 5.00 usr + 0.01 sys = 5.01 CPU) @ 42 +526.95/s (n=213060) Rate shift list assignment direc +t shift 42527/s -- -9% -26 +% list assignment 46536/s 9% -- -19 +% direct 57434/s 35% 23% - +-

    2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://143416]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (8)
As of 2014-12-18 04:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (41 votes), past polls