PerlMonks
Faster utf8 escaping.

by kyle (Abbot)
on Apr 07, 2009 at 16:43 UTC

Our application at $work sometimes needs to pull a user's document out of the database and do some escaping for the benefit of JavaScript before delivery. For this, we used Unicode::Escape until recently when someone tried to work with an unusually large document and the web server spent an hour doing the escaping. This is my story of finding and fixing the problem.

Finding the problem

I used Apache::DProf to find out why Apache was sitting on the CPU instead of responding. This was as easy as adding "PerlModule Apache::DProf" to the httpd.conf and restarting Apache. (Actually, it then complained about a missing directory, which I had to create; always check your error logs!) After that, I ran the offending request and looked for the huge tmon.out file. Analyzing that was as described in Profiling your code.

Having narrowed it down to the escaping (which I never would have expected to be a problem), I ran that in a little test script under Devel::NYTProf. That led me straight to this loop:

while (my $uchar = $us->chop) {
    my $utf8 = $uchar->utf8;
    $rslt = ( ( $utf8 =~ /[\x80-\xff]/ )
                ? '\\u' . unpack( 'H4', $uchar->utf16be )
                : $utf8 )
          . $rslt;
}

It loops through the entire string one character at a time, building the new string in reverse while destroying the old string. Charming.

Faster

My colleagues got a significant speed boost just by changing the loop to this:

while (my $uchar = $us->substr( $offset++, 1 )) {
    my $utf8 = $uchar->utf8;
    $rslt .= ( $utf8 =~ /[\x80-\xff]/ )
           ? '\\u' . unpack( 'H4', $uchar->utf16be )
           : $utf8;
}

I haven't dived into the guts of this, but I suspect it gets its speed from two things.

  • It's appending to the new string instead of prepending. This probably avoids a lot of copying.
  • It's reading the old string instead of chopping it up. That avoids writes also.

The only downside is that it may use more memory, since it really copies the string instead of building one out of the other.
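The append-vs-prepend effect is easy to demonstrate in isolation. Here's a minimal sketch of my own (not code from our application) that times the two ways of building a string out of pieces:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @chunks = ('x') x 50_000;

cmpthese( -3, {
    # Build the result by prepending each piece, like the old loop.
    # Each prepend copies the whole string built so far.
    prepend => sub {
        my $rslt = q{};
        $rslt = $_ . $rslt for @chunks;
        return $rslt;
    },
    # Build the result by appending each piece, like the new loop.
    # Perl can usually extend the buffer in place.
    append => sub {
        my $rslt = q{};
        $rslt .= $_ for @chunks;
        return $rslt;
    },
} );
```

On my reading, prepending is quadratic in the string length while appending is roughly linear, which matches the speedup my colleagues saw.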

Faster faster

Looking around at the code in question, I noticed a few things.

  • We only ever use Unicode::Escape in one place.
  • It has facilities for lots of different character encodings.
  • We only ever feed it utf8.

So I thought of writing something that assumes its input is utf8 so it doesn't have to work on anything else.

My first approach was to build a hash that maps utf8 encoded strings to JavaScript escapes for those strings. Then I could do one big s/// replacement with quick little hash lookups instead of any function calls. I could hard code the data structures in a new module, and it would sit on some memory to gain speed. As it turns out, the table would have been too big for this to be a workable solution.
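For what it's worth, the idea can be sketched for a tiny subset of the table. This is my illustration only (the sequences and names here are mine); the real table would need an entry for every multi-byte sequence, which is what made it unworkable:

```perl
use strict;
use warnings;

# A toy version of the lookup-table idea, covering only three
# two-byte utf8 sequences.  The real table would be enormous.
my %escape_for = (
    "\xc2\xa2" => '\\u00a2',    # cent sign
    "\xc2\xa3" => '\\u00a3',    # pound sign
    "\xc2\xa5" => '\\u00a5',    # yen sign
);

# One big substitution with quick hash lookups in the replacement.
my $pattern = join '|', map quotemeta, keys %escape_for;

my $s = "price: \xc2\xa3 or \xc2\xa5";
$s =~ s/($pattern)/$escape_for{$1}/g;

print "$s\n";    # price: \u00a3 or \u00a5
```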

I also tried to use Unicode::Escape as a baseline to figure out how to escape everything, but I gave up on that when it turned out that "\xc2\xc2" becomes "\\ufffd\\ufffd" and "\x80" becomes "\\ufffd", yet "\xc2\xc2\x80" becomes "\\ufffd\\u0080".

I finally wrote the following based on my reading of UTF-8.

# This does what Unicode::Escape::escape does, except a lot less.
# It assumes the input is utf8 encoded.
# It assumes the input is defined.
# Invalid utf8 is deleted silently.
sub utf8_escape {
    # Short circuit if it's all seven bit ASCII.
    return $_[0] unless $_[0] =~ /[\x80-\xff]/;

    my $s = shift;

    $s =~ s{ ([\xc2-\xdf]) ([\x80-\xbf]) }{
        '\\u' . sprintf( '%04x',
            ( ( 0b00011111 & ord $1 ) << 6 )
          |   ( 0b00111111 & ord $2 )
        )
    }exmsg;

    $s =~ s{ ([\xe0-\xef]) ([\x80-\xbf]) ([\x80-\xbf]) }{
        '\\u' . sprintf( '%04x',
            ( ( 0b00001111 & ord $1 ) << 12 )
          | ( ( 0b00111111 & ord $2 ) << 6 )
          |   ( 0b00111111 & ord $3 )
        )
    }exmsg;

    # valid utf8 that can't be encoded in \uXXXX
    $s =~ s{ [\xf0-\xf4] [\x80-\xbf]{3} }{\\ufffd}xmsg;

    # invalid utf8
    $s =~ tr/\x80-\xff//d;

    return $s;
}

Checking my work

I did functional testing using a string from the Unicode::Escape tests and also some I cooked up on my own. I basically check it by confirming that it gets the same thing the original does, given valid inputs. I also did the same kind of confirmation using several hundred documents from our database.

use strict;
use warnings;
use Test::More;
use Unicode::Escape;

my @test_texts = (
    {
        text => "\x{e3}\x{81}\x{82}"
              . "\x{e3}\x{81}\x{84}"
              . "\x{e3}\x{81}\x{86}"
              . "\x{e3}\x{81}\x{88}"
              . "\x{e3}\x{81}\x{8a}",
        name => 'utf8 test from Unicode::Escape-0.0.2',
    },
    {
        text => q{},
        name => 'empty string',
    },
    {
        text => '0',
        name => 'zero (false-looking)',
    },
    {
        text => "\x{c2}\x{a2}" . "\x{c2}\x{a3}" . "\x{c2}\x{a4}",
        name => 'some two-byte utf8',
    },
    # Unicode::Escape escapes this as '\udbea\udfcd'  What's that?
    # {
    #     text => "\x{f4}\x{8a}\x{af}\x{8d}",
    #     name => 'four-byte utf8',
    # },
    {
        text => "one: X, two: \x{c2}\x{a5}, three: \x{e3}\x{81}\x{8a}",
        name => 'mixed character length utf8',
    },
);

plan 'tests' => scalar @test_texts;

foreach my $t ( @test_texts ) {
    die 'bad test data' if grep { ! defined $t->{$_} } qw( text name );

    my $text      = $t->{text};
    my $canonical = Unicode::Escape::escape( $text );
    my $test      = utf8_escape( $text );

    # I use 'ok' with 'eq' instead of 'is' so that a failure doesn't
    # puke a lot of unintelligible yuck.
    ok( $canonical eq $test, "correct escaping for '$t->{name}'" );
}

I compared their speed using Benchmark to confirm that my new solution really does go faster.

use strict;
use warnings;
use Unicode::Escape;
use Benchmark qw( cmpthese );

my $test_text;

my $subs = {
    'utf8esc'  => sub { utf8_escape( $test_text ) },
    'Uni::Esc' => sub { Unicode::Escape::escape( $test_text ) },
};

my @texts = (
    [ 'abc123ABC456' x 10_000, 'no utf8' ],
    [ "\x{e3}\x{81}\x{82}\x{e3}\x{81}\x{84}\x{e3}\x{81}\x{86}\x{e3}\x{81}\x{88}\x{e3}\x{81}\x{8a}" x 10_000,
      'all 3-byte utf8' ],
);

foreach my $txt ( @texts ) {
    $test_text = $txt->[0];
    print "*** $txt->[1]\n";
    cmpthese( -30, $subs );
}

The results are about what I expected. The new solution is dramatically faster in the case where it doesn't have to do anything (i.e., it's all seven bit ASCII) and still a lot faster in its worst case of doing a lot of work.

Faster faster faster?

Alert monks may notice that my solution scans the input string five times. Could we match the various multi-byte encodings with a single regular expression and eliminate some of those scans? I tried that.

sub utf8_escape_2 {
    my $s = shift;

    $s =~ s{
        (
            [\xc2-\xdf] [\x80-\xbf]
          | [\xe0-\xef] [\x80-\xbf]{2}
          | [\xf0-\xf4] [\x80-\xbf]{3}
        )
    }{
        my @c = map ord, split //, $1;
        '\\u' . (
            ( 2 == @c ) ? sprintf( '%04x',
                              ( ( 0b00011111 & $c[0] ) << 6 )
                            |   ( 0b00111111 & $c[1] ) )
          : ( 3 == @c ) ? sprintf( '%04x',
                              ( ( 0b00001111 & $c[0] ) << 12 )
                            | ( ( 0b00111111 & $c[1] ) << 6 )
                            |   ( 0b00111111 & $c[2] ) )
          :               'fffd'
        );
    }xemsg;

    $s =~ tr/\x80-\xff//d;    # invalid utf8

    return $s;
}

This turns out not to be faster. I'm guessing the pattern with alternation takes a lot longer than the more static patterns I have in the multiple scans. I also wrote one that uses length and substr in case split is more expensive than I think it is. That wasn't any faster either.
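I no longer have that exact code handy, but the length/substr variant looked roughly like this (a reconstruction, not the original):

```perl
use strict;
use warnings;

# Like utf8_escape_2, but classify each matched sequence by its
# byte length and index into it with substr instead of split.
sub utf8_escape_3 {
    my $s = shift;

    $s =~ s{
        (
            [\xc2-\xdf] [\x80-\xbf]
          | [\xe0-\xef] [\x80-\xbf]{2}
          | [\xf0-\xf4] [\x80-\xbf]{3}
        )
    }{
        my $seq = $1;
        my $len = length $seq;
        ( 2 == $len ) ? sprintf( '\\u%04x',
                            ( ( 0b00011111 & ord substr( $seq, 0, 1 ) ) << 6 )
                          |   ( 0b00111111 & ord substr( $seq, 1, 1 ) ) )
      : ( 3 == $len ) ? sprintf( '\\u%04x',
                            ( ( 0b00001111 & ord substr( $seq, 0, 1 ) ) << 12 )
                          | ( ( 0b00111111 & ord substr( $seq, 1, 1 ) ) << 6 )
                          |   ( 0b00111111 & ord substr( $seq, 2, 1 ) ) )
      :                 '\\ufffd';
    }xemsg;

    $s =~ tr/\x80-\xff//d;    # invalid utf8

    return $s;
}

print utf8_escape_3("\xe3\x81\x82"), "\n";    # \u3042
```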

Doubts

I didn't know the details of how UTF-8 is encoded before I started working on this. I'm still confused by some of the Unicode::Escape behavior I see, to the point that I wonder if I've done something wrong. My only comfort is the fact that I know my code does the same as the old code when operating on real data.

In spite of not finding a better way, I still dislike the smell of doing many passes over my input. I'm not about to use a while loop, as Unicode::Escape does, to look over the string one piece at a time, but I haven't shaken the idea that there could be a regular expression that would do the job faster.

Lessons learned

I've learned once again that I can't just eyeball code and figure out where it's slow. Profiling nearly always surprises me.

Code that's specialized to the task can make assumptions and take shortcuts that general purpose code can't. When there's a performance problem, it's worth throwing out "general solution" code in favor of something faster with less potential for reuse.

Lessons still to learn

I've found already that when I come to the monks with a tale such as this, I learn even more still. The monks are generous and humbling. I welcome your comments!

Re: Faster utf8 escaping.
by ikegami (Pope) on Apr 07, 2009 at 17:24 UTC

    Using C would also eliminate the need for all those passes. It's faster to do multiple passes in Perl since three of the passes are entirely in C.

    Here's a trick that's been known to speed things up if most of your data is ASCII: Read four bytes at a time. If none of them have the high bit set, you've just handled 4 chars at once.

    The extra work would probably slow Perl down, but it probably speeds up a C solution. Be wary of alignments.
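[Editor's sketch, not ikegami's code.] The trick is aimed at C, but the bit test itself can be illustrated in Perl; the helper name here is my own invention, and as ikegami says, doing this in Perl would likely be slower than the regex approach:

```perl
use strict;
use warnings;

# Check four bytes at once: unpack them as one 32-bit big-endian
# integer and AND against a mask of the four high bits.  A zero
# result means all four bytes are seven bit ASCII.
sub chunk_is_ascii {
    my ($four_bytes) = @_;
    return !( unpack( 'N', $four_bytes ) & 0x80808080 );
}

print chunk_is_ascii('abcd')       ? "ascii\n" : "not ascii\n";  # ascii
print chunk_is_ascii("ab\xc2\xa2") ? "ascii\n" : "not ascii\n";  # not ascii
```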

Re: Faster utf8 escaping.
by Porculus (Hermit) on Apr 07, 2009 at 20:54 UTC

    Another thing to bear in mind is that you can often get a long way with core modules that almost do what you want.

    For example, you were using Unicode::Escape because it promised to conveniently turn non-ASCII characters into Javascript escape sequences. And this made sense, because that's what you wanted to end up with. Did you look at the core Encode module, though? There's at least one way to use that to solve your problem, and in my tests -- using your test cases -- it comes out about 35% faster than your hand-rolled version, while also providing all the extra stuff Unicode::Escape does like handling non-UTF-8 encodings or invalid UTF-8.
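[Editor's sketch.] Porculus doesn't show his code, but one plausible shape for the Encode-based approach, pieced together from the replies in this thread (and subject to the FB_XMLCREF caveats ikegami raises below), is:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Decode the utf8 octets to characters, then re-encode as ASCII.
# FB_XMLCREF turns anything non-ASCII into an XML character
# reference like &#x2660;, and a final pass rewrites those as
# JavaScript \uXXXX escapes.  Invalid utf8 becomes \ufffd.
sub utf8_escape_via_encode {
    my $s = encode( 'ascii', decode( 'UTF-8', shift ), Encode::FB_XMLCREF );
    $s =~ s/&#x([0-9a-f]+);/sprintf '\\u%04x', hex $1/ieg;
    return $s;
}

print utf8_escape_via_encode("one: \xc2\xa5"), "\n";    # one: \u00a5
```

Note that this misbehaves if the input already contains a literal "&#x...;" sequence, which is exactly the objection discussed below.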

      First, it's important to note that Encode::FB_XMLCREF doesn't work on from_to when using 5.8.8's Encode 2.12.

      $ perl -MEncode=encode,from_to -wle'$s = encode("UTF-8", chr(0x2660)); from_to($s, "UTF-8", "ascii", Encode::FB_XMLCREF); print $s' | od -c
      0000000 342 231 240  \n
      0000004

      The problem appears to be fixed in 2.13. Using Encode 2.33:

      $ perl -MEncode=encode,from_to -wle'$s = encode("UTF-8", chr(0x2660)); from_to($s, "UTF-8", "ascii", Encode::FB_XMLCREF); print $s' | od -c
      0000000   &   #   x   2   6   6   0   ;  \n
      0000011

      So it's not a solid solution to begin with. (You'd have to use decode+encode instead of from_to.) But the approach is buggy even with a bug-free version of Encode:

      1..1
      not ok 1 - With XML/HTML 4 digit hex entity
      #   Failed test 'With XML/HTML 4 digit hex entity'
      #   at a.pl line 24.
      #          got: '\u2660\u2660'
      #     expected: '&#x2660;\u2660'
      # Looks like you failed 1 test of 1.

      Thank you! I like this solution. You're right, core modules can often get you most of the way to where you want to go.

      That said, it needed some modification to work with the larger battery of tests I have locally. The first thing I noticed is that Encode doesn't always produce a four digit hex entity. I took care of that by allowing the regular expression to match two or four characters. Then came the string that encoded as "Mar&#ed;a; F", so I made the "characters" match only hex digits.

      Now the replacement looks like this:

      $s =~ s/&#x([a-f0-9]{2})?([a-f0-9]{2});/'\\u' . ($1||'00') . $2/ieg;

      That still doesn't take care of the problem that ikegami raises. I figure I can preprocess with s/&/&x/g and then back with s/&x/&/g when it's done (or something), but then we're back up to five full scans over the input string. It might still be faster, but I haven't tested that yet.

      Thanks again.

Re: Faster utf8 escaping.
by ikegami (Pope) on Apr 07, 2009 at 22:26 UTC
    What if you were escaping the string "Why am I getting \ufffd?". Shouldn't you escape "\" as well?

      Shouldn't you escape "\" as well?

      I'd think so, but Unicode::Escape doesn't. In our code, we escape that and quotation marks before calling it.

Node Type: perlmeditation [id://756074]
Approved by ikegami
Front-paged by Corion