PerlMonks

Our application at $work sometimes needs to pull a user's document out of the database and escape it for JavaScript before delivery. For this we used Unicode::Escape, until recently, when someone tried to work with an unusually large document and the web server spent an hour doing the escaping. This is the story of how I found and fixed the problem.

Finding the problem

I used Apache::DProf to find out why Apache was sitting on the CPU instead of responding. This was as easy as adding "PerlModule Apache::DProf" to httpd.conf and restarting Apache. (Actually, it then complained about a missing directory, which I had to create. Always check your error logs!) After that, I reran the offending request and looked for the huge tmon.out file, which I analyzed as described in Profiling your code.
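For reference, the whole profiling setup amounts to a single directive; this is a sketch of what went into httpd.conf (exact paths vary by installation):

```apache
# Load the profiler into every Apache child process.
PerlModule Apache::DProf

# Apache::DProf writes its tmon.out data under a directory of its
# own; if that directory is missing, the error log will say so
# (that's the directory I had to create by hand).
```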

Having narrowed it down to the escaping (which I never would have expected to be a problem), I ran that in a little test script under Devel::NYTProf. That led me straight to this loop:

while (my $uchar = $us->chop) {
    my $utf8 = $uchar->utf8;
    $rslt = ( ( $utf8 =~ /[\x80-\xff]/ )
        ? '\\u' . unpack( 'H4', $uchar->utf16be )
        : $utf8 ) . $rslt;
}

It loops through the entire string one character at a time, building the new string in reverse while destroying the old string. Charming.

Faster

My colleagues got a significant speed boost just by changing the loop to this:

while (my $uchar = $us->substr( $offset++, 1 )) {
    my $utf8 = $uchar->utf8;
    $rslt .= ( $utf8 =~ /[\x80-\xff]/ )
        ? '\\u' . unpack( 'H4', $uchar->utf16be )
        : $utf8;
}

I haven't dived into the guts of this, but I suspect it gets its speed from two things.

  • It appends to the new string instead of prepending, which probably avoids a lot of copying.
  • It reads the old string instead of chopping it up, which avoids writes as well.

The only downside is that it may use more memory, since it copies the string rather than building one out of the other.
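The append-versus-prepend difference is easy to see in isolation. Here's a minimal sketch (plain Perl strings rather than Unicode::String objects, and the helper names are mine):

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Build a string by prepending each character, as the chop loop did.
# Prepending copies the whole target on nearly every step, so the
# total work grows quadratically with the string length.
sub build_by_prepend {
    my $out = '';
    $out = $_ . $out for split //, $_[0];
    return $out;    # note: comes out reversed
}

# Build by appending, as the substr loop does.  Perl over-allocates
# on .=, so each append is amortized constant time.
sub build_by_append {
    my $out = '';
    $out .= $_ for split //, $_[0];
    return $out;
}

my $src = 'x' x 20_000;
cmpthese( 5, {
    'prepend' => sub { build_by_prepend( $src ) },
    'append'  => sub { build_by_append( $src ) },
} );
```

The numbers are only suggestive, but the gap widens quickly as the input grows.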

Faster faster

Looking around at the code in question, I noticed a few things.

  • We only ever use Unicode::Escape in one place.
  • It has facilities for lots of different character encodings.
  • We only ever feed it utf8.

So I decided to write something that assumes its input is utf8 and doesn't have to handle anything else.

My first approach was to build a hash that maps utf8 encoded strings to JavaScript escapes for those strings. Then I could do one big s/// replacement with quick little hash lookups instead of any function calls. I could hard code the data structures in a new module, and it would sit on some memory to gain speed. As it turns out, the table would have been too big for this to be a workable solution.
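To make the idea concrete, it would have looked something like this, with a tiny illustrative slice of the table (the full table is exactly what made it unworkable; utf8_escape_by_hash and %escape_for are names I'm making up for the sketch):

```perl
use strict;
use warnings;

# A tiny slice of the lookup table the full version would need:
# every multibyte utf8 sequence mapped to its JavaScript escape.
my %escape_for = (
    "\xc2\xa2"     => '\\u00a2',   # CENT SIGN
    "\xc2\xa3"     => '\\u00a3',   # POUND SIGN
    "\xe3\x81\x82" => '\\u3042',   # HIRAGANA LETTER A
);

sub utf8_escape_by_hash {
    my $s = shift;
    # One pass: match any multibyte sequence, replace via hash lookup.
    $s =~ s{ ( [\xc2-\xdf] [\x80-\xbf]
             | [\xe0-\xef] [\x80-\xbf]{2}
             ) }
           { $escape_for{$1} // '\\ufffd' }xemsg;
    return $s;
}
```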

I also tried to use Unicode::Escape as a baseline to figure out how to escape everything, but I gave up on that when it turned out that "\xc2\xc2" escapes to "\\ufffd\\ufffd" and "\x80" to "\\ufffd", yet "\xc2\xc2\x80" escapes to "\\ufffd\\u0080".

I finally wrote the following based on my reading of UTF-8.

# This does what Unicode::Escape::escape does, except a lot less.
# It assumes the input is utf8 encoded.
# It assumes the input is defined.
# Invalid utf8 is deleted silently.
sub utf8_escape {
    # Short circuit if it's all seven bit ASCII.
    return $_[0] unless $_[0] =~ /[\x80-\xff]/;

    my $s = shift;

    $s =~ s{ ([\xc2-\xdf]) ([\x80-\xbf]) }{
        '\\u' . sprintf( '%04x',
            ( ( 0b00011111 & ord $1 ) << 6 ) |
            (   0b00111111 & ord $2 )
        )
    }exmsg;

    $s =~ s{ ([\xe0-\xef]) ([\x80-\xbf]) ([\x80-\xbf]) }{
        '\\u' . sprintf( '%04x',
            ( ( 0b00001111 & ord $1 ) << 12 ) |
            ( ( 0b00111111 & ord $2 ) << 6  ) |
            (   0b00111111 & ord $3 )
        )
    }exmsg;

    # valid utf8 that can't be encoded in \uXXXX
    $s =~ s{ [\xf0-\xf4] [\x80-\xbf]{3} }{\\ufffd}xmsg;

    # invalid utf8
    $s =~ tr/\x80-\xff//d;

    return $s;
}
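As a sanity check on the bit twiddling, here's the three-byte case worked by hand for one character (a standalone demo, not part of the production code):

```perl
use strict;
use warnings;

# "\xe3\x81\x82" is HIRAGANA LETTER A, U+3042.  Follow the
# three-byte branch of utf8_escape by hand:
my ( $b1, $b2, $b3 ) = map ord, split //, "\xe3\x81\x82";

my $cp = ( ( 0b00001111 & $b1 ) << 12 )   # low 4 bits of the lead byte
       | ( ( 0b00111111 & $b2 ) << 6  )   # 6 bits from the second byte
       | (   0b00111111 & $b3 );          # 6 bits from the third byte

printf "\\u%04x\n", $cp;   # prints \u3042
```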

Checking my work

I did functional testing using a string from the Unicode::Escape tests and also some I cooked up on my own. I checked it by confirming that it produces the same output as the original, given valid inputs. I also did the same kind of confirmation using several hundred documents from our database.

use strict;
use warnings;
use Test::More;
use Unicode::Escape;

my @test_texts = (
    {
        text => "\x{e3}\x{81}\x{82}"
              . "\x{e3}\x{81}\x{84}"
              . "\x{e3}\x{81}\x{86}"
              . "\x{e3}\x{81}\x{88}"
              . "\x{e3}\x{81}\x{8a}",
        name => 'utf8 test from Unicode::Escape-0.0.2',
    },
    {
        text => q{},
        name => 'empty string',
    },
    {
        text => '0',
        name => 'zero (false-looking)',
    },
    {
        text => "\x{c2}\x{a2}" . "\x{c2}\x{a3}" . "\x{c2}\x{a4}",
        name => 'some two-byte utf8',
    },
    # Unicode::Escape escapes this as '\udbea\udfcd'  What's that?
    # {
    #     text => "\x{f4}\x{8a}\x{af}\x{8d}",
    #     name => 'four-byte utf8',
    # },
    {
        text => "one: X, two: \x{c2}\x{a5}, three: \x{e3}\x{81}\x{8a}",
        name => 'mixed character length utf8',
    },
);

plan 'tests' => scalar @test_texts;

foreach my $t ( @test_texts ) {
    die 'bad test data' if grep { ! defined $t->{$_} } qw( text name );

    my $text      = $t->{text};
    my $canonical = Unicode::Escape::escape( $text );
    my $test      = utf8_escape( $text );

    # I use 'ok' with 'eq' instead of 'is' so that a failure doesn't
    # puke a lot of unintelligible yuck.
    ok( $canonical eq $test, "correct escaping for '$t->{name}'" );
}

I compared their speed using Benchmark to confirm that my new solution really does go faster.

use strict;
use warnings;
use Unicode::Escape;
use Benchmark qw( cmpthese );

my $test_text;

my $subs = {
    'utf8esc'  => sub { utf8_escape( $test_text ) },
    'Uni::Esc' => sub { Unicode::Escape::escape( $test_text ) },
};

my @texts = (
    [ 'abc123ABC456' x 10_000, 'no utf8' ],
    [ "\x{e3}\x{81}\x{82}\x{e3}\x{81}\x{84}\x{e3}\x{81}\x{86}\x{e3}\x{81}\x{88}\x{e3}\x{81}\x{8a}" x 10_000,
      'all 3-byte utf8' ],
);

foreach my $txt ( @texts ) {
    $test_text = $txt->[0];
    print "*** $txt->[1]\n";
    cmpthese( -30, $subs );
}

The results are about what I expected. The new solution is dramatically faster in the case where it doesn't have to do anything (i.e., it's all seven bit ASCII) and still a lot faster in its worst case of doing a lot of work.

Faster faster faster?

Alert monks may notice that my solution scans the input string five times. Could we match the various multi-byte encodings with a single regular expression and eliminate some of those scans? I tried that.

sub utf8_escape_2 {
    my $s = shift;

    $s =~ s{
        ( [\xc2-\xdf] [\x80-\xbf]
        | [\xe0-\xef] [\x80-\xbf]{2}
        | [\xf0-\xf4] [\x80-\xbf]{3}
        )
    }{
        my @c = map ord, split //, $1;
        '\\u' . (
            ( 2 == @c ) ? sprintf( '%04x',
                ( ( 0b00011111 & $c[0] ) << 6 ) |
                (   0b00111111 & $c[1] ) )
          : ( 3 == @c ) ? sprintf( '%04x',
                ( ( 0b00001111 & $c[0] ) << 12 ) |
                ( ( 0b00111111 & $c[1] ) << 6  ) |
                (   0b00111111 & $c[2] ) )
          : 'fffd'
        );
    }xemsg;

    $s =~ tr/\x80-\xff//d;   # invalid utf8

    return $s;
}

This turned out not to be faster. I'm guessing the pattern with alternation takes a lot longer than the more static patterns in the multiple scans. I also wrote a version that uses length and substr, in case split was more expensive than I thought. That wasn't any faster either.
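I didn't keep the length-and-substr variant, but it looked roughly like this (a reconstruction, so treat it as a sketch rather than the exact code I benchmarked):

```perl
use strict;
use warnings;

sub utf8_escape_3 {
    my $s = shift;

    $s =~ s{
        ( [\xc2-\xdf] [\x80-\xbf]
        | [\xe0-\xef] [\x80-\xbf]{2}
        | [\xf0-\xf4] [\x80-\xbf]{3}
        )
    }{
        my $seq = $1;
        my $len = length $seq;
        ( 2 == $len ) ? sprintf( '\\u%04x',
            ( ( 0b00011111 & ord substr( $seq, 0, 1 ) ) << 6 ) |
            (   0b00111111 & ord substr( $seq, 1, 1 ) ) )
      : ( 3 == $len ) ? sprintf( '\\u%04x',
            ( ( 0b00001111 & ord substr( $seq, 0, 1 ) ) << 12 ) |
            ( ( 0b00111111 & ord substr( $seq, 1, 1 ) ) << 6  ) |
            (   0b00111111 & ord substr( $seq, 2, 1 ) ) )
      : '\\ufffd';
    }xemsg;

    $s =~ tr/\x80-\xff//d;   # invalid utf8

    return $s;
}
```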

Doubts

I didn't know the details of how UTF-8 is encoded before I started working on this. Some of the Unicode::Escape behavior I see still confuses me, to the point that I wonder whether I've done something wrong. My only comfort is that my code does the same as the old code when operating on real data.

In spite of not finding a better way, I still dislike the smell of making many passes over my input. I'm not about to use a while loop, as Unicode::Escape does, to walk the string one piece at a time, but I haven't shaken the idea that there could be a single regular expression that would do the job faster.

Lessons learned

I've learned once again that I can't just eyeball code and figure out where it's slow. Profiling nearly always surprises me.

Code that's specialized to the task can make assumptions and take shortcuts that general purpose code can't. When there's a performance problem, it's worth throwing out "general solution" code in favor of something faster with less potential for reuse.

Lessons still to learn

I've found already that when I come to the monks with a tale such as this, I learn even more. The monks are generous and humbling. I welcome your comments!


In reply to Faster utf8 escaping. by kyle
