|Pathologically Eclectic Rubbish Lister|
Faster utf8 escaping.by kyle (Abbot)
|on Apr 07, 2009 at 16:43 UTC||Need Help??|
Finding the problem
I used Apache::DProf to find why apache was sitting on the CPU instead of responding. This was as easy as adding "PerlModule Apache::DProf" to the httpd.conf and restarting Apache. (Actually, it then complained about a missing directory which I had to create—always check your error logs!) After that, I ran the offending request and looked for the huge tmon.out file. Analyzing that was as described in Profiling your code.
Having narrowed it down to the escaping (which I never would have expected to be a problem), I ran that in a little test script under Devel::NYTProf. That led me straight to this loop:
It loops through the entire string one character at a time, building the new string in reverse while destroying the old string. Charming.
My colleagues got a significant speed boost just by changing the loop to this:
I haven't dived into the guts of this, but I suspect it gets its speed from two things.
The only down side is that it may use more memory as it's really copying the string instead of building one out of the other.
Looking around at the code in question, I noticed a few things.
So I thought of writing something that assumes its input is utf8 so it doesn't have to work on anything else.
I also tried to use Unicode::Escape as a baseline to figure out how to escape everything, but I gave up on that when it turns out "\xc2\xc2" is "\\ufffd\\ufffd" and "\x80" is "\\ufffd" but "\xc2\xc2\x80" is "\\ufffd\\u0080".
I finally wrote the following based on my reading of UTF-8.
Checking my work
I did functional testing using a string from the Unicode::Escape tests and also some I cooked up on my own. I basically check it by confirming that it gets the same thing the original does, given valid inputs. I also did the same kind of confirmation using several hundred documents from our database.
I compared their speed using Benchmark to confirm that my new solution really does go faster.
The results are about what I expected. The new solution is dramatically faster in the case where it doesn't have to do anything (i.e., it's all seven bit ASCII) and still a lot faster in its worst case of doing a lot of work.
Faster faster faster?
Alert monks may notice that my solution scans the input string five times. Could we match the various multi-byte encodings with a single regular expression and eliminate some of those scans? I tried that.
This turns out not to be faster. I'm guessing the pattern with alternation takes a lot longer than the more static patterns I have in the multiple scans. I also wrote one that uses length and substr in case split is more expensive than I think it is. That wasn't any faster either.
I didn't know the details of how UTF-8 was encoded before I started working on this. I still am confused by some of the Unicode::Escape behavior I see to the point that I wonder if I've done something wrong. My only comfort is the fact that I know my code does the same as the old code when operating on real data.
In spite of not finding a better way, I still dislike the smell of doing many passes over my input. I'm not about to use a while loop, as Unicode::Escape does, to look over the string one piece at a time, but I haven't shaken the idea that there could be a regular expression that would do the job faster.
I've learned once again that I can't just eyeball code and figure out where it's slow. Profiling nearly always surprises me.
Code that's specialized to the task can make assumptions and take shortcuts that general purpose code can't. When there's a performance problem, it's worth throwing out "general solution" code in favor of something faster with less potential for reuse.
Lessons still to learn
I've found already that when I come to the monks with a tale such as this, I learn even more still. The monks are generous and humbling. I welcome your comments!