Spaced Out
by MeowChow (Vicar)
on Feb 22, 2001 at 21:46 UTC
There I was, slurping data from web pages near and far, slinging regexes and treebuilders, elements and parsers, laughing at the remarkable ease with which Perl and my armada of CPAN modules untangled the gnarliest messes of unstructured data. Indeed, all was going quite well, until I spent several hours belaboring what reduced to the following code:
In a sane world, this would have spit out "a b c o o p s h u h ?" and I could have gone along my merry way. Ah, but what fun would that have been? Instead of that altogether predictable and boring output, what I got was "a b c h u h ?"
To witness this result for yourself, you will probably have to click the link to download the code. (You can copy-paste it under Opera, which makes it seem all the more spooky, but you can't under IE, and I'm not sure about Mozilla...) Actually, you may not see a problem with this code at all, depending on your screen font.
If you've been paying attention, you've quite probably figured out what my problem was. Some of the spaces in $str were not actually spaces. They looked like spaces, but they were actually character 0xA0, which lies outside the 7-bit ASCII range.
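For anyone who wants to poke at the effect without the download, here is a minimal sketch. It is not my original code: the sample string and the helper names count_fields and show_hidden are made up for illustration. It shows a split on a literal space finding nothing to split on, and one way to make the impostors visible.

```perl
use strict;
use warnings;

# Count the fields you get when splitting on a literal space.
sub count_fields {
    my ($text) = @_;
    my @fields = split / /, $text;
    return scalar @fields;
}

# Render every character outside printable ASCII as \x{..} so it can be seen.
sub show_hidden {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7E])/sprintf("\\x{%02X}", ord($1))/ge;
    return $text;
}

# Hypothetical input: the gaps here are chr(0xA0), not chr(0x20).
my $str = "abc\xA0oops\xA0huh?";

print count_fields($str), "\n";   # 1, because there is no real space to split on
print show_hidden($str), "\n";    # abc\x{A0}oops\x{A0}huh?
```

Running suspect data through something like show_hidden before blaming your regexes can save a few hours of head scratching.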
How did that character get there? In my code, $str came from a parsed web page. "What kind of deranged webmonkey would use high-bit ASCII characters masquerading as spaces in their HTML?" I wondered. I checked the source of the page and it was not possessed by any such evil characters. It did, however, have the seemingly innocuous entities, '&nbsp;', where the evil spaces were in my parsed HTML.
"Aha! I've discovered a bug in the HTML parser!", I happily exclaimed. Tracing through the code of this module led me to HTML::Entities, wherein I saw that '&nbsp;' was indeed decoded as character 0xA0.
The following snippet demonstrates this behaviour quite well (copy and paste at your leisure):
So was this a bug? As it turns out, '&nbsp;' is decoded exactly as it should be according to HTML specs, into character 0xA0 (the no-break space of ISO-8859-1, not an ASCII character at all). This is not the space many of us know, love, and expect. This is a wanton doppelganger space, which looks like a space, copies like a space, pastes like a space, and spaces like a space, but is not, in any true sense, a space.
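A minimal sketch of that decoding, assuming HTML::Entities is installed from CPAN (it is not a core module). decode_entities is the module's real, default-exported function; the helper name decoded_codepoints and the sample input are only illustrative.

```perl
use strict;
use warnings;
use HTML::Entities;   # CPAN module; exports decode_entities by default

# Return the code points of the decoded string as hex, separated by spaces.
sub decoded_codepoints {
    my ($html) = @_;
    my $text = decode_entities($html);
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $text;
}

print decoded_codepoints('a&nbsp;b'), "\n";   # 61 A0 62: the middle one is A0, not 20
```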
I don't very much care for this "non-breaking space", as it's called. My meditation, feeble though it may be, is this: unless you should have some specific want or need of this bastard space, exorcise it early (tr/\240/ /) from all of your entity-decoded inbound HTML.
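As a rough sketch of the exorcism, using only core Perl: the helper name exorcise and the sample string are hypothetical, and the string merely stands in for whatever your parser hands back after entity decoding.

```perl
use strict;
use warnings;

# Replace every 0xA0 (octal 240) with an ordinary space.
# "exorcise" is a made-up helper name for this sketch.
sub exorcise {
    my ($text) = @_;
    $text =~ tr/\240/ /;
    return $text;
}

my $decoded = "abc\xA0oops\xA0huh?";       # stand-in for entity-decoded HTML text
my @words   = split / /, exorcise($decoded);
print join('|', @words), "\n";             # abc|oops|huh?
```

Doing this once, right after decoding, means every later split and regex sees only the space it expects.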
MeowChow s aamecha.s a..a\u$&owag.print