In a sane world, this would have spit out "a b c o o p s h u h ?" and I could have gone along my merry way. Ah, but would fun would that have been? Instead of that alltogether predictable and boring output, what I got was "a b c h u h ?"$str = 'a b c o o p s h u h ? '; print $& while $str =~ /.\s/g;
To witness this result for yourself, you will probably have to click the link to download the code (though you can copy-paste it under Opera, which makes it seem all the more spooky, you can't under IE, and I'm not sure about Mozilla...) Actually, you may not see a problem with this code at all, depending on your screen font.
If you've been paying attention, you've quite probably figured out what my problem was. Some of the spaces in $str were not actually spaces. They looked like spaces, but they were actually ASCII character 0xA0's.
How did that character get there? In my code, $str came from a parsed web page. "What kind of deranged webmonkey would use high-bit ASCII characters masquerading as spaces in their HTML?" I wondered. I checked the source of the page and it was not possessed with any such evil characters. It did, however, have the seemingly innocuous entities, ' ' where the evil spaces were in my parsed HTML.
"Aha! I've discovered a bug in the HTML parser!", I happily exclaimed. Tracing through the code of this module lead me to HTML::Entities, wherein I saw that ' ' was indeed decoded as character 0xA0.
The following snippet demonstrates this behaviour quite well (copy and paste at your leisure):
So was this a bug? As it turns out, ' ' is decoded exactly as it should be according to HTML specs, into ASCII character 0xA0. This is not the space many of us know, love, and expect. This is a wanton doppelganger space, which looks like a space, copies like a space, pastes like a space, and spaces like a space, but is not, in any true sense, a space.use HTML::Entities; my $html = 'whoa dude whats going on with this line'; my $text = HTML::Entities::decode_entities($html); print "$text\n"; ## prints: whoa dude whats going on with this line print $& while $text =~ /\w+\s+/g; ## prints: whoa dude whats with this
I don't very much care for this "non-breaking space", as it's called. My meditation, feeble thought it may be, is this: unless you should have some specific want or need of this bastard space, exorcise it early (tr/\240/ /) from all of your entity-decoded inbound HTML.
MeowChow s aamecha.s a..a\u$&owag.print