Spaced Out

There I was, slurping data from web pages near and far, slinging regexes and treebuilders, elements and parsers, laughing at the remarkable ease with which Perl and my armada of CPAN modules untangled the gnarliest messes of unstructured data. Indeed, all was going quite well, until I spent several hours belaboring what reduced to the following code:

$str = 'a b c o o p s h u h ? ';
print $& while $str =~ /.\s/g;

In a sane world, this would have spit out "a b c o o p s h u h ?" and I could have gone along my merry way. Ah, but what fun would that have been? Instead of that altogether predictable and boring output, what I got was "a b c h u h ?"

To witness this result for yourself, you will probably have to click the link to download the code (though you can copy-paste it under Opera, which makes it seem all the more spooky, you can't under IE, and I'm not sure about Mozilla...) Actually, you may not see a problem with this code at all, depending on your screen font.

If you've been paying attention, you've quite probably figured out what my problem was. Some of the spaces in $str were not actually spaces. They looked like spaces, but they were actually character 0xA0, a high-bit character well outside 7-bit ASCII.
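For anyone who would rather not trust their browser's copy and paste, here is a rough reconstruction with the impostors written out as \xA0 escapes. Where exactly they sit in the string is my guess, worked backwards from the output reported above. The pairs() helper is made up for illustration, and the /a modifier (Perl 5.14 or later) is an assumption used to pin \s to ASCII whitespace so the result does not depend on other settings.

```perl
use strict;
use warnings;

# Assumed layout: real spaces around "a b c" and "h u h ?",
# 0xA0 characters inside and just after "o o p s".
my $str = "a b c o\xA0o\xA0p\xA0s\xA0h u h ? ";

# Hypothetical helper: collect every "any character + whitespace" pair.
# /a restricts \s to ASCII whitespace, so 0xA0 never counts as a space.
sub pairs {
    my ($s) = @_;
    return join '', $s =~ /(.\s)/ag;
}

print pairs($str), "\n";    # prints "a b c h u h ? "
```

The "o o p s" vanishes because none of its letters is followed by anything that \s is willing to call whitespace.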

How did that character get there? In my code, $str came from a parsed web page. "What kind of deranged webmonkey would use high-bit characters masquerading as spaces in their HTML?" I wondered. I checked the source of the page and it was not possessed by any such evil characters. It did, however, have the seemingly innocuous entities, '&nbsp;', where the evil spaces were in my parsed HTML.

"Aha! I've discovered a bug in the HTML parser!", I happily exclaimed. Tracing through the code of this module led me to HTML::Entities, wherein I saw that '&nbsp;' was indeed decoded as character 0xA0.

The following snippet demonstrates this behaviour quite well (copy and paste at your leisure):

use HTML::Entities;
my $html = 'whoa dude whats going&nbsp;on&nbsp;with this line';
my $text = HTML::Entities::decode_entities($html);

print "$text\n";
  ## prints: whoa dude whats going on with this line
print $& while $text =~ /\w+\s+/g;
  ## prints: whoa dude whats with this

So was this a bug? As it turns out, '&nbsp;' is decoded exactly as it should be according to the HTML specs, into character 0xA0 of ISO-8859-1 (Latin-1). This is not the space many of us know, love, and expect. This is a wanton doppelganger space, which looks like a space, copies like a space, pastes like a space, and spaces like a space, but is not, in any true sense, a space.
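Whether Perl itself agrees that this is not a space depends on which matching rules are in force. A minimal sketch, assuming Perl 5.14 or later for the /a and /u regex modifiers; the two helper names are invented for illustration:

```perl
use strict;
use warnings;

my $nbsp = "\xA0";    # the character that &nbsp; decodes to

# /a limits \s to ASCII whitespace; /u asks for Unicode rules,
# under which the no-break space is classed as whitespace.
sub is_ascii_space   { return $_[0] =~ /\A\s\z/a ? 1 : 0 }
sub is_unicode_space { return $_[0] =~ /\A\s\z/u ? 1 : 0 }

printf "ASCII rules:   %d\n", is_ascii_space($nbsp);      # 0
printf "Unicode rules: %d\n", is_unicode_space($nbsp);    # 1
```

With neither modifier, and in code that has not enabled locale or the unicode_strings feature, a plain byte string should get the ASCII behaviour, which is the behaviour described in this post.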

I don't very much care for this "non-breaking space", as it's called. My meditation, feeble though it may be, is this: unless you should have some specific want or need of this bastard space, exorcise it early (tr/\240/ /) from all of your entity-decoded inbound HTML.
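As a sketch of what that exorcism might look like in practice: the exorcise_nbsp() name is hypothetical, and the input string stands in for whatever decode_entities() returned, with \xA0 written out explicitly. The /a modifier on the match (Perl 5.14 or later) is again an assumption, there only to keep \w and \s to their ASCII meanings.

```perl
use strict;
use warnings;

# Hypothetical helper: turn every 0xA0 (octal 240) into a plain space.
sub exorcise_nbsp {
    my ($text) = @_;
    $text =~ tr/\240/ /;
    return $text;
}

# Stand-in for entity-decoded HTML containing two no-break spaces.
my $decoded = "whoa dude whats going\xA0on\xA0with this line";
my $clean   = exorcise_nbsp($decoded);

# Every word followed by whitespace is now found, "going" and "on" included.
my @words = $clean =~ /(\w+)\s+/ag;
print "@words\n";    # prints "whoa dude whats going on with this"
```

Doing the translation once, right after decoding, means no later regex has to remember the special case.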

   MeowChow                                   
               s aamecha.s a..a\u$&owag.print


Replies are listed 'Best First'.
(jcwren) Re: Spaced Out by jcwren (Prior) on Feb 22, 2001 at 22:07 UTC
Occasionally, you'll also find a document that uses "Microsoft Smart-Quotes" or somesuch garbage. These are a result of people creating web pages in MS-Word and MS-Powerpoint. It's most often evident in single and double quoting. 0x93 and 0x94 are the pair for double quotes; I don't know about single quotes offhand. I'm a little unclear what the whole point of "Smart-Quotes" is, but they break things. In particular, they render as garbage on most non-MS browsers. Rather than duplicate some already good information, here's a link to an explanation, and a Perl program that strips the offending characters out.
--Chris
Re: (jcwren) Re: Spaced Out by TStanley (Canon) on Feb 22, 2001 at 22:49 UTC
You can also include the Composer feature in Netscape Communicator. It too has the nasty habit of adding the &nbsp; entity to the code.
TStanley
In the end, there can be only one!
Re: (jcwren) Re: Spaced Out by elusion (Curate) on Feb 23, 2001 at 01:10 UTC
As to the Microsoft products, I'm sure it's some evil plan to try and establish their "dominance". What better way to do this than to invent "technology" and have it implemented in only your products? I do use the `'&nbsp;'` once in a while on my pages, but with good reason. With Netscape at least, empty tables will disappear without one of these characters in the table. I use colored tables on my pages quite extensively instead of horizontal rules, and this is necessary to do the job.
- p u n k k i d
"Reality is merely an illusion, albeit a very persistent one." -Albert Einstein
Update: Sorry folks, didn't read all the posts before I commented about the tables.
Re: Spaced Out by SilverB1rd (Scribe) on Feb 22, 2001 at 22:09 UTC
This "non-breaking space" is very useful when working with tables. But programs like Dreamweaver use them like there's no tomorrow. The nbsp is in 'theory' a space that the word wrapper does not see as a space. It sounds like the nbsp is not to blame, but rather the HTML parser.
Re: Spaced Out by Adam (Vicar) on Feb 22, 2001 at 21:56 UTC
Very enlightening. I wonder, though, whether future iterations of Perl ought to recognize \xA0 as a member of \s.
Re^2: Spaced Out by tadman (Prior) on Feb 24, 2001 at 04:13 UTC
It's probably best left optional, since the entire idea of having non-breaking spaces is, not surprisingly, to prevent the breakage of something. Although admittedly overused on the Web at large, the principle is to provide a visual space between two 'words' which aren't meant to be separated. As such, something like 'Perl&nbsp;Monks' should not be treated as two words, but rather as a single word. Including 0xA0 in \s would defeat the entire purpose of having &nbsp; in the first place. Instead, you could build methods into HTML::Message to strip out these invisible buggers, which is really only a single tr/\xA0/\x20/ operation anyway. You might find that 0xA0 isn't the only "invisible" character out there either, as it depends on the font that you are using, and will likely vary from UNIX to Windows to Macintosh. Sometimes, if the font doesn't have a defined character for that position, it draws nothing: a zero-width non-character that is there, but not.
don't change \s semantics by grinder (Bishop) on Feb 23, 2001 at 15:29 UTC
That's not a very smart idea. Under a certain commercial operating system that shall remain nameless, 0xa0 maps to á. I have some scripts that would be seriously bent by such a change in the semantics of `\s`. OTOH, it would be very nice to be able to define your own idea of what \s (and cohorts) should represent... I can't count the number of times I match `[A-Za-z0-9]` because I don't want the underscore. I know I can match `[^_\W]` but people find that a little obfuscated around here. (Clarification: "here" means where I work, not the monastery.) `<update date="2005-01-08">` Note that the `[^_\W]` trick does not work as expected with 5.8 when Unicode comes into play...`</update>`
Re: Spaced Out by clemburg (Curate) on Feb 23, 2001 at 15:22 UTC
Thanks for this excellent and illuminating meditation. OTOH, I feel there's no need to call for the perl porters to include this little wishlet in their big task list. It's trivial to work around this problem (once you know it exists, admittedly; that's something for the HTML::Parser docs, or maybe the Perl FAQ). And it even looks more readable:

use HTML::Entities;

my $html = '&nbsp;';
my $text = HTML::Entities::decode_entities($html);
print "\$text = '$text'\n";

# qr// keeps \s intact; in a double-quoted string "\s" would collapse to a plain "s"
my $whitespace = qr/[\s\240]/;

if ($text =~ /^$whitespace$/) {
    print "yes\n";
}
else {
    print "no\n";
}

Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
Re: Spaced Out by dws (Chancellor) on Feb 23, 2001 at 00:05 UTC
There's a very good argument that what you've run into is a "requirements bug". The HTML entity `&nbsp;` exists for layout formatting. Other than making it possible for `HTML::Entities::encode_entities()` and `HTML::Entities::decode_entities()` to be symmetric, what purpose is served by translating `&nbsp;` into '\240'? Does anyone have any insight to share on this?
Re: Spaced Out by Maclir (Curate) on Feb 23, 2001 at 03:09 UTC
An excellent piece of work. I must say that I tend to use the non-breaking space in my HTML code, both to fill otherwise empty table cells, and for its true purpose: to insert a space between characters that I don't want wrapped over two lines. For example, writing out telephone numbers: Phone: +61 2 9999 3114 Mobile: 0407 265 199. My recommendation would be that the character `0xA0` should be included in the whitespace character class (`\s`). Something for the Perl-5 porters?
Re: Spaced Out by Chmrr (Vicar) on Feb 23, 2001 at 02:19 UTC
Well, it did exactly what it was advertised as doing -- it didn't break on that space. ;> Of course, this is yet another case of people using tags or special characters for what they look like, instead of what they mean. O, for an enforceable standard..
perl -e 'print "I love $^X$\"$]!$/"#$&V"+@( NO CARRIER'

