PerlMonks  

Spaced Out

by MeowChow (Vicar)
on Feb 22, 2001 at 21:46 UTC (#60270=perlmeditation)

There I was, slurping data from web pages near and far, slinging regexes and treebuilders, elements and parsers, laughing at the remarkable ease with which Perl and my armada of CPAN modules untangled the gnarliest messes of unstructured data. Indeed, all was going quite well, until I spent several hours belaboring what reduced to the following code:
$str = 'a b c o o p s h u h ? '; print $& while $str =~ /.\s/g;
In a sane world, this would have spit out "a b c o o p s h u h ?" and I could have gone along my merry way. Ah, but what fun would that have been? Instead of that altogether predictable and boring output, what I got was "a b c h u h ?"

To witness this result for yourself, you will probably have to click the link to download the code (though you can copy-paste it under Opera, which makes it seem all the more spooky, you can't under IE, and I'm not sure about Mozilla...) Actually, you may not see a problem with this code at all, depending on your screen font.

If you've been paying attention, you've quite probably figured out what my problem was. Some of the spaces in $str were not actually spaces. They looked like spaces, but they were actually 0xA0 characters: high-bit Latin-1 characters, strictly speaking, since plain ASCII stops at 0x7F.
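For anyone whose browser or font swallows the difference, here is a minimal sketch of the same experiment with the no-break spaces written as explicit "\xA0" escapes. Exactly which separators were 0xA0 is an assumption, inferred from the output quoted above; matched_pairs is just a hypothetical helper name.

```perl
use strict;
use warnings;

# Return everything /.\s/g matches, glued together, as the one-liner's print loop would.
sub matched_pairs {
    my ($s) = @_;
    return join '', $s =~ /.\s/g;
}

# Assumed reconstruction: no-break spaces (0xA0) after "o", "o", "p" and "s".
my $spooky = "a b c o\xA0o\xA0p\xA0s\xA0h u h ? ";
# The same text with ordinary spaces (0x20) throughout.
my $plain  = "a b c o o p s h u h ? ";

print "spooky: ", matched_pairs($spooky), "\n";
print "plain:  ", matched_pairs($plain),  "\n";
```

On a plain byte string with Perl's default matching rules, \s does not accept 0xA0, so the "o o p s" pairs drop out of the first line and survive in the second.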

How did that character get there? In my code, $str came from a parsed web page. "What kind of deranged webmonkey would use high-bit ASCII characters masquerading as spaces in their HTML?" I wondered. I checked the source of the page and it was not possessed with any such evil characters. It did, however, have the seemingly innocuous entities, '&nbsp;', where the evil spaces were in my parsed HTML.

"Aha! I've discovered a bug in the HTML parser!", I happily exclaimed. Tracing through the code of this module led me to HTML::Entities, wherein I saw that '&nbsp;' was indeed decoded as character 0xA0.

The following snippet demonstrates this behaviour quite well (copy and paste at your leisure):

    use HTML::Entities;

    my $html = 'whoa dude whats going&nbsp;on&nbsp;with this line';
    my $text = HTML::Entities::decode_entities($html);
    print "$text\n";                      ## prints: whoa dude whats going on with this line
    print $& while $text =~ /\w+\s+/g;    ## prints: whoa dude whats with this

So was this a bug? As it turns out, '&nbsp;' is decoded exactly as it should be according to HTML specs, into character 0xA0. This is not the space many of us know, love, and expect. This is a wanton doppelganger space, which looks like a space, copies like a space, pastes like a space, and spaces like a space, but is not, in any true sense, a space.

I don't very much care for this "non-breaking space", as it's called. My meditation, feeble though it may be, is this: unless you should have some specific want or need of this bastard space, exorcise it early (tr/\240/ /) from all of your entity-decoded inbound HTML.
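A minimal sketch of that exorcism, wrapped in a helper so it can be applied right after decoding. The name exorcise_nbsp is made up for illustration, and the input string stands in for whatever decode_entities hands back.

```perl
use strict;
use warnings;

# Hypothetical helper: turn every no-break space (0xA0, octal 240) into a plain space.
sub exorcise_nbsp {
    my ($text) = @_;
    $text =~ tr/\xA0/ /;
    return $text;
}

# Stand-in for entity-decoded HTML text containing two no-break spaces.
my $decoded = "going\xA0on\xA0with this ";
my $clean   = exorcise_nbsp($decoded);

# After cleaning, /\w+\s+/g sees every word again.
my @words = $clean =~ /\w+\s+/g;
print scalar(@words), " words matched\n";
```

The helper works on a copy, so the caller still has the original text in case the no-break spaces turn out to matter later.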

   MeowChow                                   
               s aamecha.s a..a\u$&owag.print

Re: Spaced Out
by Adam (Vicar) on Feb 22, 2001 at 21:56 UTC
    Very enlightening. I wonder though, perhaps future iterations of Perl ought to recognize 0xA0 as a member of \s.

      That's not a very smart idea. Under a certain commercial operating system that shall remain nameless, 0xa0 maps to á.

      I have some scripts that would be seriously bent by such a change in semantics of \s.

      OTOH, it would be very nice to be able to define your own idea of what \s (and cohorts) should represent... I can't count the number of times I match [A-Za-z0-9] because I don't want the underscore. I know I can match [^_\W] but people find that a little obfuscated around here. (clarification: where here means where I work, not the monastery).

      <update date="2005-01-08"> Note that the [^_\W] trick does not work as expected with 5.8 when Unicode comes into play...</update>
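One way to keep the [^_\W] trick honest once Unicode is around, sketched under the assumption that Perl 5.14 or later is available: the /a modifier pins \w and \W to their ASCII meanings, so the class is exactly [A-Za-z0-9] again. The helper name is_alnum_only is hypothetical.

```perl
use strict;
use warnings;

# [^\W_] reads "not a non-word character, and not an underscore".
# The /a modifier (Perl 5.14+) restricts \w and \W to ASCII rules.
my $alnum = qr/[^\W_]/a;

# Hypothetical helper: true only if the whole string is ASCII letters and digits.
sub is_alnum_only {
    my ($s) = @_;
    return $s =~ /\A$alnum+\z/ ? 1 : 0;
}

for my $candidate ('abc123', 'foo_bar', "caf\x{e9}") {
    printf "%-8s %s\n", is_alnum_only($candidate) ? 'ok' : 'rejected', $candidate;
}
```

Storing the class in a qr// object keeps the /a flag attached wherever the pattern is interpolated.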

      It's probably best left optional since the entire idea of having non-breaking spaces, is, not surprisingly, to prevent the breakage of something. Although admittedly overused on the Web at large, the principle is to provide a visual space between two 'words' which aren't meant to be separated. As such, something like 'Perl&nbsp;Monks' should not be treated as two words, but rather, as a single word. Including 0xA0 in \s would defeat the entire purpose of having &nbsp; in the first place.

      Instead, you could build methods into HTML::Message to strip out these invisible buggers, which is really only a single tr/\xA0/\x20/ operation anyway.

      You might find that 0xA0 isn't the only "invisible" character out there either, as it depends on the font that you are using, and will likely vary from UNIX to Windows to Macintosh. Sometimes if the font doesn't have a defined character for that position, it draws nothing, a zero width non-character that is there, but not.
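Since these characters hide so well, a small scanner can help. This is only a sketch: it treats anything outside printable ASCII, tab, newline and carriage return as suspicious, which is an assumption that will be too strict for text that legitimately contains accented characters. The name suspicious_chars is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: report the offset and code of every character that is
# neither printable ASCII (0x20-0x7E) nor tab, newline or carriage return.
sub suspicious_chars {
    my ($text) = @_;
    my @found;
    while ($text =~ /([^\x09\x0A\x0D\x20-\x7E])/g) {
        # pos() points just past the one-character match, so step back by one.
        push @found, sprintf('offset %d: 0x%02X', pos($text) - 1, ord $1);
    }
    return @found;
}

print "$_\n" for suspicious_chars("Perl\xA0Monks");
```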
(jcwren) Re: Spaced Out
by jcwren (Prior) on Feb 22, 2001 at 22:07 UTC

    Occasionally, you'll also find a document that uses "Microsoft Smart-Quotes" or somesuch garbage. These are a result of people creating web pages in MS-Word and MS-Powerpoint. It's most often evident in single and double quoting. 0x93 and 0x94 are the pair for double quotes; I don't know about single quotes offhand.

    I'm a little unclear what the whole point of "Smart-Quotes" is, but they break things. In particular, they render as garbage on most non-MS browsers.

    Rather than duplicate some already good information, here's a link to an explanation, and a Perl program that strips the offending characters out.
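This is not the linked program, just a minimal stand-in sketch. It assumes the text is raw Windows-1252 bytes, where 0x91 and 0x92 are the single curly quotes and 0x93 and 0x94 the double ones, and it only handles those four characters. The name dumb_quotes is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: map the four Windows-1252 "smart quote" bytes
# to their plain ASCII equivalents.
sub dumb_quotes {
    my ($text) = @_;
    $text =~ tr/\x91\x92\x93\x94/''""/;
    return $text;
}

print dumb_quotes("\x93Smart\x94 quotes aren\x92t"), "\n";
```

Text that has already been decoded to Unicode carries these quotes at different code points, so this byte-level mapping would not touch them.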

    --Chris

    e-mail jcwren
      You can also include the Composer feature in Netscape Communicator, which has the nasty habit of adding the &nbsp; character to the code as well.

      TStanley
      In the end, there can be only one!
      As to the Microsoft Products, I'm sure it's some evil plan to try and establish their "dominance". What better way to do this than invent "technology" and have it implemented in only your products?

      I do use the '&nbsp;' once in a while on my pages, but with good reason. With Netscape at least, empty tables will disappear without one of the characters in the table. I use colored tables on my pages quite extensively instead of horizontal rules, and this is necessary to do the job.

      - p u n k k i d
      "Reality is merely an illusion, albeit a very persistent one." -Albert Einstein

      Update: Sorry folks, didn't read all the posts before I commented about the tables

Re: Spaced Out
by SilverB1rd (Scribe) on Feb 22, 2001 at 22:09 UTC
    This "non-breaking space" is very useful when working with tables. But programs like Dreamweaver use them like there's no tomorrow. The nbsp is in 'theory' a space that the word wrapper does not see as a space. It sounds like the nbsp is not to blame but rather the HTML parser.
Re: Spaced Out
by dws (Chancellor) on Feb 23, 2001 at 00:05 UTC
    There's a very good argument that what you've run into is a "requirements bug". The HTML entity &nbsp; exists for layout formatting. Other than making it possible for HTML::Entities::encode_entities() and HTML::Entities::decode_entities() to be symmetric, what purpose is served by translating &nbsp; into '\240'?

    Does anyone have any insight to share on this?

Re: Spaced Out
by Chmrr (Vicar) on Feb 23, 2001 at 02:19 UTC
    Well, it did exactly what it was advertised as doing -- it didn't break on that space. ;> Of course, this is yet another case of people using tags or special characters for what they look like, instead of what they mean. O, for an enforceable standard..

    perl -e 'print "I love $^X$\"$]!$/"#$&V"+@( NO CARRIER'
Re: Spaced Out
by Maclir (Curate) on Feb 23, 2001 at 03:09 UTC
    An excellent piece of work. I must say that I tend to use the non-breaking space in my HTML code, both to fill otherwise empty table cells, and for its true purpose - to insert a space between characters that I don't want wrapped over two lines - for example, writing out telephone numbers:
    Phone:  +61 2 9999 3114
    Mobile: 0407 265 199

    My recommendation would be that the character 0xA0 should be included in the whitespace character class (\s). Something for the Perl-5 porters?

Re: Spaced Out
by clemburg (Curate) on Feb 23, 2001 at 15:22 UTC

    Thanks for this excellent and illuminating meditation.

    OTOH, I feel there's no need to call for the perl porters to include this little wishlet into their big task list. It's trivial to work around this problem (once you know it exists, admittedly - that's something for the HTML::Parser docs, or maybe the Perl FAQ). And it even looks more readable:

    use HTML::Entities;

    my $html = '&nbsp;';
    my $text = HTML::Entities::decode_entities($html);
    print "\$text = '$text'\n";

    # qr// keeps \s as a regex escape; in a double-quoted string "\s" would collapse to a plain "s"
    my $whitespace = qr/[\s\240]/;

    if ($text =~ /^$whitespace$/) {
        print "yes\n";
    }
    else {
        print "no\n";
    }
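For what it's worth, later Perls (5.14 and up) let each regex choose which idea of \s it wants, which covers both camps in this thread. A minimal sketch, with made-up helper names: /a keeps \s strictly ASCII, while /u applies Unicode rules, under which the no-break space counts as whitespace.

```perl
use strict;
use warnings;

# /a: ASCII rules, so \s never matches 0xA0.
sub is_space_ascii {
    my ($char) = @_;
    return $char =~ /\A\s\z/a ? 1 : 0;
}

# /u: Unicode rules, so \s also matches the no-break space.
sub is_space_unicode {
    my ($char) = @_;
    return $char =~ /\A\s\z/u ? 1 : 0;
}

my $nbsp = "\xA0";
print "ASCII rules:   ", is_space_ascii($nbsp),   "\n";
print "Unicode rules: ", is_space_unicode($nbsp), "\n";
```

Scripts that depend on 0xA0 not being whitespace can say so with /a, and the rest can opt in with /u, without anyone redefining \s globally.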

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com
