Strange letters ...

by LanX (Sage)
on Jul 22, 2009 at 21:38 UTC (#782459=monkdiscuss)

From time to time (not reproducible!) some non-ASCII letters get mixed up here during edits/updates.

e.g. "¹" gets transformed into "Â¹"... I suppose it's a Unicode problem...

Is this a browser issue and should I change my settings in FF3?

Or is it a strange bug in the monastery?

Or is it just me who observes this phenomenon?

Cheers Rolf


It occurred again with this post, so I saved it and made a diff in hexl-mode in emacs. What I got is a "c2b9" instead of a simple "b9" for "¹".
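The hex values in that diff can be reproduced with a minimal sketch. It assumes the character is SUPERSCRIPT ONE (U+00B9), which is the character that byte "b9" stands for in Windows-1252 and Latin-1; the variable names are only for illustration.

```python
# SUPERSCRIPT ONE (U+00B9), written as an escape to keep this source ASCII-only.
ch = "\u00b9"

single_byte = ch.encode("cp1252")  # one byte in Windows-1252 (same in Latin-1)
utf8_bytes = ch.encode("utf-8")    # two bytes in UTF-8

print(single_byte.hex())  # b9
print(utf8_bytes.hex())   # c2b9

# Reading the two UTF-8 bytes as if they were Windows-1252 yields two
# characters: U+00C2 (A with circumflex) followed by U+00B9.
misread = utf8_bytes.decode("cp1252")
print(ascii(misread))     # '\xc2\xb9'
```

So an extra "c2" in front of "b9" is what one would expect if the character went out as UTF-8 somewhere along the way and was then read back as a single-byte encoding.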

Replies are listed 'Best First'.
Re: Strange letters ...
by moritz (Cardinal) on Jul 22, 2009 at 21:51 UTC
    PerlMonks uses the Windows-1252 encoding (similar to Latin-1), and all characters that are not in that character set are HTML-escaped. That doesn't work inside <code>...</code> tags, because everything is interpreted literally there.

    Could that be the cause of your observations?


      Sometimes, I see some characters as the cp1252 and/or iso-latin-1 interpretation of their UTF-8 encoding.

      Assuming it's not a client side bug, it looks like PerlMonks outputs UTF-8 while claiming to output cp1252 in some circumstances. I'll see if I can nail down the details the next time it happens to me.

      If I view the same node from a different URL, I see the characters correctly. This could indicate a problem with one of the special nodes (e.g. node 3333).

        Assuming it's not a client side bug, it looks like PerlMonks outputs UTF-8 while claiming to output cp1252 in some circumstances.

        PerlMonks certainly may, when somebody submits UTF-8 to be displayed. HTTP has very good ways of clearly saying what encoding is being downloaded, but it is much less clear about how a site is supposed to tell the client what encoding it would like uploads in, and about how a client tells the server what encoding its upload is in.

        The state of the art on those two points appears to often come down to... guessing. Clients tend to guess that servers want stuff uploaded in the same encoding that the server used to present the form that offered the opportunity to upload. Servers tend to guess that uploads are done in the encoding that they tend to produce while some also look for byte sequences that seem likely in UTF-8 and then guess that what is being uploaded is UTF-8. There are also special Unicode escapes for URL-encoded data that servers can notice (however, a URL being in "escaped Unicode" doesn't necessarily say anything about any other parts of the upload).
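The "look for byte sequences that seem likely in UTF-8" kind of server-side guess might look roughly like this. It is only a sketch of the general heuristic, not what PerlMonks or any particular server does; the function name `guess_decode` and the Windows-1252 fallback are assumptions made for the example.

```python
def guess_decode(raw):
    """Return (text, guessed_encoding) for a submitted byte string."""
    try:
        # Strict UTF-8 decoding rejects most 8-bit text that was written in
        # a single-byte encoding, so success is taken as a hint of UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to the encoding the site normally produces. A few byte
        # values are undefined in cp1252, hence errors="replace".
        return raw.decode("cp1252", errors="replace"), "cp1252"


print(guess_decode(b"\xc2\xb9")[1])  # utf-8
print(guess_decode(b"\xb9")[1])      # cp1252
```

It remains a guess: a genuine Windows-1252 submission of the two characters "Â¹" is byte-for-byte identical to the UTF-8 encoding of "¹", and plain ASCII text looks the same in both encodings.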

        And there are cases where this is especially likely to go wrong at PerlMonks. If the client (like most) guesses that Windows-1252 is desired because a page at PerlMonks proclaims itself to be in Windows-1252 (like most of them do, but not all of them, at this point), then the client has to make yet another guess if the data to be submitted contains a character that is not covered by Windows-1252. Most clients, IME, guess that the way to deal with this is to HTML-escape the problem character using an HTML entity. And at PerlMonks, in many cases, that is correct. moritz noted that in the case of text inside of <code> tags, that guess fails (but he was incorrect in just proclaiming that HTML escaping is always what clients choose to use). It also fails for node titles. Some clients instead guess that the server might not be expecting HTML and opt to send the submission in UTF-8 so that they can include the troublesome character (probably guessing that the server will notice the typical pattern of UTF-8 bytes and guess correctly). Some clients guess that perhaps neither route will work and just send '?' for the troublesome character.
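The three client-side guesses described above can be imitated with Python's standard encoding error handlers. This is only an illustration of the resulting bytes, not a claim about how any specific browser is implemented; the sample character is the "⅛" (U+215B) mentioned elsewhere in this thread, which has no Windows-1252 representation.

```python
text = "\u215b"  # VULGAR FRACTION ONE EIGHTH, not available in Windows-1252

# Guess 1: stay in Windows-1252 and send a numeric HTML entity instead.
as_entity = text.encode("cp1252", errors="xmlcharrefreplace")
print(as_entity)    # b'&#8539;'

# Guess 2: give up on Windows-1252 and send the submission as UTF-8.
as_utf8 = text.encode("utf-8")
print(as_utf8)      # b'\xe2\x85\x9b'

# Guess 3: stay in Windows-1252 and send '?' for the troublesome character.
as_question = text.encode("cp1252", errors="replace")
print(as_question)  # b'?'
```

Only the first form survives as the intended character on a page declared as Windows-1252, and only where the server treats the text as HTML, which is why it breaks inside <code> tags and in node titles.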

        I only have a vague recollection of the last time I heard of somebody looking at how the PerlMonks server guesses about encoding of uploads. But that vague memory says that PerlMonks notices Unicode escapes in URLs and doesn't notice UTF-8-like byte sequences and never guesses "UTF-8" about encoding of anything other than URL-encoded data.

        It used to be worse when PerlMonks claimed Latin-1 encoding when it was actually just re-sending out whatever bytes people were sending to it. Some Windows users would send bytes that represent characters in Windows-1252 but not in Latin-1. When in a node title, some Unix clients would try to deal with these strictly-speaking "illegal" bytes in interesting ways. Some would actually assume that the byte was really meant to be the Windows-1252 character despite the declared Latin-1 encoding. But then they would refuse to lower themselves to respond in kind and would struggle with what to do when asked to send back that byte. I was particularly amused to see some sending the UTF-8 encoding of the character (which demonstrates a certain kind of "double think" to my eye).
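To make the Latin-1 versus Windows-1252 difference concrete, here is a small sketch using byte 0x93 as an arbitrary example (the post does not name a specific byte). In Windows-1252 it is a curly opening quote; in Latin-1 it is an unprintable C1 control character.

```python
import unicodedata

raw = b"\x93"  # example byte from the 0x80-0x9F range where the two differ

as_cp1252 = raw.decode("cp1252")   # U+201C LEFT DOUBLE QUOTATION MARK
as_latin1 = raw.decode("latin-1")  # U+0093, a C1 control character

print(unicodedata.name(as_cp1252))      # LEFT DOUBLE QUOTATION MARK
print(unicodedata.category(as_latin1))  # Cc

# A client that reads the byte as Windows-1252 but replies in UTF-8
# would send back three bytes for what arrived as one.
print(as_cp1252.encode("utf-8").hex())  # e2809c
```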

        A much better solution, which I've suggested but which neither I nor anybody else has implemented at PerlMonks, is to include a 'hidden' field in each of our forms whose value always contains a character/byte with the eighth bit set. Then we can quite deterministically tell whether or not the client is uploading in UTF-8.
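The hidden-field idea could be sketched as follows. Nothing here exists at PerlMonks: the sentinel character, the function name, and the idea of comparing raw bytes server-side are all assumptions made for the example.

```python
# Hypothetical sentinel placed in a hidden form field: U+00E9 (e with acute),
# a single byte with the eighth bit set in Windows-1252.
SENTINEL = "\u00e9"


def detect_upload_encoding(sentinel_bytes):
    """Classify a submission by the raw bytes received for the sentinel."""
    if sentinel_bytes == SENTINEL.encode("utf-8"):   # b'\xc3\xa9'
        return "utf-8"
    if sentinel_bytes == SENTINEL.encode("cp1252"):  # b'\xe9'
        return "cp1252"
    # Anything else (an entity, '?', another encoding) is left undecided.
    return "unknown"


print(detect_upload_encoding(b"\xc3\xa9"))  # utf-8
print(detect_upload_encoding(b"\xe9"))      # cp1252
print(detect_upload_encoding(b"?"))         # unknown
```

Because the server knows exactly which character it put in the field, the bytes that come back identify the encoding the client used for the whole submission, with no pattern-matching on the user's own text.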

        More likely, we'll just convert all of our content to UTF-8, mostly so we can include the interesting characters inside of <code> tags, especially for Perl 6 code (once PerlMonks starts declaring all of our pages as being UTF-8, pretty much every client will always upload to us in UTF-8).

        But, actually, that probably has nothing to do with what you have observed.

        On rare occasions I have observed nodes that contain 8-bit characters rendering incorrectly. When this has happened, it is rather random whether a refresh will be rendered correctly or not. I believe such strangeness (based in part on other, similar cases of strangeness) is actually due to bugs in Perl and/or Apache that sometimes eventually manifest once a single process with a single Perl interpreter instance has served up a few hundred or a few thousand web pages. We'll get a few children having one of these problems, and a refresh will sometimes hit a confused child and sometimes not. Restarting the web server makes the problem impossible to reproduce again. Waiting quite a while also usually ends with the problem just disappearing again.

        - tye        

      No, it was definitely not a code area.

      I had this problem with the footnote in [emacs] perl info file and I didn't try to repair it.

      Though now it magically disappeared...

      As I said, it occurs sometimes while updating or previewing a post. I'll check the header next time; maybe the CGI gets confused.

      Cheers Rolf

      UPDATE: IIRC it also happened with signs. And escaping to an HTML-Entity occurs with symbols like ⅛, but that's not the case here.

      To reproduce the problem, click moderate and observe this node.

      Update: Never mind.

Re: Strange letters ...
by blahblahblah (Priest) on Jul 24, 2009 at 04:39 UTC
    testing a theory: "¹" "¹"
      I doubt this is what people are doing, but I achieved the above by manually changing the page's encoding (view -> encoding -> utf-8) and then submitting my node. Maybe some people's browsers are set to auto-detect encoding and are fouling up their submissions without them even knowing it?
        I doubt it, because it's a very temporary phenomenon. This extra "Â" disappears completely after revisiting the node and may reappear later! But the change you made is permanently stored in the database!

        But I'm not sure whether it's just a Firefox thing. It would be nice to know if it's observable with any other browser.

        My standard encoding is set to ISO 8859-1...

        Cheers Rolf
