Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8? (Areas of confusion)

by ikegami (Pope)
on Jun 17, 2011 at 15:53 UTC ( #910190=note: print w/ replies, xml ) Need Help??


in reply to Re: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
in thread What's the best way to detect character encodings, Windows-1252 v. UTF-8?

That code would only guess wrong if all of the following are true:

  • The text is encoded using Windows-1252,
  • At least one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷] is present,
  • All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß] are always followed by exactly one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
  • All instances of [àáâãäåæçèéêëìíîï] are always followed by exactly two of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
  • All instances of [ðñòóôõö÷] are always followed by exactly three of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
  • None of [øùúûüýþÿ] are present, and
  • None of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿] are present except where previously mentioned.

In other words, that code is very reliable.


Comment on Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8? (Areas of confusion)
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://910190]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (6)
As of 2014-07-13 14:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (249 votes), past polls