Looks okay to me. I thought your test string might be a bit too easy (it didn't cover enough possible troublemakers), and I wondered whether calling "decode_entities" before the cp1252 lookup might cause a problem: when you decode an entity like "&#209;" ("Ñ"), you get a utf8 byte sequence that includes the byte 0x91, which might get mistreated by the cp1252_lookup.

But then I tried it out, adding "&#209;" and "&#210;" to the test string, and they magically came out right:

...
my $str = join('',
  chr(0x93), 'double', chr(0x94), ' &#209; &#210; ', 
  chr(0x201C), 'double', chr(0x201D),
  '&lsquo;single&rsquo;'
);
...
output:

&ldquo;double&rdquo; &Ntilde; &Ograve; &ldquo;double&rdquo;&lsquo;single&rsquo;

which looks like what you would want to get.
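
A possible explanation, sketched under the assumption that the lookup does character-level substitutions on the decoded string: a decoded "Ñ" is the single character U+00D1, and the utf8 bytes 0xC3 0x91 only appear once the string is encoded. The `s/\x91/.../` below is a hypothetical stand-in for the cp1252 lookup, not the code from the thread, and only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00D1 as a one-character Perl string, which is what decoding
# the entity &#209; is expected to produce.
my $chars = chr(0xD1);

# The same character as raw utf8 bytes.
my $bytes = encode('UTF-8', $chars);
my $hex   = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Hypothetical stand-in for a cp1252 lookup: replace 0x91 with &lsquo;
(my $char_level = $chars) =~ s/\x91/&lsquo;/g;  # no match: string holds only U+00D1
(my $byte_level = $bytes) =~ s/\x91/&lsquo;/g;  # matches the second utf8 byte

print "utf8 bytes: $hex\n";
print "character-level changed: ", ($char_level ne $chars ? 'yes' : 'no'), "\n";
print "byte-level changed: ",      ($byte_level ne $bytes ? 'yes' : 'no'), "\n";
```

If the lookup were applied to already-encoded bytes instead, the 0x91 inside the sequence would be rewritten, which is the failure the original worry describes.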

Update: based on your reply, I figured it might make sense to try numeric character entities above 0xff, e.g. the ones for Ǒ (U+01D1) and ǒ (U+01D2); when converted to utf8, these have 0x91 and 0x92 as the second byte. It still works the way you would want, converting them correctly to hex-coded numeric entities (&#x1D1; and &#x1D2;, upper and lower case letter o with caron, respectively).
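
The byte values mentioned in the update can be checked with a short sketch using only the core Encode module. The `&#x%X;` format string is an assumption about how the hex-coded entities are written, not something taken from the thread's code.

```perl
use strict;
use warnings;
use Encode qw(encode);

my @rows;
for my $cp (0x1D1, 0x1D2) {
    # Encode the single character to utf8 and list its bytes in hex.
    my $bytes  = encode('UTF-8', chr($cp));
    my $hex    = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
    # Assumed entity format: uppercase hex, no zero padding.
    my $entity = sprintf '&#x%X;', $cp;
    push @rows, sprintf('U+%04X utf8: %s entity: %s', $cp, $hex, $entity);
}
print "$_\n" for @rows;
```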

In reply to Re: Fixing suspect characters in HTML by graff
in thread Fixing suspect characters in HTML by wfsp