PerlMonks  

Fixing suspect characters in HTML

by wfsp (Abbot)
on Jul 30, 2005 at 12:02 UTC (#479601=perlquestion)
wfsp has asked for the wisdom of the Perl Monks concerning the following question:

Due to carelessness on my part I had a shed load of HTML containing suspect characters. The difficulty was that any file could contain a mix of x80-x9F characters (frowned on by the W3C), Unicode characters and HTML entities (including numeric entities).
The strategy I arrived at was to:
  1. decode any entities present
  2. convert x80-x9F to unicode equivalents
  3. encode 'unsafe' characters

This should, hopefully, give consistent HTML and prevent problems during any future processing.

What do you reckon?

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

my $lookup = get_cp1252_lookup();

my $str = join('',
    chr(0x93), 'double', chr(0x94),
    chr(0x201C), 'double', chr(0x201D),
    '&lsquo;single&rsquo;'
);

# "replaces HTML entities...
# with the corresponding Unicode character"
decode_entities($str);

# replaces x80-x9f with unicode equivalent
# (the lookup is keyed by number, so the case of the hex digits
# in the file, e.g. 0x8A vs 0x8a, doesn't matter)
$str =~ s/([\x80-\x9f])/$lookup->{ord($1)}/eg;

# "replaces unsafe characters...
# with their entity representation"
encode_entities($str);

print "$str\n";

# builds a hash ref: cp1252 code (as a number) => unicode character
sub get_cp1252_lookup{
    open my $fh, '<', 'cp1252_to_unicode.txt'
        or die "can't open input: $!";
    my $lookup;
    while (<$fh>){
        my ($cp1252, $utf8_str, $name) = split /\t/;
        # undefined positions have blanks in the second column
        my $utf8 = $utf8_str =~ / /? '':chr(oct($utf8_str));
        $lookup->{hex $cp1252} = $utf8;
    }
    return $lookup;
}
__END__
output:
&ldquo;double&rdquo;&ldquo;double&rdquo;&lsquo;single&rsquo;

extract from cp1252_to_unicode.txt (tab-separated):
0x91	0x2018	#LEFT SINGLE QUOTATION MARK
0x92	0x2019	#RIGHT SINGLE QUOTATION MARK
0x93	0x201C	#LEFT DOUBLE QUOTATION MARK
0x94	0x201D	#RIGHT DOUBLE QUOTATION MARK

Many thanks to all the monks who have helped.
John
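
The three steps in the question can also be sketched without a lookup file, assuming the core Encode module's cp1252 table is an acceptable stand-in for the unicode.org chart. This is only an illustration, not the script from the post: fix_suspect_chars is a hypothetical name, and how Encode treats the positions cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) is not covered here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(decode_entities encode_entities);

# Hypothetical helper: same three steps as the post, but step 2
# relies on Encode's cp1252 table instead of a lookup file.
sub fix_suspect_chars {
    my ($str) = @_;

    # 1. decode any entities present (in place, void context)
    decode_entities($str);

    # 2. treat each x80-x9F character as a cp1252 byte and decode it;
    #    pack('C', ...) produces the single byte that decode() expects
    $str =~ s/([\x80-\x9f])/decode('cp1252', pack('C', ord $1))/eg;

    # 3. encode 'unsafe' characters and return the result
    return encode_entities($str);
}

my $str = join('',
    chr(0x93), 'double', chr(0x94),
    chr(0x201C), 'double', chr(0x201D),
    '&lsquo;single&rsquo;'
);
print fix_suspect_chars($str), "\n";
# prints: &ldquo;double&rdquo;&ldquo;double&rdquo;&lsquo;single&rsquo;
```

Whether this is preferable to the lookup file probably depends on how the undefined positions should be handled; with the file, the script decides that itself.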

Re: Fixing suspect characters in HTML
by graff (Chancellor) on Jul 30, 2005 at 15:24 UTC
    Looks okay to me. I thought your test string might be a bit too easy (didn't cover enough possible trouble makers), and I wondered whether putting "decode_entities" before the cp1252 lookup might cause a problem (because when you decode entities like "&#209;" ("Ñ"), you get utf8 byte sequences that include bytes like 0x91, which might get mistreated by the cp1252_lookup).

    But then I tried it out, adding "Ñ" and "Ò" to the test string, and they magically came out right:

    ...
    my $str = join('',
        chr(0x93), 'double', chr(0x94),
        ' &#209; &#210; ',
        chr(0x201C), 'double', chr(0x201D),
        '&lsquo;single&rsquo;'
    );
    ...
    output:
    &ldquo;double&rdquo; &Ntilde; &Ograve; &ldquo;double&rdquo;&lsquo;single&rsquo;

    which looks like what you would want to get.

    Update: based on your reply, I figured it might make sense to try numeric character entities above 0xff -- e.g. &#465; and &#466; (when converted to utf8, these have 0x91 and 0x92 as the second byte). It still works the way you would want, converting them correctly to hex-coded numeric entities (&#x1D1; and &#x1D2;, upper and lower case letter o with caron, respectively).
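
A likely reason this works: decode_entities produces a Perl character string, so the [\x80-\x9f] class is compared against whole code points, not against the bytes of their utf8 encoding. The sketch below (variable names are illustrative only) contrasts the character &#465; with its encoded bytes, using Encode's encode to get the byte form.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = chr(0x1D1);              # one character: capital O with caron
my $bytes = encode('UTF-8', $char);  # its utf8 encoding: bytes 0xC7 0x91

# The character string holds a single code point above 0x9F: no match.
my $char_hit = $char =~ /[\x80-\x9f]/ ? 1 : 0;

# The encoded byte string contains the byte 0x91, which does match.
my $byte_hit = $bytes =~ /[\x80-\x9f]/ ? 1 : 0;

printf "characters: %d, match: %d\n", length($char),  $char_hit;
printf "bytes: %d, match: %d\n",      length($bytes), $byte_hit;
# prints:
# characters: 1, match: 0
# bytes: 2, match: 1
```

So the worry about stray 0x91 bytes would apply if the cp1252 replacement were run on undecoded utf8 octets, for example data read from a file without a decoding layer.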

      Thanks for your response.

      For me, your string returned:

      93 64 6f 75 62 6c 65 94 d1 20 d2 201c 64 6f 75 62 6c 65 201d 2018 73 69 6e 67 6c 65 2019
      xd1 and xd2 are outside the range being checked (x80-x9F), are legal unicode (the cp1252/unicode chart gives the same codes) and encode_entities returned (as you found):
      &Ntilde; &Ograve;
      Note that, for example, &rsquo; returned x2019 so wouldn't be mucked about by the cp1252 replacement.
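
A dump like the one above can be produced with a few lines. This is a guess at the approach, not the code actually used for the reply, and hex_dump is a hypothetical name:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: list each character's code point in hex.
sub hex_dump {
    my ($str) = @_;
    return join ' ', map { sprintf '%x', ord } split //, $str;
}

my $str = chr(0x93) . '&#209; &#210;' . '&rsquo;';
decode_entities($str);   # decodes in place when called in void context
print hex_dump($str), "\n";
# prints: 93 d1 20 d2 2019
```

Only the leading 93 falls inside x80-x9F, so it is the only character the cp1252 replacement would touch.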

      The comment in the script:

      # "replaces HTML entities...
      # with the corresponding Unicode character"
      is from the H::E doc and had more significance than I first realised.

      __But__ I too am surprised it appears to work. I wrote quite a bit of code to process utf8 and was almost a bit miffed that it seemed unnecessary!

      This is my first outing in these waters, so I will be pleased to be corrected if I've got any of this tangled up.

      Again, thanks for your comments,
      John

      unicode.org cp1252 chart

      update:

      Extract from the chart:

      cp1252  unicode
      0xD1    0x00D1   #LATIN CAPITAL LETTER N WITH TILDE
      0xD2    0x00D2   #LATIN CAPITAL LETTER O WITH GRAVE

Node Type: perlquestion [id://479601]
Approved by GrandFather
Front-paged by planetscape