PerlMonks  

Removing Unsafe Characters

by Praethen (Scribe)
on Apr 28, 2009 at 05:39 UTC ( [id://760515] )

Praethen has asked for the wisdom of the Perl Monks concerning the following question:

I'll do my best to get this right though I warn this is not an Intrepid approved post.

I have 40,000 mostly-HTML files that are generally displayed through a Perl script (usually one at a time). Many of these files seem to contain what look like nasty Unicode characters that browsers tend to render as boxes, question marks, or their "I can't print this character" glyph of the week.

I'm trying to scrub out these nasty Unicode characters, and with some success I'm using $input =~ s/[^\x00-\x7F^\xA1-\xFF]/\ /g;

This works fairly well, but it means I lose characters in the \xA0-\xFF range -- which is unfortunate because I'd rather convert those to their HTML equivalents. (So it's resumé instead of "resum".)

I came up with two techniques for this that I *thought* would work but I cannot find an acceptable syntax.

1. Search for the high-range codes that have HTML equivalents (\xA0-\xFF), convert each to decimal, and add the appropriate HTML entity prefix and suffix (i.e. \xE9 becomes &#233;): $input =~ s/([\xA0-\xFF])/&#ord($1);/gie;

That fails because Perl tries to evaluate &#ord($1); as code and can't. That's my problem -- maybe it's a very novice issue, but I don't know how to get the &# prefix and the ; suffix into the replacement; I've tried a dozen approaches and all of them were wrong.
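A minimal sketch of one way out, using the same character class: with /e the replacement side is evaluated as Perl code, so the literal &# and ; can be concatenated around ord() as strings (the /i flag does nothing in a character-class-only pattern and can be dropped):

```perl
# Build the replacement as a string expression so the literal "&#" and ";"
# wrap the decimal code point produced by ord().
my $input = "resum\xE9";   # Latin-1 byte for é
$input =~ s/([\xA0-\xFF])/'&#' . ord($1) . ';'/ge;
print "$input\n";          # prints resum&#233;
```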

2. bobf kindly put me onto http://search.cpan.org/~gaas/HTML-Parser-3.60/lib/HTML/Entities.pm -- I tried encode_entities($input, "\xA0-\xFF") (and the decimal equivalent as well), but no luck. If I simply call encode_entities with no second argument, it encodes the < and > of the tags (obviously), and that's bad for all the HTML.

Option 2 seems like a more reasonable solution than my hack but I don't know how to modify it for my purposes. Sorry for the long post but in prepping this I didn't want to be guilty of the XY problem. Have a good eve.

Thanks.

update: A bit more on the architecture at work here, as I begin to try out some of the solutions. The system processes web-posted and email-posted messages (it has been doing so basically the same way since 2001); I didn't write it and have only a cursory understanding of how it works. Messages get posted into a flat-file database system: for each message, a Perl file is created to hold the text. There is some minimal processing of the characters before they are stored.

From there an interface provides access to each file when called. It does some minimal processing. It attempts to keep most of the formatting from the original message as these are collections of stories so the text formatting can be vital to presentation. Sometimes it doesn't work so well because the website's templates are black-backgrounded and the vast majority of those processed emails were on white-backgrounds. Nevertheless, usually just changing black to white is all that is necessary.

The system has been effective for 8 years, but recently more and more garbage characters are ending up in the final product. It's ugly, distracting and detracts from the content. In other words, the bane of my meager no-pay web programming existence.

Replies are listed 'Best First'.
Re: Removing Unsafe Characters
by almut (Canon) on Apr 28, 2009 at 07:12 UTC

    Something like this seems to work with both character/unicode strings and legacy ISO-Latin-1 input:

    use HTML::Entities;

    my $input = "abc < ä > <p> Ö & ü xyz ";   # ISO-Latin-1
    my $encoded;

    $encoded = encode_entities($input, "\xA0-\x{FFFD}");
    print "$encoded\n";

    # now upgrade $input to character string (utf8)
    # (by appending some unicode characters)
    $input .= "\x{5555} \x{8888}";

    $encoded = encode_entities($input, "\xA0-\x{FFFD}");
    print "$encoded\n";

    which would print:

    abc < &auml; > <p> &Ouml; & &uuml; xyz
    abc < &auml; > <p> &Ouml; & &uuml; xyz &#x5555; &#x8888;

    (...at least for characters up to \x{FFFD})

    Hint: this works because, internally, HTML::Entities simply turns this into the regex substitution:

    s/([\xA0-\x{FFFD}])/$char2entity{$1} || num_entity($1)/ge;

    Update: the character class could in principle also be extended to cover the supplementary planes (i.e. characters beyond \x{FFFF}), which would then be "\xA0-\x{FFFD}\x{10000}-\x{10FFFD}"   (IIRC)

    Update 2: Note that this would properly encode unicode characters as the corresponding HTML entities. Whether the browser then has the appropriate fonts to render those characters correctly, is another matter (but these days, browsers are able to render quite a lot of unicode characters properly, even with the default configuration).  Also note that the result this achieves is different from simply sending UTF-8 encoded pages to the browser without declaring them as such (which would produce garbage...).

    In case you'd rather want to convert any byte value with the high bit set (80-FF) into its ISO-Latin-1 entity representation (which I think is what you wanted to do originally), you'd have to make sure that Perl always treats your input strings as bytes (i.e. utf8 flag off) — but that would be a suboptimal solution, IMO, as you'd misrepresent unicode characters (which are still recognized as such in your input) as sequences of inappropriate characters from the ISO-Latin-1 range...
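    A core-modules-only sketch of that byte-vs-character pitfall, using a plain substitution in place of encode_entities (the sample string is mine, not from the thread):

```perl
use Encode qw(decode);

my $bytes = "r\xC3\xA9sum\xC3\xA9";   # the UTF-8 bytes for "résumé", utf8 flag off

# Treated as bytes, each half of the two-byte é is "entified" separately:
(my $as_bytes = $bytes) =~ s/([\x80-\xFF])/sprintf('&#%d;', ord $1)/ge;
print "$as_bytes\n";   # r&#195;&#169;sum&#195;&#169; -- two bogus Latin-1 entities per é

# Decoded first, the substitution sees one character per é:
my $chars = decode('UTF-8', $bytes);
(my $as_chars = $chars) =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
print "$as_chars\n";   # r&#233;sum&#233; -- the real character
```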

    Update 3: (last one, promised :)  It seems a complementary/exclusion character class (using ^) works as well, e.g.

    $encoded = encode_entities($input, "^\x20-\x7E"); # do not encode printable ASCII chars

    That way you wouldn't need to worry about what the correct positive set is...  (This is undocumented, though, so no guarantees!)

      Part 1: $encoded = encode_entities($input, "\xA0-\x{FFFD}"); -- Sadly it didn't work.

      I then began to try to investigate the actual encoding used for the files. Maybe if I can figure out that, then I can figure out how to properly convert them.

      I don't have File::MMagic, as suggested at "How do I determine encoding format of a file?", but I do have Encode::Guess. I got that running and immediately got an "Unknown encoding" error exactly at the place where I have a garbage character. When running Encode::Guess on the data as a string (instead of an array), I got "No appropriate encodings found!"

      I focused in on this character; maybe it could give some clues to my problem. I used the ord() function to isolate it. Two characters return junk, and their decimal equivalents are 226 and 128. The 226 is valid but 128 isn't. On top of all that, I'm positive that the user's intended character was a hyphen.
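      A hedged guess at what those bytes are: 226 (0xE2) and 128 (0x80) are the first two bytes of a three-byte UTF-8 sequence, and 0xE2 0x80 0x93 happens to be U+2013, the en dash -- which would explain the intended "hyphen":

```perl
use Encode qw(decode);

# Speculative reconstruction: decode the three bytes of a UTF-8 en dash.
my $char = decode('UTF-8', "\xE2\x80\x93");
printf "U+%04X\n", ord $char;   # prints U+2013
```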

      I feel even more lost than when I started. None of the solutions provided work properly, I either get more junk characters or I get valid characters that shouldn't be there at all.

      I think I'll give up on this question and try to chase down what the character encoding is on these files. The problem is I have 40,000+ files; how many different encodings could there be? (I'm guessing a few.)
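      One possible per-file heuristic (a sketch under the assumption that each file is either valid UTF-8 or legacy single-byte text; cp1252 is the fallback because it covers ISO-8859-1's printable range plus the common Word punctuation):

```perl
use Encode qw(decode);

# Try strict UTF-8 first; if the bytes aren't valid UTF-8, fall back to
# Windows-1252. decode() with Encode::FB_CROAK dies on malformed input,
# which is what lets the eval detect non-UTF-8 files.
sub decode_best {
    my ($bytes) = @_;
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return defined $text ? $text : decode('cp1252', $bytes);
}

printf "U+%04X\n", ord decode_best("\xE2\x80\x93");   # valid UTF-8 -> U+2013 (en dash)
printf "U+%04X\n", ord decode_best("\x93");           # not UTF-8   -> U+201C (cp1252 quote)
```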

Re: Removing Unsafe Characters
by ikegami (Patriarch) on Apr 28, 2009 at 06:42 UTC

    Removes all Unicode characters:

    my $output = '';
    while ($input =~ /(.)/sg) {
        my $ch  = $1;
        my $ord = ord($ch);
        $output .= $ch
            if $ord >= 0xD800 && $ord <= 0xDFFF
            || $ord >= 0xFDD0 && $ord <= 0xFDEF
            || ($ord & 0xFFFF) == 0xFFFE
            || ($ord & 0xFFFF) == 0xFFFF
            || $ord >= 0x110000;
    }

    :)

      Thanks for the suggestion, though it took "This is a test message."

      and created... "isisaesmessage" ;)

        Oops! Added missing parens. I hate &'s precedence.
Re: Removing Unsafe Characters
by CountZero (Bishop) on Apr 28, 2009 at 08:25 UTC
    Did you have a look at Text::Unidecode? It is not perfect (accented characters come out as their non-accented form), but it does a good job with the really exotic ones.

    From its docs:

    What Text::Unidecode provides is a function, unidecode(...) that takes Unicode data and tries to represent it in US-ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F). The representation is almost always an attempt at transliteration -- i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      It is not perfect (accented characters come out as their non-accented form)

      Easily fixed

      $text =~ s/(\P{Latin}+)/unidecode("$1")/ge;
Re: Removing Unsafe Characters
by Anonymous Monk on Apr 28, 2009 at 06:49 UTC
    You need to make sure your files are properly encoded (eg utf8) and the document charset matches.

    BTW, there are no unsafe characters in html, but literal < ' & " > do need to be encoded when required :)

      Thanks. In this case the files are actually emails that have been parsed over many years from many different ISPs. I doubt there is any uniformity in their original encodings (nor are any email headers maintained in the files, only the email bodies and some other relevant data) and I don't have the technical knowledge on how best to deal with such a situation. That said, I'll review Perl encodings in the morning.

      As far as encoding the literal < ' & " > characters goes, I can only rely on the mail providers to have done that properly to begin with, or the situation is hopeless (i.e. I can't easily guess whether a given < is intended as an HTML start delimiter, an email quoting convention, or just someone pointing).

      update: Well, it seems this is the can of worms I feared to open. I admit it is all very much above my head in terms of technical understanding. This wouldn't be a major issue if I were paid to work on this problem but I am a tinkerer. I just don't understand perl and encodings enough to fully grasp the problem, let alone the solution.

      The server does return a UTF-8 charset, which, after googling what character set Perl encodes in, seems to be Unicode UTF-8. This may well be a problem I cannot tackle effectively, but hopefully some of the solutions here will work. Thanks.
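      For what it's worth, a minimal sketch of keeping the declared charset and the actual bytes in agreement (the header wording assumes a plain CGI-style response; adapt to however the script is actually served):

```perl
use Encode qw(encode);

my $body = "<p>resum\x{E9}</p>\n";   # a character string containing é
my $page = "Content-Type: text/html; charset=UTF-8\r\n\r\n" . $body;

# Encode once, at the output boundary, so the bytes match the declared charset.
my $octets = encode('UTF-8', $page);
print $octets;   # the é goes out as the UTF-8 bytes 0xC3 0xA9
```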

Re: Removing Unsafe Characters
by StommePoes (Scribe) on Apr 29, 2009 at 13:57 UTC
    If this content is all really old and comes from everywhere, you likely have more to worry about than just UTF-8 vs Latin-1... I run into Windows-1252 stuff sometimes, and the problem there, as I understand it, is that while it often has the same characters as ISO-8859-1, it also has many that are just some MS version of a character. How many people typed something in Word and then sent it as an email or pasted it into a web site?

      Once you decode UTF-8, iso-latin-1 and cp1252, you end up with Unicode characters, so that doesn't change the problem:

      • Determining which Unicode characters can be represented by most browser/computer setups, and
      • determining what to do with those that can't.

      Yes, you might get undecodable text if you receive something that's in the wrong encoding. And yes, a different character than the intended one might be displayed. But that's an entirely different problem than the one the OP asked about.

Node Type: perlquestion [id://760515]
Approved by planetscape
Front-paged by targetsmart