PerlMonks  

Filtering out bad UTF8 chars

by FreakyGreenLeaky (Sexton)
on Oct 12, 2011 at 17:52 UTC (#931058=perlquestion)
FreakyGreenLeaky has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I'm trying to figure out how to filter/blank out bad UTF8 chars.

This particular snippet works perfectly to prevent croakage on bad UTF8 (ie, to identify bad UTF8 input prior to further processing):
use Encode qw(is_utf8);
# check if $str is UTF8 and contains bad UTF8.
print "bad UTF8\n" if is_utf8($str) and not is_utf8($str, 1);
Then, I may want to salvage what I can from $str (ie, filter/remove the bad UTF8 chars) by running it through iconv (which I read somewhere *may* remove bad UTF8 chars):
iconv -c --from UTF-8 --to UTF-8
However, that involves a slow shell call, so I tried Text::Iconv:
use Text::Iconv;
my $conv = Text::Iconv->new("utf8", "utf8");
$str = $conv->convert($str);
But that does not filter out bad UTF8 chars.

Is there a fast/efficient cpan module/way to strip out any bad UTF8 chars?

If not a selective filter, then as a last resort, is there a brute-force method of simply removing *all* UTF8 chars?

Thanks

Re: Filtering out bad UTF8 chars
by ikegami (Pope) on Oct 12, 2011 at 18:41 UTC

    decode already handles bad UTF-8.

    $ perl -MEncode -Mcharnames=:full -wE'
        $bad = "\xE9abc";
        say sprintf "U+%04X %s", ord, charnames::viacode(ord)
            for split //, decode("UTF-8", $bad);
    '
    U+FFFD REPLACEMENT CHARACTER
    U+0061 LATIN SMALL LETTER A
    U+0062 LATIN SMALL LETTER B
    U+0063 LATIN SMALL LETTER C

    It doesn't remove bad characters, but replaces them with U+FFFD. You could play with decode's third arg, or you could simply strip out the replacement character afterwards.

    s/\x{FFFD}//g;
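
A minimal sketch of the "third arg" route mentioned above, assuming the input is still raw bytes (not yet decoded). Encode documents that CHECK may be a code reference that is called for each malformed byte and whose return value is used in its place; returning an empty string drops the byte. The helper name strip_bad_utf8 is hypothetical, not something from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 bytes and drop malformed sequences
# instead of replacing them with U+FFFD.
sub strip_bad_utf8 {
    my ($bytes) = @_;
    # The code reference is the CHECK argument. It receives the value of
    # each malformed byte; returning '' removes that byte from the result.
    return decode('UTF-8', $bytes, sub { '' });
}

my $clean = strip_bad_utf8("\xE9abc");    # \xE9 is not valid UTF-8 here
print "kept ", length($clean), " characters\n";
```

Compared with stripping U+FFFD after the fact, this should leave alone any replacement characters that were legitimately present in the input.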

      Thanks for the reply, ikegami. I then get a "Cannot decode string with wide characters at /usr/lib64/perl5/Encode.pm line 174." error, presumably because the text is already decoded and I'm double-decoding (if I understand correctly).

      My problem is I have input from wildly varying sources (websites) with correspondingly wildly varying encodings...

      I think until I can find a way to handle these scenarios without crashing the backend, I'm not going to try to extract what can be extracted; I'll simply skip these damn files.

      Luckily they're in the extreme minority and as much as it irks me to do this, I'm flagging this #TODO for now.
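
One possible way to avoid the "Cannot decode string with wide characters" croak described above is to check for code points above 0xFF before calling decode, since such a string cannot be raw bytes. This is only a sketch under that assumption; the helper name ensure_decoded is hypothetical, and the check is a heuristic: an already-decoded string whose characters all fall at or below 0xFF would still be treated as bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return character data whether or not the input
# has already been decoded.
sub ensure_decoded {
    my ($in) = @_;
    # Code points above 0xFF cannot be raw bytes, so assume the string is
    # already decoded and skip decode(), which would otherwise croak.
    return $in if $in =~ /[^\x00-\xFF]/;
    # Otherwise treat the input as UTF-8 bytes; with no CHECK argument,
    # malformed sequences are replaced by U+FFFD.
    return decode('UTF-8', $in);
}

my $from_bytes = ensure_decoded("\xE2\x98\xBA");   # UTF-8 bytes for U+263A
my $from_chars = ensure_decoded("\x{263A}");       # already a character
print "same result\n" if $from_bytes eq $from_chars;
```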

        My problem is I have input from wildly varying sources (websites) with correspondingly wildly varying encodings...

        But you asked about bad UTF-8?! Sorry, I don't understand your question at all.
