PerlMonks  

Can't tell if UTF-8... or just binary...

by Kirsle (Pilgrim)
on Aug 23, 2011 at 18:35 UTC (#921962=perlquestion)

Kirsle has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have an interesting dilemma on my hands.

I'm trying to find a way to determine whether some arbitrary blob of data is text or just binary. I used to have an old "is_binary()" method, which just looks for bytes that fall outside the 7-bit ASCII range (values above 127), but that doesn't work when the string contains Unicode characters, because their bytes fall outside the ASCII range too.

sub is_binary {
    my $content = shift;
    return 0 unless defined $content;

    # Any byte above 127 is outside 7-bit ASCII, so call the data binary.
    foreach my $byte (unpack("C*", $content)) {
        return 1 if $byte > 127;
    }
    return 0;
}

Here's a script I'm using for testing, to try to figure out a way to detect whether data is UTF-8 or just random binary:

#!/usr/bin/perl -w

use strict;
use warnings;
use utf8;

use lib "www/siikir/cms/src";
use Siikir::Util;

binmode(STDOUT, ":utf8");

# Valid UTF-8 strings
my @valid = @{ Siikir::Util::utf8_decode([
    "hello world",
    "Hello!\nWorld!",
    "My favorite pokemon is ブラッキー",
    "No, エーフィ is better than ブラッキー!",
    "ミュウツー ミュウツー",
])};

# Create some invalid strings.
my @invalid = (
    scalar(`cat /usr/bin/vim`),
    scalar(`cat /usr/share/pixmaps/xchat.png`),
    # join, not scalar: scalar(map ...) would return the element count
    join("", map { chr(hex($_)) } qw/0xFF 0x4C 0x3D 0x10 0x27 0x78 0xED/),
);
chomp(@invalid);

print "Testing valid strings...\n";
foreach my $v (@valid) {
    my $pass = is_binary($v);
    print "Str: $v (pass: $pass)\n";
}

print "Testing invalid strings...\n";
foreach my $i (@invalid) {
    my $pass = is_binary($i);
    print "Pass: $pass\n";
}

sub is_binary {
    my $data = shift;

    # # Valid UTF-8? Fail: gives a pass to everything.
    # use Test::utf8;
    # if (is_valid_string($data)) {
    #     return "true";
    # }
    # return "false";

    # Sane UTF-8? Fail: gives a pass to a PNG image
    # use Test::utf8;
    # if (is_sane_utf8($data)) {
    #     return "true";
    # }
    # return "false";

    # Valid UTF-8?
    if (utf8::is_utf8($data)) {
        return "true";
    }
    return "false";
}

(The utf8_decode function can be found on another one of my perlmonks posts, JSON, UTF-8 and Filehandles).

The only reliable method I've found is to rely on the is_utf8 flag (and on the assumption that most valid strings throughout the code have been properly decoded, so they have the UTF-8 flag set).
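As a small illustration of what that flag does and does not report, here is a sketch using only core functions (utf8::is_utf8, utf8::upgrade) and Encode's decode. The variable names are made up for the example. The flag describes how Perl stores the string internally, so a text string whose characters are all below 256 can have it off, and changing the storage flips it without changing the string.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE3\x83\x96";            # UTF-8 encoding of U+30D6, still raw bytes
my $text  = decode('UTF-8', $bytes);   # decoded: one character, U+30D6
my $latin = "\xC9ric";                 # "Éric" as text, every character below 256

printf "bytes: %d\n", utf8::is_utf8($bytes) ? 1 : 0;   # 0
printf "text:  %d\n", utf8::is_utf8($text)  ? 1 : 0;   # 1
printf "latin: %d\n", utf8::is_utf8($latin) ? 1 : 0;   # 0, although it is text

# Changing the internal storage flips the flag without changing the string.
my $copy = $latin;
utf8::upgrade($copy);
printf "same string: %d, flag: %d\n",
    ($copy eq $latin ? 1 : 0), (utf8::is_utf8($copy) ? 1 : 0);   # 1, 1
```

So the flag only works as a text marker if every string in the program is decoded on input, which is the assumption stated above.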

Is there a better way?

Replies are listed 'Best First'.
Re: Can't tell if UTF-8... or just binary...
by ikegami (Pope) on Aug 23, 2011 at 19:30 UTC
    This makes no sense:
    use utf8;
    utf8_decode("My favorite pokemon is ブラッキー")
    

    «use utf8;» will decode, and then you try to decode again?

    Contrary to what you say, @valid does not contain valid UTF-8 strings, since you remove the UTF-8 encoding.


    I'm trying to find a way to determine whether some arbitrary blob of data is text or just binary.

    If there are any characters above 255, then it's surely text. Beyond that, it's impossible to determine.

    $ perl -MDevel::Peek -e'$_="\x{00C9}ric"; Dump($_);'
      FLAGS = (POK,pPOK)
      PV = 0x826d060 "\311ric"\0       # Text (My name)

    $ perl -MDevel::Peek -e'$_=pack("N", 0xC9726963); Dump($_);'
      FLAGS = (POK,pPOK)
      PV = 0xa16b2d8 "\311ric"\0       # Binary

    (Irrelevant output removed for brevity and clarity.)

    figure out a way to detect whether data is UTF-8 or just random binary:

    If there are any characters above 127, just try to decode it. If you're successful, it's surely UTF-8.

    use Encode qw( decode );
    my $is_utf8 = eval { decode('UTF-8', $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC); 1 };

    The chances of getting a false positive are extremely slim. However, if all the characters are below 128, it's impossible to determine.

    $ perl -MDevel::Peek -e'$_="ABCD"; Dump($_);'
      FLAGS = (POK,pPOK)
      PV = 0x81bb060 "ABCD"\0          # Text

    $ perl -MDevel::Peek -e'$_=pack("N", 0x41424344); Dump($_);'
      FLAGS = (POK,pPOK)
      PV = 0x9e0a2c8 "ABCD"\0          # Binary
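The rules in the reply above can be put together roughly like this. It is a minimal sketch, not a definitive implementation: has_wide_char and classify_bytes are hypothetical names, and the three return values are just labels chosen for the example. The decode call is the one shown in the reply.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the string holds any character above 255.
# Such a string cannot be raw bytes, so it must be (decoded) text.
sub has_wide_char {
    my ($str) = @_;
    return $str =~ /[^\x00-\xFF]/ ? 1 : 0;
}

# Hypothetical helper: classify a byte string.
#   'ascii' - every byte is below 128, text and binary are indistinguishable
#   'utf8'  - has high bytes and decodes cleanly as UTF-8
#   'other' - has high bytes that are not valid UTF-8
sub classify_bytes {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;
    my $ok = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes intact.
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 'utf8' : 'other';
}

print has_wide_char("\x{30D6}\x{30E9}"), "\n";                # 1
print has_wide_char("\x{C9}ric"), "\n";                       # 0
print classify_bytes("ABCD"), "\n";                           # ascii
print classify_bytes("\xE3\x83\x96"), "\n";                   # utf8
print classify_bytes("\xFF\x4C\x3D\x10\x27\x78\xED"), "\n";   # other
```

A result of 'utf8' is only "almost certainly UTF-8": random binary can happen to be well-formed, which is the slim false-positive chance mentioned above.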
Re: Can't tell if UTF-8... or just binary...
by zentara (Archbishop) on Aug 23, 2011 at 18:46 UTC
    Maybe Encode::Guess or Encode::Detect can help.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Encode;
    use Encode::Guess;

    # Assumption for this snippet: the data to examine arrives on STDIN.
    my $content = do { local $/; <STDIN> };
    my $decoder = guess_encoding($content);
    print "UTF-8" if ref($decoder) eq 'Encode::utf8';
    __END__

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Encode::Detect::Detector;

    my $octets = "\x{4f60}\x{597d}\x{4e16}\x{754c}";
    my $charset = Encode::Detect::Detector::detect($octets);
    print "$charset\n";

    $octets = "\x82\xb7\x82\xb2\x82\xa2\x82\xcc\x82\xdd\x82\xc2";
    $charset = Encode::Detect::Detector::detect($octets);
    print "$charset\n";

    $octets = "\x{805a}\x{5408}\x{6216}\x{8be6}\x{7ec6}";
    $charset = Encode::Detect::Detector::detect($octets);
    print "$charset\n";

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: Can't tell if UTF-8... or just binary...
by bart (Canon) on Aug 23, 2011 at 21:29 UTC
    I used to have a text editor that determined whether a file was text or binary based on whether it contains null bytes ("\0"). It works extremely well in practice, since virtually all binary strings contain null bytes.

    It'll work as well with Unicode text, at least if it's UTF-8. 16-bit (and 32-bit) Unicode text, on the other hand, contains a lot of null bytes: typically every other byte for 16-bit, and 3 out of every 4 bytes for 32-bit, so it would be taken for binary.
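The null-byte heuristic described above fits in a few lines. This is only a sketch, and looks_binary is a hypothetical name. The last call shows the UTF-16 caveat: ASCII text stored as UTF-16LE has a null in every other byte.

```perl
use strict;
use warnings;

# Hypothetical helper: treat data as binary if it contains a NUL byte.
sub looks_binary {
    my ($data) = @_;
    return index($data, "\0") >= 0 ? 1 : 0;
}

print looks_binary("Hello!\nWorld!"), "\n";                   # 0
print looks_binary("\xE3\x83\x96"), "\n";                     # 0 (UTF-8 text)
print looks_binary("\x89PNG\r\n\x1A\n\0\0\0\rIHDR"), "\n";    # 1 (start of a PNG file)
print looks_binary("h\0i\0"), "\n";                           # 1 ("hi" as UTF-16LE)
```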

Node Type: perlquestion [id://921962]
Approved by Corion