Kirsle has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have an interesting dilemma on my hands.

I'm trying to find a way to determine whether an arbitrary blob of data is text or binary. I used to have an old "is_binary()" function that just looks for bytes outside the 7-bit ASCII range (values above 127), but that doesn't work when the string contains Unicode characters, because UTF-8 encodes every non-ASCII character using bytes above 127.
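To illustrate why the byte check misfires, here is a minimal sketch (the sample string is hypothetical, not taken from my real code) that encodes a short string as UTF-8 and lists its bytes. Every byte belonging to the non-ASCII character lands above 127:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: the katakana letter U+30D6 followed by ASCII "a".
my $text  = "\x{30D6}a";
my $bytes = encode("UTF-8", $text);

# Unpack the encoded string into its individual byte values.
my @values = unpack("C*", $bytes);

# Count the bytes that a "greater than 127" check would flag as binary.
my $high = grep { $_ > 127 } @values;

printf "%d bytes, %d above 127: %s\n",
    scalar(@values), $high,
    join(" ", map { sprintf "0x%02X", $_ } @values);
# prints "4 bytes, 3 above 127: 0xE3 0x83 0x96 0x61"
```

So perfectly ordinary text trips the old check as soon as it contains a single non-ASCII character.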

    sub is_binary {
        my $content = shift;
        return 0 if !defined($content);

        # Treat the data as binary if any byte falls outside 7-bit ASCII.
        foreach my $byte (unpack("C*", $content)) {
            return 1 if $byte > 127;
        }
        return 0;
    }

Here's a script I'm using for testing, to try to figure out a way to detect whether data is UTF-8 or just random binary:

    #!/usr/bin/perl -w

    use strict;
    use warnings;
    use utf8;

    use lib "www/siikir/cms/src";
    use Siikir::Util;

    binmode(STDOUT, ":utf8");

    # Valid UTF-8 strings
    my @valid = @{ Siikir::Util::utf8_decode([
        "hello world",
        "Hello!\nWorld!",
        "My favorite pokemon is ブラッキー",
        "No, エーフィ is better than ブラッキー!",
        "ミュウツー ミュウツー",
    ])};

    # Create some invalid strings.
    my @invalid = (
        scalar(`cat /usr/bin/vim`),
        scalar(`cat /usr/share/pixmaps/xchat.png`),
        # join() builds the byte string; map in scalar context only returns a count.
        join("", map { chr(hex($_)) } qw/0xFF 0x4C 0x3D 0x10 0x27 0x78 0xED/),
    );
    chomp(@invalid);

    print "Testing valid strings...\n";
    foreach my $v (@valid) {
        my $pass = is_binary($v);
        print "Str: $v (pass: $pass)\n";
    }

    print "Testing invalid strings...\n";
    foreach my $i (@invalid) {
        my $pass = is_binary($i);
        print "Pass: $pass\n";
    }

    sub is_binary {
        my $data = shift;

        # Valid UTF-8? Fail: gives a pass to everything.
        # use Test::utf8;
        # if (is_valid_string($data)) {
        #     return "true";
        # }
        # return "false";

        # Sane UTF-8? Fail: gives a pass to a PNG image
        # use Test::utf8;
        # if (is_sane_utf8($data)) {
        #     return "true";
        # }
        # return "false";

        # Valid UTF-8?
        if (utf8::is_utf8($data)) {
            return "true";
        }
        return "false";
    }

(The utf8_decode function used in the script can be found in another of my PerlMonks posts, "JSON, UTF-8 and Filehandles".)

The only reliable method I found was to rely on the is_utf8 flag, which assumes that most valid strings throughout the code have already been properly decoded and therefore have the UTF-8 flag set.
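A small sketch of why that assumption matters (the sample bytes are hypothetical): the flag reports how Perl stores the string internally, not whether the bytes form well-formed UTF-8. Raw UTF-8 bytes that have not been decoded yet come back with the flag off:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for U+30D6, as they would arrive from a file or socket
# that was read without any decoding layer.
my $raw = "\xE3\x83\x96";

# The same data after decoding into a Perl character string.
my $decoded = decode("UTF-8", $raw);

# The flag is off for the raw bytes even though they are well-formed UTF-8,
# and on for the decoded string.
print "raw:     ", (utf8::is_utf8($raw)     ? "flag on" : "flag off"), "\n";
print "decoded: ", (utf8::is_utf8($decoded) ? "flag on" : "flag off"), "\n";
```

In other words, the flag check would call undecoded text "binary", so it only works if every text string has passed through a decode step first.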

Is there a better way?