Hello monks, I have an interesting dilemma on my hands.

I'm trying to find a way to determine whether an arbitrary blob of data is text or binary. I used to have an old "is_binary()" function that just looks for bytes outside the 7-bit ASCII range (0-127), but that doesn't work when the string contains Unicode characters, because their UTF-8 encoding uses bytes above 127.

    sub is_binary {
        my $content = shift;
        if (!defined($content)) {
            return 0;
        }

        # Treat the data as binary if any byte is above 127.
        my @char = unpack("C*", $content);
        foreach my $byte (@char) {
            if ($byte > 127) {
                return 1;
            }
        }
        return 0;
    }
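The failure described above can be reproduced with a short sketch. It assumes the core Encode module is available; `has_high_byte` is a hypothetical stand-in for the old check, and "café" is just an illustrative string, not one from my real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for the old check: any byte above 127 counts as binary.
sub has_high_byte {
    my ($content) = @_;
    return 0 unless defined $content;
    return $content =~ /[^\x00-\x7F]/ ? 1 : 0;
}

# "caf\x{e9}" is the 4-character string "café"; encoded as UTF-8,
# the e-acute becomes the two bytes 0xC3 0xA9, both above 127.
my $bytes = encode("UTF-8", "caf\x{e9}");

print join(" ", map { sprintf "%02X", $_ } unpack("C*", $bytes)), "\n";  # 63 61 66 C3 A9
print has_high_byte("hello world"), "\n";                               # 0
print has_high_byte($bytes), "\n";                                      # 1
```

So perfectly ordinary UTF-8 text gets reported as binary, which is the problem I'm trying to get around.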

Here's a script I'm using for testing, to try to figure out a way to detect whether data is UTF-8 or just random binary:

    #!/usr/bin/perl -w

    use strict;
    use warnings;
    use utf8;

    use lib "www/siikir/cms/src";
    use Siikir::Util;

    binmode(STDOUT, ":utf8");

    # Valid UTF-8 strings
    my @valid = @{ Siikir::Util::utf8_decode([
        "hello world",
        "Hello!\nWorld!",
        "My favorite pokemon is ブラッキー",
        "No, エーフィ is better than ブラッキー!",
        "ミュウツー ミュウツー",
    ])};

    # Create some invalid strings.
    my @invalid = (
        scalar(`cat /usr/bin/vim`),
        scalar(`cat /usr/share/pixmaps/xchat.png`),
        # Join the bytes into one string; scalar(map ...) would only give the element count.
        join("", map { chr(hex($_)) } qw/0xFF 0x4C 0x3D 0x10 0x27 0x78 0xED/),
    );
    chomp(@invalid);

    print "Testing valid strings...\n";
    foreach my $v (@valid) {
        my $pass = is_binary($v);
        print "Str: $v (pass: $pass)\n";
    }

    print "Testing invalid strings...\n";
    foreach my $i (@invalid) {
        my $pass = is_binary($i);
        print "Pass: $pass\n";
    }

    sub is_binary {
        my $data = shift;

        # Valid UTF-8? Fail: gives a pass to everything.
        # use Test::utf8;
        # if (is_valid_string($data)) {
        #     return "true";
        # }
        # return "false";

        # Sane UTF-8? Fail: gives a pass to a PNG image
        # use Test::utf8;
        # if (is_sane_utf8($data)) {
        #     return "true";
        # }
        # return "false";

        # Valid UTF-8?
        if (utf8::is_utf8($data)) {
            return "true";
        }
        return "false";
    }

(The utf8_decode function from Siikir::Util can be found in another of my PerlMonks posts, JSON, UTF-8 and Filehandles.)

The only reliable method I've found so far is to rely on the is_utf8 flag, on the assumption that most valid strings throughout the code have already been properly decoded and therefore have the UTF-8 flag set.
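To make that assumption concrete, here is a minimal sketch of what the flag looks like before and after decoding. It uses the core Encode module rather than my own utf8_decode helper, and the three bytes are the UTF-8 encoding of U+30D6 (the ブ in the test strings above). As far as I understand it, the flag only describes how Perl stores the string internally, so this shows the assumption itself rather than a real validity check.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they would come from a file or socket: UTF-8 for U+30D6.
my $raw = "\xE3\x83\x96";

# Decoding turns the three bytes into a single character.
my $decoded = decode("UTF-8", $raw);

# utf8::is_utf8 is always available; it does not need "use utf8".
my $raw_flag     = utf8::is_utf8($raw)     ? 1 : 0;
my $decoded_flag = utf8::is_utf8($decoded) ? 1 : 0;

print "raw:     flag=$raw_flag length=", length($raw), "\n";          # flag=0 length=3
print "decoded: flag=$decoded_flag length=", length($decoded), "\n";  # flag=1 length=1
```

The same text gives a different answer depending on whether it has been decoded yet, which is why the check only works if every valid string in the code went through a decode step first.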

Is there a better way?

In reply to Can't tell if UTF-8... or just binary... by Kirsle
