http://www.perlmonks.org?node_id=396524

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to tell binary from text data (like the Unix file program). Right now I'm using
sub is_text {
    my $c  = shift;
    my $cl = length($c);
    return 0 if ($cl == 0);
    my $t = $c;
    $t =~ tr/a-zA-Z0-9//cs;
    return ($cl - length($t)) < 100;
}
which obviously has a lot of room for improvement. Is there a CPAN module for this kind of thing?

Replies are listed 'Best First'.
Re: Distinguishing text from binary data
by DrHyde (Prior) on Oct 05, 2004 at 13:14 UTC
    In general you need to read the whole string one character at a time and if you come across something that doesn't make sense in your character encoding, it's binary. Otherwise it's text. In the case of ASCII text, the characters that don't make sense are most of the control characters and 0x7F - 0xFF. The following control characters are usually considered OK in text data:
    • 0x09 - tab
    • 0x0A - line feed
    • 0x0C - form feed
    • 0x0D - carriage return
    A regex-ish way of detecting non-ASCII data based on that might be:
    print( ($text =~ /[^\x09\x0a\x0c\x0d\x20-\x7e]/) ? "binary\n" : "text\n" );
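
The character class above can be wrapped in a small sub that works directly on a scalar, such as an HTTP response body. This is only a sketch: the name is_ascii_text is made up for illustration, and treating an empty string as text is an assumption (the code in the original question returns 0 for empty input).

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $data contains only tab, line feed,
# form feed, carriage return, or printable ASCII (0x20-0x7E), else 0.
# Assumption: an empty string counts as text.
sub is_ascii_text {
    my ($data) = @_;
    return $data !~ /[^\x09\x0a\x0c\x0d\x20-\x7e]/ ? 1 : 0;
}

for my $sample ("Hello,\tworld\r\n", "GIF89a\x01\x00\xff") {
    my $verdict = is_ascii_text($sample) ? "text" : "binary";
    print "$verdict\n";
}
```

Note that a single byte outside the allowed set is enough to classify the whole string as binary, so any non-ASCII text (Latin-1, UTF-8 and so on) fails this test.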

      Further to DrHyde's offering: though he did not make it explicit, his approach offers a good first step for protecting yourself against embedded malware.

      Obvious? Maybe. Maybe that's already why you're checking the input. Or maybe the HTTP response is coming from a machine you control and thus trust.

      But unless you're really, really POSITIVE that the incoming data is always going to be clean, you really do want to consider the obvious, very early in the game.

Re: Distinguishing text from binary data
by dave_the_m (Monsignor) on Oct 05, 2004 at 09:26 UTC
    Of course, if the text is still in a file, you can just do -T $filename

    Dave.
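
For data that is not already in a file, the -T suggestion could be tried by spooling the data to a temporary file first. The following is a minimal sketch, assuming the core File::Temp module; the name is_text_via_file is hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $data to a temporary file and apply -T to it.
# -T examines roughly the first block of the file; per perlfunc, a zero
# byte in that portion makes the file count as binary, and an empty file
# is true for both -T and -B.
sub is_text_via_file {
    my ($data) = @_;
    my ($fh, $filename) = tempfile(UNLINK => 1);   # removed at program exit
    binmode $fh;                                   # write the bytes untouched
    print {$fh} $data;
    close $fh or die "close failed: $!";
    my $is_text = -T $filename;
    return $is_text ? 1 : 0;
}

print is_text_via_file("Hello, world\n") ? "text\n" : "binary\n";
```

This costs a write to disk per check, which is the waste mentioned in the reply below.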

      No it isn't a file - it's a body of an HTTP response. I suppose I can write it to a file, although it does seem like a waste...
        No it isn't a file - it's a body of an HTTP response
        In that case, is examining the Content-type: header in the response sufficient?

        Dave.
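
If the Content-type: header is trusted, the check can be done on the header value alone. A rough sketch follows; the sub name content_type_is_text and the list of media types treated as textual are assumptions, not anything defined by a standard or a module.

```perl
use strict;
use warnings;

# Hypothetical helper: guess from a Content-Type header value whether the
# body is textual. Which types count as "textual" is an assumption here.
sub content_type_is_text {
    my ($header) = @_;
    return 0 unless defined $header;

    # Keep only the media type, dropping parameters such as charset.
    my ($type) = $header =~ m{^\s*([^;\s]+)};
    return 0 unless defined $type;
    $type = lc $type;

    return 1 if $type =~ m{^text/};
    return 1 if $type =~ m{^application/(?:json|xml|javascript)$};
    return 1 if $type =~ m{\+(?:json|xml)$};    # e.g. application/xhtml+xml
    return 0;
}

print content_type_is_text("text/html; charset=ISO-8859-1") ? "text\n" : "binary\n";
```

The header can be missing or wrong, so this only tells you what the server claims to have sent.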

        Those tests work on filehandles too, so if you have a socket, that should work.

        perldoc -f -X says it reads the current buffer for FHs, and the first block for files.

        -nuffin
        zz zZ Z Z #!perl
Re: Distinguishing text from binary data
by Happy-the-monk (Canon) on Oct 05, 2004 at 13:25 UTC

    Is there a CPAN module for this kind of thing?

    There are numerous. Some of them even work on scalars, not only filehandles. Have a look at File::MimeInfo::Magic and File::LibMagic for a start.

    For most similar tasks, though not for your actual one, you may find File::MMagic most useful.

    Cheers, Sören

Re: Distinguishing text from binary data
by inman (Curate) on Oct 05, 2004 at 10:41 UTC
    Your code is a little restrictive, as it treats linefeeds, whitespace, punctuation etc. as non-characters and then decides that something is text if there are fewer than 100 of them. Try changing your code to work on ranges of the ASCII table and then use a percentage as your test.
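
One way the range-plus-percentage idea might look is sketched below. The sub names, the choice of "acceptable" ranges (tab, LF, FF, CR and 0x20-0x7E) and the 10% cutoff are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of bytes outside the acceptable ranges.
sub suspicious_ratio {
    my ($data) = @_;
    my $len = length $data;
    return 0 if $len == 0;
    # With /c and an empty replacement list, tr counts the characters
    # NOT in the list and leaves $data unchanged.
    my $odd = ($data =~ tr/\x09\x0a\x0c\x0d\x20-\x7e//c);
    return $odd / $len;
}

# Hypothetical helper: text if at most $threshold of the bytes are odd.
sub is_mostly_text {
    my ($data, $threshold) = @_;
    $threshold = 0.10 unless defined $threshold;   # assumed cutoff
    return 0 if length($data) == 0;                # as in the original code
    return suspicious_ratio($data) <= $threshold ? 1 : 0;
}

print is_mostly_text("The quick brown fox.\n") ? "text\n" : "binary\n";
```

Because the test is a ratio, the verdict no longer depends on how long the input is, unlike the fixed limit of 100.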
      Also, don't forget about the non-English encodings in which form data can be sent (English-speaking coders often forget about this :-) ). IMO, the presence of 0x00..0x1F bytes in data such as an HTTP response can mark it as binary (unless the form is sent in UTF-8). So maybe you should take the charset from the Content-Type header into consideration, and only then analyze the byte/character stream.
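
A charset-aware check along those lines might look like the sketch below, which uses the core Encode module. The name is_text_in_charset, the fallback to ISO-8859-1 when no charset is given, and the set of control characters treated as suspicious are assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: decode $bytes with the charset named in a
# Content-Type header value, then look for control characters.
# Assumption: fall back to ISO-8859-1 when no charset is given.
sub is_text_in_charset {
    my ($bytes, $content_type) = @_;
    $content_type = '' unless defined $content_type;

    my ($charset) = $content_type =~ /charset\s*=\s*"?([^\s";]+)/i;
    $charset = 'ISO-8859-1' unless defined $charset;
    return 0 unless find_encoding($charset);    # unknown charset: give up

    # FB_CROAK makes decode die on input that is invalid in the charset.
    my $copy = $bytes;    # decode with a CHECK value may modify its argument
    my $text = eval { decode($charset, $copy, Encode::FB_CROAK()) };
    return 0 unless defined $text;

    # Control characters other than tab, LF, FF and CR suggest binary data.
    return $text =~ /[\x00-\x08\x0b\x0e-\x1f\x7f]/ ? 0 : 1;
}

print is_text_in_charset("caf\xc3\xa9\n", "text/plain; charset=utf-8")
    ? "text\n" : "binary\n";
```

Decoding first means that multi-byte sequences are judged as characters rather than as individual high-bit bytes.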
Re: Distinguishing text from binary data
by data64 (Chaplain) on Oct 06, 2004 at 16:52 UTC