Distinguishing text from binary data

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to tell binary from text data (like the Unix file program). Right now I'm using

sub is_text
{
    my $c = shift;
    my $cl = length($c);
    return 0 if ($cl == 0);

    my $t = $c;
    $t =~ tr/a-zA-Z0-9//cs;
    return ($cl - length($t)) < 100;
}

which obviously has a lot of room for improvement. Is there a CPAN module for this kind of thing?

Re: Distinguishing text from binary data by DrHyde (Prior) on Oct 05, 2004 at 13:14 UTC
In general you need to read the whole string one character at a time, and if you come across something that doesn't make sense in your character encoding, it's binary. Otherwise it's text. In the case of ASCII text, the characters that don't make sense are most of the control characters and 0x7F - 0xFF. The following control characters are usually considered OK in text data:

0x09 - tab
0x0A - line feed
0x0C - form feed
0x0D - carriage return

A regex-ish way of detecting non-ASCII data based on that might be:

`print(($text =~ /[^\x09\x0a\x0c\x0d\x20-\x7e]/) ? "binary\n" : "text\n");`

Note the parentheses around the whole argument: written as `print (...) ? ... : ...;`, Perl parses the first parenthesised group as print's complete argument list and prints the match result instead of the label.
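The same check can be wrapped in a small subroutine. This is only a sketch: the name `is_ascii_text` is made up for illustration, and treating an empty string as "not text" simply follows the convention of the code in the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $data contains only printable ASCII
# plus tab, line feed, form feed and carriage return.
sub is_ascii_text {
    my ($data) = @_;
    # Empty or undefined input counts as "not text", as in the question's code.
    return 0 unless defined $data && length $data;
    # Any byte outside the allowed set marks the data as binary.
    return ($data !~ /[^\x09\x0a\x0c\x0d\x20-\x7e]/) ? 1 : 0;
}

for my $sample ("Hello, world\n", "GIF89a\x00\x01") {
    my $kind = is_ascii_text($sample) ? "text" : "binary";
    print "$kind\n";
}
```

This is strict: a single byte outside the allowed set is enough to call the data binary, so text in Latin-1 or UTF-8 with accented characters would be rejected.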
Re^2: Distinguishing text from binary data by ww (Archbishop) on Oct 05, 2004 at 14:22 UTC
Further to DrHyde's offering: though he did not make it explicit, his approach offers a good first step toward protecting yourself against embedded malware. Obvious? Maybe. Maybe that's already why you're checking the input. Or maybe the HTTP response is coming from a machine you control and thus trust. But unless you're really, really POSITIVE the incoming data is always going to be clean, you really do want to consider the obvious... very early in the game.
Re: Distinguishing text from binary data by dave_the_m (Monsignor) on Oct 05, 2004 at 09:26 UTC
Of course, if the text is still in a file, you can just do `-T $filename`

Dave.
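For data that is not yet in a file, one way to try the `-T` test is to write it to a temporary file first. The sketch below assumes the core File::Temp module; the helper name `looks_like_text_file` is hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $data to a temporary file and apply -T to it.
sub looks_like_text_file {
    my ($data) = @_;
    # UNLINK => 1 asks File::Temp to delete the file when the program exits.
    my ($fh, $filename) = tempfile(UNLINK => 1);
    binmode $fh;                     # write the bytes unchanged
    print {$fh} $data;
    close $fh or die "Cannot close $filename: $!";
    # -T examines the first block of the file and guesses text vs. binary.
    return (-T $filename) ? 1 : 0;
}

for my $sample ("Hello, world\n" x 10, "\x00\x01\x02\xff" x 10) {
    my $kind = looks_like_text_file($sample) ? "text" : "binary";
    print "$kind\n";
}
```

Keep in mind that `-T` is a heuristic over the start of the file only, and that an empty file satisfies both `-T` and `-B`.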
Re^2: Distinguishing text from binary data by Anonymous Monk on Oct 05, 2004 at 09:49 UTC
No it isn't a file - it's a body of an HTTP response. I suppose I can write it to a file, although it does seem like a waste...
Re^3: Distinguishing text from binary data by dave_the_m (Monsignor) on Oct 05, 2004 at 10:17 UTC
"No it isn't a file - it's a body of an HTTP response"

In that case, is examining the `Content-type:` header in the response sufficient?

Dave.
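A check based on the header alone might look like the following sketch. It works on the raw header value as a string, so it does not depend on any particular HTTP client; the function name and the list of textual `application/` types are assumptions and would need adjusting for a real application.

```perl
use strict;
use warnings;

# Hypothetical helper: decide from a Content-Type header value alone.
# The set of non-"text/" types treated as textual is an assumption.
sub content_type_is_text {
    my ($header) = @_;
    return 0 unless defined $header;
    # Keep only the media type, dropping parameters such as "; charset=utf-8".
    my ($type) = $header =~ m{^\s*([^;\s]+)};
    return 0 unless defined $type;
    $type = lc $type;                # media types are case-insensitive
    return 1 if $type =~ m{^text/};
    return 1 if $type =~ m{^application/(?:json|xml|javascript)$};
    return 1 if $type =~ m{\+(?:xml|json)$};   # e.g. application/xhtml+xml
    return 0;
}

for my $header ("text/html; charset=ISO-8859-1", "image/png") {
    my $kind = content_type_is_text($header) ? "text" : "binary";
    print "$header => $kind\n";
}
```

This trusts the server to label its responses correctly, which is exactly the limitation raised elsewhere in the thread.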
Re^4: Distinguishing text from binary data by gothic_mallard (Pilgrim) on Oct 05, 2004 at 11:52 UTC
Re^3: Distinguishing text from binary data by nothingmuch (Priest) on Oct 06, 2004 at 00:21 UTC
Those tests work on filehandles too, so if you have a socket, that should work. `perldoc -f -X` says it reads the current buffer for FHs, and the first block for files.

-nuffin
Re: Distinguishing text from binary data by Happy-the-monk (Canon) on Oct 05, 2004 at 13:25 UTC
"Is there a CPAN module for this kind of thing?"

There are numerous. Some of them even work on scalars, not only filehandles. Have a look at File::MimeInfo::Magic and File::LibMagic for a start. For most similar tasks, though not for your actual one, you may find File::MMagic most useful.

Cheers, Sören
Re: Distinguishing text from binary data by inman (Curate) on Oct 05, 2004 at 10:41 UTC
Your code is a little restrictive, as it treats linefeeds, whitespace, punctuation etc. as non-characters and then decides that something is text if there are fewer than 100 of them. Try changing your code to work on ranges of the ASCII table and then use a percentage as your test.
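One possible reading of that suggestion is sketched below. The allowed range (printable ASCII plus tab, LF, FF and CR) and the default threshold of 95% are assumptions chosen for illustration, not values given in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: true if at least $min_ratio of the bytes in $data
# fall into the "texty" ASCII ranges. The 0.95 default is an assumption.
sub is_mostly_text {
    my ($data, $min_ratio) = @_;
    $min_ratio = 0.95 unless defined $min_ratio;
    return 0 unless defined $data;
    my $len = length $data;
    return 0 if $len == 0;
    # tr/// with an empty replacement list and no /d only counts matches;
    # it leaves $data unchanged.
    my $texty = ($data =~ tr/\x09\x0a\x0c\x0d\x20-\x7e//);
    return ($texty / $len) >= $min_ratio ? 1 : 0;
}

for my $sample ("Hello, world!\n", ("\x00" x 50) . "abc") {
    my $kind = is_mostly_text($sample) ? "text" : "binary";
    print "$kind\n";
}
```

A ratio scales with the input size, so a long document with a handful of stray bytes still counts as text, while a short binary blob does not slip under a fixed count of 100.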
Re^2: Distinguishing text from binary data by maard (Pilgrim) on Oct 06, 2004 at 10:26 UTC
Also don't forget about non-English encodings in which form data can be sent (English-speaking coders often forget about it :-) ). IMO, the presence of 0x00..0x1F bytes in data such as an HTTP response can mark it as binary (unless the form is sent in UTF-8). So maybe you should take into consideration the charset from the Content-Type header and only then analyze the byte/character stream.
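A charset-aware variant could decode the bytes first and only then look for control characters. The sketch below uses the core Encode module; the function name, the fallback to ISO-8859-1 when no charset is given, and the exact set of rejected control characters are all assumptions.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $bytes using $charset, then look for
# control characters in the decoded text.
sub is_text_in_charset {
    my ($bytes, $charset) = @_;
    return 0 unless defined $bytes && length $bytes;
    # Assumption: fall back to ISO-8859-1 when no charset was supplied.
    $charset = 'ISO-8859-1' unless defined $charset && length $charset;

    # With FB_CROAK, decode() dies on malformed input; it also dies on an
    # unknown charset name. It may modify its argument, so pass a copy.
    my $copy = $bytes;
    my $text = eval { Encode::decode($charset, $copy, Encode::FB_CROAK) };
    return 0 unless defined $text;

    # Control characters other than tab, LF, FF and CR mark the data as binary.
    return ($text =~ /[\x00-\x08\x0b\x0e-\x1f\x7f]/) ? 0 : 1;
}

my @samples = (
    [ "caf\xc3\xa9\n", 'UTF-8' ],
    [ "\xff\xfe",      'UTF-8' ],
);
for my $sample (@samples) {
    my $kind = is_text_in_charset(@$sample) ? "text" : "binary";
    print "$kind\n";
}
```

Bytes that are not valid in the declared charset count as binary here, which is one way of applying the "doesn't make sense in your character encoding" rule from earlier in the thread.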
Re: Distinguishing text from binary data by data64 (Chaplain) on Oct 06, 2004 at 16:52 UTC
You could look at the Perl implementation of the Unix file utility. It is (or was) part of the PPT (Perl Power Tools) project.
