
Re: Distinguishing text from binary data

by DrHyde (Prior)
on Oct 05, 2004 at 13:14 UTC ( #396566=note )

in reply to Distinguishing text from binary data

In general you need to read the whole string one character at a time; if you come across something that doesn't make sense in your character encoding, it's binary. Otherwise it's text. In the case of ASCII text, the characters that don't make sense are most of the control characters (0x00 - 0x1F) and everything from 0x7F (DEL) through 0xFF. The following control characters are usually considered OK in text data:
  • 0x09 - tab
  • 0x0A - line feed
  • 0x0C - form feed
  • 0x0D - carriage return
A regex-ish way of detecting non-ASCII data based on that might be:
print( ($text =~ /[^\x09\x0a\x0c\x0d\x20-\x7e]/) ? "binary\n" : "text\n" );
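For reuse, the same character-class test can be wrapped in a small helper. This is only a sketch: the name is_ascii_text and the sample strings are made up for illustration, and it assumes the data is a byte string already held in memory, with plain ASCII as the only encoding that counts as text.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the post above):
# true when $data holds only printable ASCII (0x20-0x7E) plus tab, line feed,
# form feed and carriage return. An empty string counts as text.
sub is_ascii_text {
    my ($data) = @_;
    return $data !~ /[^\x09\x0a\x0c\x0d\x20-\x7e]/;
}

# Made-up samples: plain text with tab and CRLF, a Latin-1 high byte, an embedded NUL.
for my $sample ("hello\tworld\r\n", "caf\xe9", "abc\x00def") {
    my $label = is_ascii_text($sample) ? "text" : "binary";
    print "$label\n";
}
# prints: text, binary, binary (one per line)
```

Note that under this rule any byte above 0x7E makes the data "binary", so Latin-1 or UTF-8 text with accented characters is rejected too; if that matters, the character class has to be widened for the encoding you actually expect.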

Re^2: Distinguishing text from binary data
by ww (Archbishop) on Oct 05, 2004 at 14:22 UTC

    Further to DrHyde's offering: though he did not make it explicit, his approach offers a good first step toward protecting yourself against embedded malware.

    Obvious? Maybe. Maybe that's already why you're checking the input. Or maybe the HTTP response is coming from a machine you control and thus trust.

    But unless you're really, really POSITIVE the incoming data is always going to be clean, you do want to consider the obvious... very early in the game.
