
Re: Distinguishing text from binary data

by DrHyde (Prior)
on Oct 05, 2004 at 13:14 UTC (#396566)

in reply to Distinguishing text from binary data

In general, you need to read the whole string one character at a time. If you come across something that doesn't make sense in your character encoding, it's binary; otherwise it's text. For ASCII text, the characters that don't make sense are most of the control characters (0x00-0x1F) and everything from 0x7F to 0xFF. The following control characters are usually considered OK in text data:
  • 0x09 - tab
  • 0x0A - line feed
  • 0x0C - form feed
  • 0x0D - carriage return
A regex-ish way of detecting non-ASCII data based on that might be:
# Parenthesise the whole ternary: "print (...) ? a : b" parses as "(print(...)) ? a : b"
print( ($text =~ /[^\x09\x0a\x0c\x0d\x20-\x7e]/) ? "binary\n" : "text\n" );
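If you want something reusable, here is a minimal sketch of the same check wrapped in subs. The names looks_like_ascii_text and file_looks_like_text are made up for illustration, and the 4096-byte chunk size is an arbitrary assumption. It still only answers "is this plain ASCII?", so UTF-8 or Latin-1 text will be reported as binary.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): true if $data holds only
# printable ASCII (0x20-0x7E) plus tab, line feed, form feed and carriage return.
sub looks_like_ascii_text {
    my ($data) = @_;
    return $data !~ /[^\x09\x0a\x0c\x0d\x20-\x7e]/;
}

# Hypothetical helper: apply the same test to a whole file, chunk by chunk,
# stopping at the first chunk that contains a disallowed byte.
sub file_looks_like_text {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $buf;
    while (read($fh, $buf, 4096)) {    # chunk size is an arbitrary choice
        return 0 unless looks_like_ascii_text($buf);
    }
    close $fh;
    return 1;
}
```

Usage would be along the lines of: print looks_like_ascii_text($text) ? "text\n" : "binary\n"; Note that an empty string counts as text here, since it contains nothing that fails the test.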
Re^2: Distinguishing text from binary data
by ww (Bishop) on Oct 05, 2004 at 14:22 UTC

    Further to DrHyde's offering: though he did not make it explicit, his approach offers a good first step toward protecting yourself against embedded malware.

    Obvious? Maybe. Maybe that's already why you're checking the input. Or maybe the HTTP response is coming from a machine you control and thus trust.

    But unless you're really, really POSITIVE that the incoming data is always going to be clean, you do want to consider the obvious... very early in the game.
