|Keep It Simple, Stupid|
I've been thinking about grabbing the 1st "bunch" of characters, and determining if they're within the printable range, however this method is not fool proof.
heh... Printable in what language(s), using what character encoding(s)?
Does uuencoded or base64 encoded data count as "printable", or as "binary"?
I'll confess to being clueless about the details of TCP, and ask a presumably stupid question: how would the re-assembly of packets need to differ, based on whether or not you consider the content to be "binary"?
Doing a more extensive analysis using dipthongs/whitespace/vowels etc i think would be too slow.
If you're talking about deciding whether or not the content would qualify as "human-readable text", again, I'm hindered by ignorance of TCP (what's the typical packet size?) -- and I'd have to repeat the earlier questions (which human language(s)? which character encoding(s)?) -- but modeling readable text, in terms of the relative probabilities of occurrence for individual characters, would not be very hard, and could be quite robust with test strings of as few as 32 characters (the more, the better, of course).
Essentially, you "train" one model on some suitably large set of known human-readable text (just 10K words would probably do), consisting of the probabilities for each printable character; then train another model on a (preferably larger) set of data known to contain little or no readable text (or maybe just assume equal probabilities for all printable byte values).
For a given stream of input data to be classified, if it contains non-printables, it's probably not text and you're probably done; but if it contains only printable characters (e.g. could be base64 encoded), compute the relative proporions of occurrence over the set of printable characters, and measure the error between these proportions and each of the two models. If the error relative to the human-readable model is significantly lower, the input is human readable. (Unless of course it's spam, which is often tailored to match the unigram character statistics of a language, without regard to readability...)
If you are worried about speed, though, you'd be better off doing it in C rather than Perl.
(update: In case the question about character encoding didn't make this clear: the modeling of human-readable text would need to be limited to a training set that was homogeneous, at least with respect to character encoding. If your "human" training data includes a mix of UTF16, UTF8, GB2312, Big5, ShiftJIS, etc, it's going to end up not that different from the "binary" model. And if we're talking about any flavor of unicode, you also need to limit yourself to a given language (or group of closely related languages) -- for one thing, the definition of 'what is printable' varies widely...)