Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
I've been thinking about grabbing the 1st "bunch" of characters, and determining if they're within the printable range, however this method is not fool proof.

heh... Printable in what language(s), using what character encoding(s)?

Does uuencoded or base64 encoded data count as "printable", or as "binary"?

I'll confess to being clueless about the details of TCP, and ask a presumably stupid question: how would the re-assembly of packets need to differ, based on whether or not you consider the content to be "binary"?

Doing a more extensive analysis using dipthongs/whitespace/vowels etc i think would be too slow.

If you're talking about deciding whether or not the content would qualify as "human-readable text", again, I'm hindered by ignorance of TCP (what's the typical packet size?) -- and I'd have to repeat the earlier questions (which human language(s)? which character encoding(s)?) -- but modeling readable text, in terms of the relative probabilities of occurrence for individual characters, would not be very hard, and could be quite robust with test strings of as few as 32 characters (the more, the better, of course).

Essentially, you "train" one model on some suitably large set of known human-readable text (just 10K words would probably do), consisting of the probabilities for each printable character; then train another model on a (preferably larger) set of data known to contain little or no readable text (or maybe just assume equal probabilities for all printable byte values).

For a given stream of input data to be classified, if it contains non-printables, it's probably not text and you're probably done; but if it contains only printable characters (e.g. could be base64 encoded), compute the relative proporions of occurrence over the set of printable characters, and measure the error between these proportions and each of the two models. If the error relative to the human-readable model is significantly lower, the input is human readable. (Unless of course it's spam, which is often tailored to match the unigram character statistics of a language, without regard to readability...)

If you are worried about speed, though, you'd be better off doing it in C rather than Perl.

(update: In case the question about character encoding didn't make this clear: the modeling of human-readable text would need to be limited to a training set that was homogeneous, at least with respect to character encoding. If your "human" training data includes a mix of UTF16, UTF8, GB2312, Big5, ShiftJIS, etc, it's going to end up not that different from the "binary" model. And if we're talking about any flavor of unicode, you also need to limit yourself to a given language (or group of closely related languages) -- for one thing, the definition of 'what is printable' varies widely...)

In reply to Re: Sniffing binary data, heuristics? by graff
in thread Sniffing binary data, heuristics? by Ryszard

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others taking refuge in the Monastery: (4)
    As of 2014-09-19 22:04 GMT
    Find Nodes?
      Voting Booth?

      How do you remember the number of days in each month?

      Results (147 votes), past polls