I have had some success using letter-pair frequencies as a language identifier for OCR'd text. Letter-pair distribution is a much better metric than single-letter distribution. Pick out all of the two-letter pairs in your string, and compare the ten most frequent against the most frequent pairs you got from analyzing a training text (just remember to take the whitespace out of the training text first). If the overlap exceeds some similarity threshold, it is 'real' text; otherwise it is gibberish.
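A rough sketch of the pair-counting step, written in Python for brevity. The function name `top_pairs` is made up for illustration, and lowercasing plus dropping every non-letter (not just whitespace) is an assumption on top of what is described above.

```python
from collections import Counter

def top_pairs(text, n=10):
    """Return the n most frequent two-letter pairs in text, most common first."""
    # Assumption: lowercase and keep letters only; this also removes whitespace,
    # so pairs can span what used to be a word boundary.
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    # Overlapping pairs: "the" yields "th" and "he".
    counts = Counter(letters[i:i + 2] for i in range(len(letters) - 1))
    return [pair for pair, _ in counts.most_common(n)]
```

Run over a large training text, something like `top_pairs(training_text)` would give the reference list for a language; run over the string under test, it gives the list to compare against that reference.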
For example, my own values for most common letter pairs in English are these:
English => ['he','th','in','er','an','ou'],
I have found that a better than 50% match is a pretty reliable indicator of real text.
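One way the comparison might look, sketched in Python. The reference list is the one quoted above; everything else is an assumption made for illustration: the names `match_fraction` and `looks_like_english`, the choice to define "match" as the fraction of reference pairs that show up among the string's ten most frequent pairs, and the letters-only cleanup.

```python
from collections import Counter

# Reference pairs quoted from the post.
ENGLISH_PAIRS = ['he', 'th', 'in', 'er', 'an', 'ou']

def match_fraction(text, reference=ENGLISH_PAIRS, n=10):
    """Fraction of reference pairs found among the n most frequent pairs of text."""
    # Assumption: lowercase, letters only (this also strips whitespace).
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    counts = Counter(letters[i:i + 2] for i in range(len(letters) - 1))
    top = {pair for pair, _ in counts.most_common(n)}
    if not reference:
        return 0.0
    return sum(1 for pair in reference if pair in top) / len(reference)

def looks_like_english(text, threshold=0.5):
    """True when the match is better than the threshold (50% by default)."""
    return match_fraction(text) > threshold
```

Very short strings have too few pairs for the counts to mean much, so a check like this is probably best applied to whole lines or paragraphs rather than single words.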
Be aware that the path you are going down quickly leads to AI-complete problems in natural language processing. This is another one of those programming tasks that seems very easy until you try to do it on real data.