PerlMonks  

Re: How can I tell if a string contains binary data or plain-old text?

by dakkar (Hermit)
on Oct 31, 2003 at 14:43 UTC


in reply to How can I tell if a string contains binary data or plain-old text?

First of all: you can't have a "Unicode" file.

You can have a file containing Unicode code-points encoded in one of the transformation formats defined by the Unicode standard, such as UTF-8 or UTF-16.

So the question becomes:

I have a byte-stream. Is it a valid (ISO-8859-1|UTF-8|UTF-16)-encoded representation of some text?

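This can be answered for UTF-8 and UTF-16, since neither of them defines a meaning for each and every byte-sequence. (ISO-8859-1 decoders typically accept any byte, so for that encoding the check says very little.) But this is quite possibly not the answer you're looking for.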
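As a rough sketch of the "is it a valid encoded form" check (written in Python purely for illustration; the function name `is_valid_encoding` is made up for this example), a strict decode either succeeds or raises:

```python
def is_valid_encoding(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly under `encoding`.

    This only checks well-formedness; it says nothing about whether
    the decoded result is meaningful text.
    """
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_encoding(b"caf\xc3\xa9", "utf-8"))        # True: well-formed UTF-8
print(is_valid_encoding(b"\xff\xfe\xfd", "utf-8"))       # False: 0xFF never appears in UTF-8
print(is_valid_encoding(b"\xff\xfe\xfd", "iso-8859-1"))  # True: this codec accepts every byte
print(is_valid_encoding(b"\x00", "utf-16"))              # False: odd number of bytes
```

The third call is the catch: a byte stream can pass this test and still be junk.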

The way I see it, it's easier to check if your byte-stream contains something you know not to be text, using something like file(1) or File::MMagic as already suggested.
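To give an idea of what such a check does, here is a deliberately tiny stand-in (Python, for illustration only). The table `SIGNATURES` and the function `known_binary_type` are hypothetical names for this sketch; real tools like file(1) consult a much larger magic database and apply many more tests.

```python
# A handful of well-known leading byte signatures ("magic numbers").
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\x1f\x8b": "gzip data",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x7fELF": "ELF executable",
}


def known_binary_type(data: bytes):
    """Return a label if `data` starts with a known signature, else None."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return None


print(known_binary_type(b"\x1f\x8b\x08\x00"))   # gzip data
print(known_binary_type(b"just some words"))    # None
```

A `None` here only means "not one of the formats I know about", not "this is text".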

Doing it the other way ("is it a valid encoded form") gives you a lot of "this is text" when, in fact, it is nothing intelligible.

You could try to decode it and then apply some heuristics to see if it looks like text (e.g. a lot of letters from the same script/writing system in a row, or something of the sort), but I think it's more trouble than it's worth.
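For the curious, a minimal sketch of that heuristic might look like the following (Python, for illustration). Everything here is an assumption of the example: the function names, the threshold `min_run=4`, and above all the trick of taking the first word of the Unicode character name as a stand-in for the script, which is only an approximation of the real Script property.

```python
import unicodedata


def script_of(ch: str):
    """Rough script label for a letter (first word of its Unicode name), else None."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def longest_same_script_run(text: str) -> int:
    """Length of the longest run of consecutive letters sharing one script label."""
    best = run = 0
    prev = None
    for ch in text:
        script = script_of(ch)
        if script is not None and script == prev:
            run += 1
        elif script is not None:
            run = 1   # a letter, but from a different script than the previous one
        else:
            run = 0   # not a letter at all
        prev = script
        best = max(best, run)
    return best


def looks_like_text(data: bytes, encoding: str = "utf-8", min_run: int = 4) -> bool:
    """Decode strictly, then require at least one run of `min_run` same-script letters."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return longest_same_script_run(text) >= min_run


print(looks_like_text(b"hello world"))           # True
print(looks_like_text(b"\x01\x7f\x02A\x03"))     # False: valid UTF-8, but no run of letters
```

Even this toy version needs a threshold picked out of thin air, which is rather the point about the trouble involved.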

-- 
        dakkar - Mobilis in mobile

Most of my code is tested...

Perl is strongly typed, it just has very few types (Dan)

