PerlMonks  

find out file charset and encoding?

by DreamT (Pilgrim)
on Aug 06, 2011 at 10:43 UTC ( [id://918915]=perlquestion )

DreamT has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,


I want to make sure that a file is encoded in latin1 and has Unix line endings. How do I do it?

In the long run I want to be able to understand the encoding caveats, so I would be grateful if anyone can explain them to me or point me to the right information. My main concern right now is to solve the issue above.

Replies are listed 'Best First'.
Re: find out file charset and encoding?
by GrandFather (Saint) on Aug 06, 2011 at 11:35 UTC

    Strictly speaking, you can't prove that a file is encoded in any particular way; you can only show that it isn't. However, a text file will most likely reveal the nature of its line endings (assuming they are consistent) within a modest number of characters, say a few hundred. *nix uses the line feed character (\n) as its line ending. The other common line endings are Windows (carriage return, line feed: \r\n) and classic Mac OS (carriage return: \r). Note that for files opened with default processing Perl translates the OS-specific line ending sequence into the single character represented by \n, so \n may be used as the line end character across platforms. Thus, to determine the actual line ending sequence used by a file, it may be necessary to use binmode to ensure no line end translation takes place.
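
    As a rough illustration of the binmode approach described above, here is a minimal sketch that samples the start of a file as raw bytes and counts each kind of line ending. The helper names count_line_endings and sample_file and the 4096-byte sample size are assumptions made for this example, not anything from a module.

```perl
use strict;
use warnings;

# Hypothetical helper: count each kind of line ending in a raw byte string.
sub count_line_endings {
    my ($bytes) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    # Try CRLF first so a pair is not counted as a lone CR plus a lone LF.
    while ($bytes =~ /(\x0D\x0A|\x0A|\x0D)/g) {
        my $eol = $1;
        if    ($eol eq "\x0D\x0A") { $count{crlf}++ }
        elsif ($eol eq "\x0A")     { $count{lf}++ }
        else                       { $count{cr}++ }
    }
    return \%count;
}

# Hypothetical helper: read up to $limit bytes with no line-end translation.
# A CRLF pair split by the sample boundary would show up as a lone CR.
sub sample_file {
    my ($path, $limit) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                       # raw bytes, no CRLF translation
    my $buf = '';
    my $got = read($fh, $buf, $limit);
    die "Cannot read $path: $!" unless defined $got;
    close $fh;
    return $buf;
}

if (@ARGV) {
    my $c = count_line_endings(sample_file($ARGV[0], 4096));
    printf "LF: %d  CRLF: %d  CR: %d\n", @{$c}{qw(lf crlf cr)};
    print "Unix line endings only\n" if $c->{lf} && !$c->{crlf} && !$c->{cr};
}
```

    The pattern uses \x0D and \x0A rather than \r and \n so that it means the same bytes on every platform.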

    Confirming that you have latin1 (by which you probably mean ISO/IEC 8859-1) is much harder, and probably requires that the file contain some suitable foreign language text that you can check against an appropriate dictionary.

    However, it may be that all you require is to check that the file's contents are not inconsistent with a particular character encoding. It may help to take a look at the ISO/IEC 8859-1 code table.
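
    One way to read "not inconsistent with latin1" is sketched below. It relies on two heuristics that are assumptions of this example rather than anything stated above: bytes 0x80-0x9F are C1 control codes in ISO-8859-1 and rarely occur in real text (they often indicate Windows-1252), and a file with high-bit bytes that decodes cleanly as strict UTF-8 is more likely UTF-8. The function name latin1_objections is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: list reasons why $bytes is unlikely to be latin1 text.
# An empty list means "nothing contradicts latin1", not "this is latin1".
sub latin1_objections {
    my ($bytes) = @_;
    my @why;

    # C1 control codes in ISO-8859-1; printable characters in Windows-1252.
    push @why, 'contains bytes in the 0x80-0x9F range'
        if $bytes =~ /[\x80-\x9F]/;

    # High-bit bytes that form valid strict UTF-8 point to UTF-8 instead.
    if ($bytes =~ /[\x80-\xFF]/) {
        my $copy = $bytes;   # decode may modify its argument when CHECK is set
        my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
        push @why, 'is valid multi-byte UTF-8' if $ok;
    }
    return @why;
}
```

    Pure ASCII input raises no objection, since ASCII is a subset of latin1, UTF-8 and Windows-1252 alike.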

    True laziness is hard work
Re: find out file charset and encoding?
by Khen1950fx (Canon) on Aug 06, 2011 at 16:21 UTC
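    Take a look at piconv. It's a very useful tool. For example, to resolve the alias "latin1":
    piconv -r latin1
    It'll return the canonical name iso-8859-1. To convert a file to latin1:
    piconv -t iso-8859-1 $file
    It'll print the iso-8859-1 version of the file to STDOUT. Note that when -f is omitted piconv assumes the input is in your current locale's encoding, so this converts the file rather than checking how it is already encoded.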
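
    piconv is built on the Encode module, so the same steps can be sketched directly in Perl. The example below assumes you already know the source encoding (UTF-8 in the usage note); the function name to_latin1_unix is hypothetical, and the line-ending normalisation is an addition to cover the Unix line endings part of the question.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: convert raw bytes in encoding $from to ISO-8859-1
# with Unix line endings. Dies if the input is malformed or contains
# characters that latin1 cannot represent.
sub to_latin1_unix {
    my ($bytes, $from) = @_;
    my $text = Encode::decode($from, $bytes, Encode::FB_CROAK());
    $text =~ s/\x0D\x0A?/\x0A/g;    # CRLF or lone CR becomes LF
    return Encode::encode('iso-8859-1', $text, Encode::FB_CROAK());
}

# Same lookup as "piconv -r latin1": print the canonical name of the alias.
print Encode::resolve_alias('latin1'), "\n";
```

    For example, to_latin1_unix($raw, 'UTF-8') on bytes read with binmode returns bytes that can be written back to a filehandle that is also in binmode.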
Re: find out file charset and encoding?
by cormanaz (Deacon) on Aug 06, 2011 at 19:46 UTC

Node Type: perlquestion [id://918915]
Approved by GrandFather
Front-paged by toolic