
Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

by choroba (Chancellor)
on May 23, 2013 at 06:03 UTC (#1034874)

in reply to Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

What exactly is your question?

Maybe this can help you: When dealing with strange UTF-8 documents, I often use the following bash script.

#! /bin/bash
## Lists all nonASCII UTF-8 characters contained in the data, for each
## character it gives the number of occurrences in each file and an
## example.
## Author: E. Choroba

export LC_ALL=C

codes=()
for code in c{{0..9},{a..f}} d{{0..9},{a..f}} ; do
    codes+=($(eval grep -ho "$'\x$code'". "$@" | sort -u))
done

for code in e{{0..9},{a..f}} ; do
    codes+=($(eval grep -ho "$'\x$code'".. "$@" | sort -u))
done

for code in f{0..4} ; do
    codes+=($(eval grep -ho "$'\x$code'"... "$@" | sort -u))
done

for code in "${codes[@]}" ; do
    hexdump <<< "$code" | sed '2d;s=000a==;s= 0a=='
done \
| cut -f2 -d' ' \
| sed '/^....$/s=\(..\)\(..\)=\\x\2\\x\1=; /^......$/s=\(..\)\(..\)\(..\)=\\x\2\\x\1\\x\3=' \
| while read -r code ; do
    echo $code
    eval grep -c "$'$code'" "$@"
    eval grep -m1 --color=always "$'$code'" "$@"
done

I know it is ugly, but it works: it lists all the non-ASCII characters in the given files with counts and examples. It depends on the hexdump utility whose documentation says the following:
The hexdump command is part of the util-linux package and is available from Linux Kernel Archive.
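
If hexdump is not available, the byte-to-escape step can probably be done with od instead. The script pipes each character through hexdump and sed because hexdump's default output groups bytes into 16-bit words in host byte order, so on a little-endian machine the sed expressions have to swap the bytes back; as far as I can tell, they only cover the 2- and 3-byte cases. The following is a minimal sketch, not part of the original script: the function name to_hex_escapes is made up for illustration, and it assumes a POSIX od, tr and sed.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): print the bytes of
# the first argument as a \xNN escape string, one escape per byte.
to_hex_escapes() {
    printf '%s' "$1" \
        | od -An -v -tx1 \
        | tr -d ' \n' \
        | sed 's/../\\x&/g'
}

# U+00E9 is encoded in UTF-8 as the bytes 0xC3 0xA9 (octal 303 251).
to_hex_escapes "$(printf '\303\251')"; echo    # prints \xc3\xa9
```

Because od -tx1 prints one byte at a time in input order, no byte swapping is needed and 4-byte sequences come out the same way as shorter ones.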
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Replies are listed 'Best First'.
Re^2: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by taint (Chaplain) on May 23, 2013 at 06:21 UTC
    Greetings choroba, and thank you for your reply.
    My question is:
    If I have a mass of (textual) files that have mixed encoding(s) && line-endings,
    aside from ICONV(1) | FILE(1), how can I unify them -- convert them all to
    UTF-8 | UTF8?
    Given what I do know about Perl, I should be able to slurp them, process
    the contents, and spit them out as "unified" -- that is, UTF8 text files, all having the same line-endings.
    Given that the files I'd be slurping are of mixed "types", is there any way to process
    them so they all end up the same "type" when they're done?

    I hope I was clearer || more concise this time. :)

    Thanks again, for your response.


    #!/usr/bin/perl -Tw
    use perl::always;
    my $perl_version = "5.12.4";
    print $perl_version;
