PerlMonks  

Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

by choroba (Canon)
on May 23, 2013 at 06:03 UTC (#1034874)


in reply to Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

What exactly is your question?

Maybe this can help you: When dealing with strange UTF-8 documents, I often use the following bash script.

#! /bin/bash
## Lists all nonASCII UTF-8 characters contained in the data, for each
## character it gives the number of occurrences in each file and an
## example.
## Author: E. Choroba

export LC_ALL=C
codes=()
for code in c{{0..9},{a..f}} d{{0..9},{a..f}} ; do
    codes+=($(eval grep -ho "$'\x$code'". "$@" | sort -u))
done
for code in e{{0..9},{a..f}} ; do
    codes+=($(eval grep -ho "$'\x$code'".. "$@" | sort -u))
done
for code in f{0..4} ; do
    codes+=($(eval grep -ho "$'\x$code'"... "$@" | sort -u))
done

for code in "${codes[@]}" ; do
    hexdump <<< "$code" | sed '2d;s=000a==;s= 0a=='
done \
| cut -f2 -d' ' \
| sed '/^....$/s=\(..\)\(..\)=\\x\2\\x\1=; /^......$/s=\(..\)\(..\)\(..\)=\\x\2\\x\1\\x\3=' \
| while read -r code ; do
    echo $code
    eval grep -c "$'$code'" "$@"
    eval grep -m1 --color=always "$'$code'" "$@"
done
I know it is ugly, but it works: it lists all the non-ASCII characters in the given files with counts and examples. It depends on the hexdump utility whose documentation says the following:
The hexdump command is part of the util-linux package and is available from Linux Kernel Archive ⟨ftp://ftp.kernel.org/pub/linux/utils/util-linux/⟩.
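To make the core trick easier to see, here is a minimal sketch of only the first loop of the script: finding the distinct two-byte sequences. The function name list_two_byte_chars is made up for this illustration and is not part of the script above. It reads standard input instead of file arguments, and it uses one byte range in place of the per-byte loop. The pattern is built with printf octal escapes (\300-\337 is 0xC0-0xDF) so the sketch does not depend on bash's $'...' quoting.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): print each distinct
# two-byte UTF-8 sequence found on standard input, one per line.
list_two_byte_chars() {
    # Bytes 0xC0-0xDF (octal 300-337) announce a two-byte sequence.
    # Under LC_ALL=C the trailing '.' matches exactly one more byte.
    pattern=$(printf '[\300-\337].')
    # -a keeps grep from treating the input as binary; -o prints only the match.
    LC_ALL=C grep -ao "$pattern" | LC_ALL=C sort -u
}

# Example input: "café naïve café" written with octal escapes for é and ï.
printf 'caf\303\251 na\303\257ve caf\303\251\n' | list_two_byte_chars
```

The three- and four-byte loops in the script work the same way, with lead bytes e0-ef followed by two dots and f0-f4 followed by three.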
لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ


Re^2: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by taint (Chaplain) on May 23, 2013 at 06:21 UTC
    Greetings choroba, and thank you for your reply.
    My question is:
    If I have a mass of (textual) files that have mixed encoding(s) && line-endings,
    aside from ICONV(1) | FILE(1), how can I unify them -- convert them all to
    UTF-8 | UTF8?
    Given what I do know about Perl, I should be able to slurp them, process
    the contents, and spit them out as "unified", i.e. UTF8 text files, all having the same line-endings.
    Given that the files I'd be slurping are of mixed "types", is there any way to process
    them so they all end up the same "type" when they're done?

    I hope I was clearer || more concise this time. :)

    Thanks again, for your response.

    --chris

    #!/usr/bin/perl -Tw
    use perl::always;
    my $perl_version = "5.12.4";
    print $perl_version;
