PerlMonks  

Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

by choroba (Abbot)
on May 23, 2013 at 06:03 UTC (#1034874)


in reply to Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?

What exactly is your question?

Maybe this can help you: When dealing with strange UTF-8 documents, I often use the following bash script.

#! /bin/bash
## Lists all non-ASCII UTF-8 characters contained in the data, for each
## character it gives the number of occurrences in each file and an
## example.
## Author: E. Choroba

export LC_ALL=C
codes=()
for code in c{{0..9},{a..f}} d{{0..9},{a..f}} ; do
    codes+=($(eval grep -ho "$'\x$code'". "$@" | sort -u))
done

for code in e{{0..9},{a..f}} ; do
    codes+=($(eval grep -ho "$'\x$code'".. "$@" | sort -u))
done

for code in f{0..4} ; do
    codes+=($(eval grep -ho "$'\x$code'"... "$@" | sort -u))
done

for code in "${codes[@]}" ; do
    hexdump <<< "$code" | sed '2d;s=000a==;s= 0a=='
done \
| cut -f2 -d' ' \
| sed '/^....$/s=\(..\)\(..\)=\\x\2\\x\1=; /^......$/s=\(..\)\(..\)\(..\)=\\x\2\\x\1\\x\3=' \
| while read -r code ; do
    echo $code
    eval grep -c "$'$code'" "$@"
    eval grep -m1 --color=always "$'$code'" "$@"
done

I know it is ugly, but it works: it lists all the non-ASCII characters in the given files with counts and examples. It depends on the hexdump utility whose documentation says the following:
The hexdump command is part of the util-linux package and is available from Linux Kernel Archive ⟨ftp://ftp.kernel.org/pub/linux/utils/util-linux/⟩.
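
The core idea of the script (under LC_ALL=C, match a UTF-8 lead byte followed by the right number of continuation bytes) can also be sketched without hexdump. This is a simplified variant and not the script above: it reports only a total count per character, with no per-file counts and no example lines. The function name list_nonascii is made up for illustration, and GNU grep is assumed.

```shell
#!/bin/bash
# Hypothetical helper: count each non-ASCII UTF-8 character in the given files.
# The body runs in a subshell, so the variable "pat" does not leak to the caller.
list_nonascii() (
    # Lead byte ranges followed by 1, 2 or 3 continuation bytes (0x80-0xBF),
    # written as octal escapes so that printf emits the raw bytes.
    pat=$(printf '[\302-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\364][\200-\277]{3}')
    # LC_ALL=C makes the tools work on bytes rather than characters.
    # -a treats the input as text, -o prints each match on its own line,
    # -h drops file names, -E enables the {n} repetition syntax.
    LC_ALL=C grep -a -o -h -E -e "$pat" -- "$@" | LC_ALL=C sort | LC_ALL=C uniq -c
)
```

Called as `list_nonascii *.txt`, it prints one line per distinct character, prefixed by the number of times it occurs across all the files.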
لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ


Re^2: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by taint (Hermit) on May 23, 2013 at 06:21 UTC
    Greetings choroba, and thank you for your reply.
    My question is:
    If I have a mass of (textual) files, that have mixed encoding(s) && line-endings,
    aside from ICONV(1) | FILE(1), how can I unify them -- convert them all to
    UTF-8 | UTF8?
    Given what I do know about Perl, I should be able to slurp them, process
    the contents, and spit them out as "unified" -- that is, UTF-8 text files, all having the same line endings.
    Given that the files I'd be slurping are of mixed "types", is there any way to process
    them so they all end up the same "type" when they're done?

    I hope I was clearer || more concise this time. :)

    Thanks again, for your response.

    --chris

    #!/usr/bin/perl -Tw
    use perl::always;
    my $perl_version = "5.12.4";
    print $perl_version;
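
Part of the task described above can be sketched in shell, in the same spirit as the bash script earlier in the thread: decide which files already look like UTF-8, and bring line endings to a single form. This is only a rough triage sketch under stated assumptions. The function names looks_like_utf8 and normalize_eol are hypothetical, GNU grep is assumed, and the check is loose (it does not reject overlong forms or surrogates, and a legacy-encoded file can occasionally pass by accident). Files that fail the check still need a real conversion step that knows the source encoding, for example Perl's Encode module.

```shell
#!/bin/bash
# Hypothetical helper: succeed if every line of the file looks like UTF-8.
# Loose check only: overlong forms and surrogates are not rejected.
looks_like_utf8() (
    # One "unit" is an ASCII byte (except NUL) or a lead byte followed by
    # the matching number of continuation bytes (0x80-0xBF).
    unit=$(printf '[\001-\177]|[\302-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\364][\200-\277]{3}')
    # grep -v -x selects lines that are NOT made up entirely of such units;
    # -q turns "found such a line" into exit status 0, which is negated here.
    ! LC_ALL=C grep -a -q -v -x -E -e "($unit)*" -- "$1"
)

# Hypothetical helper: read stdin, write stdout with LF-only line endings.
# CRLF becomes LF first, then any remaining lone CR becomes LF.
# A CR-only input may come out without a final newline.
normalize_eol() (
    cr=$(printf '\r')
    LC_ALL=C sed "s/${cr}\$//" | tr '\r' '\n'
)

# Usage sketch (file names are placeholders):
#   for f in *.txt; do
#       if looks_like_utf8 "$f"; then
#           normalize_eol < "$f" > "$f.new"
#       else
#           echo "needs conversion: $f" >&2
#       fi
#   done
```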
