http://www.perlmonks.org?node_id=1034862

taint has asked for the wisdom of the Perl Monks concerning the following question:

Greetings everyone,
I'm trying to resurrect some old Perl/CGI scripts -- a Forum/Bulletin board. Problem is,
I lost my "clean" copy, and now I'm stuck
dealing with a copy that's been handled by who-knows-who, and with who-knows-what. So it's
been subjected to windows(office|word95|winword) || Macintosh(simple-text|some-other-mac-editor(s)) ||
who knows what else. As a result the files have probably been opened by an editor that added a BOM, then
saved as windows-1252, then opened and saved as UTF-8, then saved as ISO-8859-1, then ?? -- well,
you get the picture. I've tried running them through my handy dos2unix script to at least unify them
that much. I then ran the following:
#!/bin/sh
for i in $(find . -type f); do
    iconv -f ISO-8859-1 -t UTF-8 "$i" > "$i.tmp"
    rm "$i"
    mv "$i.tmp" "$i"
done
which, of course, assumes they're all ISO-8859-1 -- which they are not.
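Rather than guessing per file, one way to split the pile is to check which files already decode cleanly as UTF-8. A minimal sketch (not from the thread) using the core Encode module:

```perl
#!/usr/bin/perl
# Sketch: classify files by whether their bytes decode cleanly as
# UTF-8, which is more deterministic than file -i for this case.
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# True if the octet string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

for my $file (@ARGV) {
    open my $fh, '<:raw', $file or do { warn "$file: $!\n"; next };
    my $octets = do { local $/; <$fh> } // '';
    close $fh;
    print is_valid_utf8($octets) ? "valid UTF-8: $file\n"
                                 : "not UTF-8:   $file\n";
}
```

Anything reported as "not UTF-8" still needs a decision about its real encoding, but at least the already-clean files drop out of the problem.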
I ran them through another script I cobbled together utilizing file(1), e.g. file -i. That helped, but the results were still less-than-optimal. So when I finally felt I had managed to
unify them into a UTF-8 state, I began to edit them, only to discover later that there were
some little square boxes showing up in my editor. Closer examination showed they were
0x99 (hex), which is called "Single Graphic Character Introducer" -- not very helpful, to
me anyway. I decided it would have to be "Perl to the rescue", and set out to find a way
to parse these files and get more info (Perl is MUCH smarter than I am). I discovered the following:
#!/usr/bin/env perl
#
# unicount - count code points in input
# Tom Christiansen <tchrist@perl.com>

use v5.12;
use strict;
use sigtrap;
use warnings;
use charnames ();
use Carp qw(carp croak confess cluck);
use List::Util qw(max);
use Unicode::UCD qw(charinfo charblock);

sub fix_extension;
sub process_input (&);
sub set_encoding (*$);
sub yuck ($);

my $total = 0;
my %seen  = ();

# deep magic here
process_input {
    $total += length;
    $seen{$_}++ for split //;
};

my $dec_width = length($total);
my $hex_width = max(4, length sprintf("%x", max map { ord } keys %seen));

for (sort keys %seen) {
    my $count = $seen{$_};
    my $gcat  = charinfo(ord())->{category};
    my $name  = charnames::viacode(ord())
             || "<unnamed code point in @{[charblock(ord())]}>";
    printf "%*d U+%0*X GC=%2s %s\n",
        $dec_width => $count,
        $hex_width => ord(),
        $gcat      => $name;
}

exit;

##################################################

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

sub process_input(&) {
    my $function = shift();
    my $enc;

    if (@ARGV == 0 && -t STDIN && -t STDERR) {
        print STDERR "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:
    for my $file (@ARGV) {
        # don't let magic open make an output handle
        next if -e $file && ! -f _;
        my $quasi_filename = fix_extension($file);
        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;
        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle!
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            # could just do this a line at a time, but not if counting \R's
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };
        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        do {
            # much faster to alias than to copy
            local *_ = \$whole_file;
            &$function;
        };

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }
    } # foreach file
}

# Encoding set to (after unzipping):
#   if    file.pod                              => use whatever =encoding says
#   elsif file.ENCODING for legal encoding name => use that one
#   elsif file is binary                        => use bytes
#   else                                        => use utf8
#
# Note that gzipped stuff always shows up as bytes this way, but
# its internal unzipped bytes are still counted after unzipping
#
sub set_encoding(*$) {
    my ($handle, $path) = @_;
    my $enc_name = (-f $path && -B $path) ? "bytes" : "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;
        if ($ext eq "pod") {
            my $int_enc = qx{
                perl -C0 -lan -00 -e 'next unless /^=encoding/; print \$F[1]; exit' $path
            };
            if ($int_enc) {
                chomp $int_enc;
                $ext = $int_enc;
                ##print STDERR "$0: reset encoding to $ext on $path\n";
            }
        }
        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        ##print STDERR qq(binmode($handle, ":$enc_name")\n);
        binmode($handle, ":$enc_name") || die "binmode to $enc_name failed";
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }
    yuck("set_encoding: $@");
    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z     => "zcat",
        z     => "gzcat",   # for uncompressing
        gz    => "gzcat",
        bz    => "bzcat",
        bz2   => "bzcat",
        bzip  => "bzcat",
        bzip2 => "bzcat",
        lzma  => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            # HIP HIP HURRAY! for magic open!!!
            # HIP HIP HURRAY! for magic open!!!
            # HIP HIP HURRAY! for magic open!!!
            return "$prog $path |";
        }
    }
    return $path;
}

END {
    close(STDIN)  || die "couldn't close stdin: $!";
    close(STDOUT) || die "couldn't close stdout: $!";
}

UNITCHECK {
    $SIG{PIPE} = sub { exit };
    $SIG{__WARN__} = sub {
        confess "trapped uncaught warning" unless $^S;
    };
}
which, while not necessarily its intended use, did shed some further info.
It dumped the following info:
utf8 "\x99" does not map to Unicode at ./word_lets.cgi line 1
Well, after much further research, I discovered that this particular byte, 0x99, is
the trademark sign -- ™ -- in windows-1252 (often mislabeled Latin-1), which is U+2122 (&#8482; in decimal) in UTF-8.
Now, I'd just stop there, and send Perl, find, grep, cat, or awk on a seek-and-replace
mission, then be done with it. But I'm sure this isn't the last of them.
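If the files really were windows-1252 at some point, the whole family of suspect bytes lives in the 0x80-0x9F range. A small sketch (an assumption, not part of the original post) that prints what each of those bytes would mean, so the replace list can be built once instead of per discovery:

```perl
#!/usr/bin/perl
# Sketch: what each 0x80-0x9F byte means if the file is really
# windows-1252. 0x99 comes out as U+2122 TRADE MARK SIGN.
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

for my $byte (0x80 .. 0x9F) {
    # bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are unmapped in cp1252
    my $ch   = decode('cp1252', chr $byte);
    my $name = charnames::viacode(ord $ch) // '<no name>';
    printf "0x%02X => U+%04X %s\n", $byte, ord $ch, $name;
}
```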
It all wouldn't be such a big deal, except I have over one hundred files to deal with.
Surely I'm not the only one that's had to overcome something like this. I did spend
quite some time trying to find a solution, reading all the perldocs. While there was much to be
learned regarding :unicode && :utf-8 | :utf8, the last time I tried to slurp a file in and modify it
within Perl using unicode | utf8, I ended up with ms-dos/windows line endings (CR/LF), and I'm
on a BSD-UNIX machine. :(
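That CR/LF surprise usually comes from a :crlf PerlIO layer sneaking into the stack. A minimal sketch (not from the thread) that slurps raw, normalizes all three line-ending styles, and writes back without any translating layer:

```perl
#!/usr/bin/perl
# Sketch: normalize DOS (\r\n) and old-Mac (\r) line endings to \n.
# Reading and writing with the :raw layer keeps PerlIO from
# reintroducing \r\n behind your back.
use strict;
use warnings;

sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;    # CRLF first, then any bare CR
    return $text;
}

for my $file (@ARGV) {
    open my $in, '<:raw', $file or die "$file: $!";
    my $data = do { local $/; <$in> } // '';
    close $in;
    open my $out, '>:raw', $file or die "$file: $!";
    print {$out} normalize_newlines($data);
    close $out or die "$file: $!";
}
```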

Any, and all help/pointers greatly appreciated.

Thank you for all your consideration.

--chris

#!/usr/bin/perl -Tw
use perl::always;
my $perl_version = "5.12.4";
print $perl_version;

Replies are listed 'Best First'.
Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by choroba (Cardinal) on May 23, 2013 at 06:03 UTC
    What exactly is your question?

    Maybe this can help you: When dealing with strange UTF-8 documents, I often use the following bash script.

#! /bin/bash
## Lists all non-ASCII UTF-8 characters contained in the data, for each
## character it gives the number of occurrences in each file and an
## example.
## Author: E. Choroba
export LC_ALL=C
codes=()
for code in c{{0..9},{a..f}} d{{0..9},{a..f}} ; do
    codes+=($(eval grep -ho "$'\x$code'". "$@" | sort -u))
done
for code in e{{0..9},{a..f}} ; do
    codes+=($(eval grep -ho "$'\x$code'".. "$@" | sort -u))
done
for code in f{0..4} ; do
    codes+=($(eval grep -ho "$'\x$code'"... "$@" | sort -u))
done
for code in "${codes[@]}" ; do
    hexdump <<< "$code" | sed '2d;s=000a==;s= 0a=='
done \
    | cut -f2 -d' ' \
    | sed '/^....$/s=\(..\)\(..\)=\\x\2\\x\1=; /^......$/s=\(..\)\(..\)\(..\)=\\x\2\\x\1\\x\3=' \
    | while read -r code ; do
        echo $code
        eval grep -c "$'$code'" "$@"
        eval grep -m1 --color=always "$'$code'" "$@"
    done
    I know it is ugly, but it works: it lists all the non-ASCII characters in the given files with counts and examples. It depends on the hexdump utility whose documentation says the following:
    The hexdump command is part of the util-linux package and is available from Linux Kernel Archive ⟨ftp://ftp.kernel.org/pub/linux/utils/util-linux/⟩.
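For comparison, a rough Perl sketch of the same inventory idea (byte-level only, so it is an approximation of what the bash version does, not a port of it):

```perl
#!/usr/bin/perl
# Sketch: tally every non-ASCII byte in the input files, so you can
# see at a glance which suspect values (like \x99) occur and how often.
use strict;
use warnings;

sub tally_high_bytes {
    my ($octets) = @_;
    my %count;
    $count{ sprintf '\x%02X', ord $1 }++ while $octets =~ /([\x80-\xFF])/g;
    return \%count;
}

local $/;                         # slurp each file whole
while (my $octets = <>) {
    my $count = tally_high_bytes($octets);
    printf "%6d  %s\n", $count->{$_}, $_ for sort keys %$count;
}
```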
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Greetings choroba, and thank you for your reply.
      My question is:
If I have a mass of (textual) files that have mixed encoding(s) && line-endings,
aside from iconv(1) | file(1), how can I unify them -- convert them all to
UTF-8 | UTF8?
Given what I do know about Perl, I should be able to slurp them, process
the contents, and spit them out as "unified" -- i.e. UTF-8 text files, all having the same line-endings.
Given that the files I'd be slurping are of mixed "types", is there any way to process
them so they all end up the same "type" when they're done?
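A sketch of that slurp-process-spit-out pipeline (the cp1252 fallback is an assumption -- swap in whatever the non-UTF-8 files actually turn out to be):

```perl
#!/usr/bin/perl
# Sketch: unify mixed-encoding, mixed-line-ending files into UTF-8 + LF.
# Assumption: anything that is not already valid UTF-8 is cp1252.
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

sub unify {
    my ($octets) = @_;
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    $text = decode('cp1252', $octets) unless defined $text;
    $text =~ s/\r\n?/\n/g;                 # one line-ending style: LF
    return encode('UTF-8', $text);
}

for my $file (@ARGV) {
    open my $in, '<:raw', $file or die "$file: $!";
    my $octets = do { local $/; <$in> } // '';
    close $in;
    my $fixed = unify($octets);
    next if $fixed eq $octets;             # already clean, leave it alone
    open my $out, '>:raw', $file or die "$file: $!";
    print {$out} $fixed;
    close $out or die "$file: $!";
}
```

Running it on copies first would be prudent, since a wrong fallback guess silently mangles any file that was really some third encoding.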

      I hope I was clearer || more concise this time. :)

      Thanks again, for your response.

      --chris

      #!/usr/bin/perl -Tw
      use perl::always;
      my $perl_version = "5.12.4";
      print $perl_version;
Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by Anonymous Monk on May 23, 2013 at 06:40 UTC
      Greetings, and thank you for your reply.
This is nearly the same output I received running the Perl script I posted.
The script merely indicated that Unicode::UCD couldn't properly map "\x99" (0x99), or &#8482; in decimal,
to a Unicode symbol/entity. In all likelihood, that was because the document wasn't properly encoded
(windows-1252|ISO-8859-1) instead of UTF-8|UTF8. I've examined enough of the documents
to know that they aren't "junk", but rather UTF-8 encoded files that weren't saved accordingly.
So, knowing that Perl is quite Unicode|UTF-8 savvy, I was hoping I could find
a way to let Perl discover a file's current incorrect encoding -- say ISO-8859-1 -- and either
convert the embedded symbols to their decimal equivalents, or, if it's safe, save it as UTF-8.
In fact, saving that same document as UTF-8 and running the script on it caused
the script to emit that error. Running the script on the same document, with the embedded symbols/characters in it, while it was still
ISO-8859-1, emitted:
    6 U+0009 GC=Cc CHARACTER TABULATION
 2564 U+000A GC=Cc LINE FEED (LF)
25209 U+0020 GC=Zs SPACE
 8436 U+0021 GC=Po EXCLAMATION MARK
  167 U+0022 GC=Po QUOTATION MARK
   35 U+0023 GC=Po NUMBER SIGN
    7 U+0024 GC=Sc DOLLAR SIGN
 1140 U+0025 GC=Po PERCENT SIGN
   46 U+0026 GC=Po AMPERSAND
  108 U+0027 GC=Po APOSTROPHE
  134 U+0028 GC=Ps LEFT PARENTHESIS
  134 U+0029 GC=Pe RIGHT PARENTHESIS
   14 U+002A GC=Po ASTERISK
 2751 U+002C GC=Po COMMA
  439 U+002D GC=Pd HYPHEN-MINUS
 1655 U+002E GC=Po FULL STOP
  518 U+002F GC=Po SOLIDUS
   73 U+0030 GC=Nd DIGIT ZERO
   91 U+0031 GC=Nd DIGIT ONE
  107 U+0032 GC=Nd DIGIT TWO
   53 U+0033 GC=Nd DIGIT THREE
   30 U+0034 GC=Nd DIGIT FOUR
   49 U+0035 GC=Nd DIGIT FIVE
   13 U+0036 GC=Nd DIGIT SIX
    5 U+0037 GC=Nd DIGIT SEVEN
   21 U+0038 GC=Nd DIGIT EIGHT
   12 U+0039 GC=Nd DIGIT NINE
  331 U+003A GC=Po COLON
   43 U+003B GC=Po SEMICOLON
  714 U+003C GC=Sm LESS-THAN SIGN
 2176 U+003D GC=Sm EQUALS SIGN
 2853 U+003E GC=Sm GREATER-THAN SIGN
  103 U+003F GC=Po QUESTION MARK
    4 U+0040 GC=Po COMMERCIAL AT
  665 U+0041 GC=Lu LATIN CAPITAL LETTER A
  547 U+0042 GC=Lu LATIN CAPITAL LETTER B
  370 U+0043 GC=Lu LATIN CAPITAL LETTER C
  331 U+0044 GC=Lu LATIN CAPITAL LETTER D
  625 U+0045 GC=Lu LATIN CAPITAL LETTER E
  323 U+0046 GC=Lu LATIN CAPITAL LETTER F
  104 U+0047 GC=Lu LATIN CAPITAL LETTER G
  171 U+0048 GC=Lu LATIN CAPITAL LETTER H
  509 U+0049 GC=Lu LATIN CAPITAL LETTER I
   32 U+004A GC=Lu LATIN CAPITAL LETTER J
   83 U+004B GC=Lu LATIN CAPITAL LETTER K
  378 U+004C GC=Lu LATIN CAPITAL LETTER L
  594 U+004D GC=Lu LATIN CAPITAL LETTER M
  520 U+004E GC=Lu LATIN CAPITAL LETTER N
  410 U+004F GC=Lu LATIN CAPITAL LETTER O
  653 U+0050 GC=Lu LATIN CAPITAL LETTER P
   39 U+0051 GC=Lu LATIN CAPITAL LETTER Q
  623 U+0052 GC=Lu LATIN CAPITAL LETTER R
  564 U+0053 GC=Lu LATIN CAPITAL LETTER S
  912 U+0054 GC=Lu LATIN CAPITAL LETTER T
  486 U+0055 GC=Lu LATIN CAPITAL LETTER U
   89 U+0056 GC=Lu LATIN CAPITAL LETTER V
  196 U+0057 GC=Lu LATIN CAPITAL LETTER W
    8 U+0058 GC=Lu LATIN CAPITAL LETTER X
  394 U+0059 GC=Lu LATIN CAPITAL LETTER Y
    4 U+005A GC=Lu LATIN CAPITAL LETTER Z
   21 U+005B GC=Ps LEFT SQUARE BRACKET
   21 U+005D GC=Pe RIGHT SQUARE BRACKET
    5 U+005E GC=Sk CIRCUMFLEX ACCENT
 4766 U+005F GC=Pc LOW LINE
10143 U+0061 GC=Ll LATIN SMALL LETTER A
 2570 U+0062 GC=Ll LATIN SMALL LETTER B
 4103 U+0063 GC=Ll LATIN SMALL LETTER C
 4907 U+0064 GC=Ll LATIN SMALL LETTER D
16937 U+0065 GC=Ll LATIN SMALL LETTER E
 2591 U+0066 GC=Ll LATIN SMALL LETTER F
 2564 U+0067 GC=Ll LATIN SMALL LETTER G
 3859 U+0068 GC=Ll LATIN SMALL LETTER H
 9548 U+0069 GC=Ll LATIN SMALL LETTER I
   87 U+006A GC=Ll LATIN SMALL LETTER J
  502 U+006B GC=Ll LATIN SMALL LETTER K
 6444 U+006C GC=Ll LATIN SMALL LETTER L
 4640 U+006D GC=Ll LATIN SMALL LETTER M
 7574 U+006E GC=Ll LATIN SMALL LETTER N
10936 U+006F GC=Ll LATIN SMALL LETTER O
 4417 U+0070 GC=Ll LATIN SMALL LETTER P
 4481 U+0071 GC=Ll LATIN SMALL LETTER Q
10310 U+0072 GC=Ll LATIN SMALL LETTER R
10046 U+0073 GC=Ll LATIN SMALL LETTER S
11385 U+0074 GC=Ll LATIN SMALL LETTER T
 4523 U+0075 GC=Ll LATIN SMALL LETTER U
 1888 U+0076 GC=Ll LATIN SMALL LETTER V
 1574 U+0077 GC=Ll LATIN SMALL LETTER W
  537 U+0078 GC=Ll LATIN SMALL LETTER X
 2773 U+0079 GC=Ll LATIN SMALL LETTER Y
   80 U+007A GC=Ll LATIN SMALL LETTER Z
   19 U+007B GC=Ps LEFT CURLY BRACKET
   10 U+007C GC=Sm VERTICAL LINE
   19 U+007D GC=Pe RIGHT CURLY BRACKET
  207 U+007E GC=Sm TILDE
   55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>
    3 U+00A0 GC=Zs NO-BREAK SPACE
      (55 U+0099 GC=Cc <unnamed code point in Latin-1 Supplement>)
      being the offending symbol/character.
      Anyway, I see you've provided some other possibilities. So I'd probably do well to further investigate them.

      Thanks again, for taking the time to respond.

      --chris

      #!/usr/bin/perl -Tw
      use perl::always;
my $perl_version = "5.12.4";
      print $perl_version;
Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by Khen1950fx (Canon) on May 23, 2013 at 06:04 UTC
    I think that you might need to use String::BOM.
#!/usr/bin/perl -l
use strict;
use warnings;
use String::BOM qw(strip_bom_from_file);

my $file = '/path/to/file';
print strip_bom_from_file($file);
Greetings Khen1950fx, and thank you for your reply.
That's good advice. But if I understand what you've offered, it would strip the BOM
(Byte Order Mark) from the file(s). While that was probably the case somewhere in their
past, none of them appear to have one now.
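Whether a file still carries a BOM is easy to verify mechanically. A small sketch (not from the thread) that peeks at the first raw bytes for the common marks, with UTF-32 omitted for brevity:

```perl
#!/usr/bin/perl
# Sketch: report whether each file starts with a BOM. Only the three
# common marks are checked; UTF-32 BOMs are left out for brevity.
use strict;
use warnings;

my %BOM = (
    "\xEF\xBB\xBF" => 'UTF-8',
    "\xFE\xFF"     => 'UTF-16BE',
    "\xFF\xFE"     => 'UTF-16LE',
);

# Return the BOM's encoding name, or undef if the data has no BOM.
sub bom_of {
    my ($octets) = @_;
    # longest mark first, so UTF-8's 3 bytes win over a 2-byte match
    for my $bom (sort { length $b <=> length $a } keys %BOM) {
        return $BOM{$bom} if rindex($octets, $bom, 0) == 0;
    }
    return undef;
}

for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "$file: $!";
    read $fh, my $head, 4;
    close $fh;
    my $kind = bom_of($head // '');
    printf "%s: %s\n", $file, defined $kind ? "BOM ($kind)" : 'no BOM';
}
```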

      Thank you again, for your response

      --chris

      #!/usr/bin/perl -Tw
      use perl::always;
      my $perl_version = "5.12.4";
      print $perl_version;

      I think that you might need to use String::BOM

Why do you think that? Which portion of the OP's question is that supposed to solve?