taint has asked for the wisdom of the Perl Monks concerning the following question:
Greetings everyone,
I'm trying to resurrect some old Perl/CGI scripts -- a forum/bulletin board. Problem is,
I lost my "clean" copy, and now I'm stuck
dealing with a copy that's been handled by who-knows-who, with who-knows-what. So it's
been subjected to Windows (Office, Word 95, WinWord), Macintosh (SimpleText, some other Mac editors),
and who knows what else. As a result, the files have probably been opened with a BOM added, then
saved as windows-1252, then opened and saved as UTF-8, then saved as ISO-8859-1, then ?? -- well,
you get the picture. I've tried running them through my handy dos2unix script to at least unify
the line endings. I then ran the following:
```sh
#!/bin/sh -
for i in $(find . -type f)
do
    iconv -f ISO-8859-1 -t UTF-8 "$i" > "$i.tmp"
    rm "$i"
    mv "$i.tmp" "$i"
done
```

which, of course, assumes they're all ISO-8859-1 -- which they are not.
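Since iconv has to be told the source encoding, one way around the "they're not all ISO-8859-1" problem is to try a strict UTF-8 decode first and fall back to cp1252 only when that fails. This is a sketch of that idea, not the original script; the filename handling is mine, and it assumes the inputs are either valid UTF-8 or single-byte cp1252/Latin-1:

```perl
#!/usr/bin/perl
# to_utf8.pl - sketch: normalize one file to UTF-8 without guessing blindly.
# Assumption: each input is either already-valid UTF-8 or cp1252/Latin-1.
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

my $file = shift or die "usage: $0 file\n";

open my $in, '<:raw', $file or die "open $file: $!";
my $bytes = do { local $/; <$in> };
close $in;

# If the bytes decode cleanly as strict UTF-8, keep that; otherwise treat
# them as cp1252, which covers Latin-1 plus the 0x80-0x9F range (including
# 0x99 = U+2122 TRADE MARK SIGN).
my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) }
        // decode('cp1252', $bytes);

$text =~ s/^\x{FEFF}//;    # strip a leading BOM if one snuck in

open my $out, '>:raw', "$file.new" or die "open $file.new: $!";
print $out encode('UTF-8', $text);
close $out or die "close $file.new: $!";
```

Writing to `$file.new` rather than clobbering in place makes it easy to diff before committing to the conversion.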
I also ran them through another script I cobbled together using file(1), e.g. `file -i`.
That helped, but the results were still less than optimal. So when I finally felt I had managed to
unify them into a UTF-8 state, I began to edit them, only to discover
little square boxes showing up in my editor. Closer examination showed they were
0x99, which is called "SINGLE GRAPHIC CHARACTER INTRODUCER" -- not very helpful, to
me anyway. I decided it would have to be "Perl to the rescue", and set out to find a way
to parse these files and get more info (Perl is MUCH smarter than I am). I discovered the following:
```perl
#!/usr/bin/env perl
#
# unicount - count code points in input
# Tom Christiansen <tchrist@perl.com>

use v5.12;
use strict;
use sigtrap;
use warnings;
use charnames ();
use Carp qw(carp croak confess cluck);
use List::Util qw(max);
use Unicode::UCD qw(charinfo charblock);

sub fix_extension;
sub process_input (&);
sub set_encoding (*$);
sub yuck ($);

my $total = 0;
my %seen  = ();

# deep magic here
process_input {
    $total += length;
    $seen{$_}++ for split //;
};

my $dec_width = length($total);
my $hex_width = max(4, length sprintf("%x", max map { ord } keys %seen));

for (sort keys %seen) {
    my $count = $seen{$_};
    my $gcat  = charinfo(ord())->{category};
    my $name  = charnames::viacode(ord())
             || "<unnamed code point in @{[charblock(ord())]}>";
    printf "%*d U+%0*X GC=%2s %s\n",
        $dec_width => $count,
        $hex_width => ord(),
        $gcat      => $name;
}
exit;

##################################################

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

sub process_input(&) {
    my $function = shift();
    my $enc;

    if (@ARGV == 0 && -t STDIN && -t STDERR) {
        print STDERR "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:
    for my $file (@ARGV) {
        # don't let magic open make an output handle
        next if -e $file && ! -f _;
        my $quasi_filename = fix_extension($file);
        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;
        no strict "refs";
        my $fh = $file;    # is *so* a lexical filehandle!
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            # could just do this a line at a time, but not if counting \R's
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };
        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        do {
            # much faster to alias than to copy
            local *_ = \$whole_file;
            &$function;
        };

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }
    } # foreach file
}

# Encoding set to (after unzipping):
#   if    file.pod                               => use whatever =encoding says
#   elsif file.ENCODING for legal encoding name  => use that one
#   elsif file is binary                         => use bytes
#   else                                         => use utf8
#
# Note that gzipped stuff always shows up as bytes this way, but
# its internal unzipped bytes are still counted after unzipping.
sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = (-f $path && -B $path) ? "bytes" : "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;
        if ($ext eq "pod") {
            my $int_enc = qx{
                perl -C0 -lan -00 -e 'next unless /^=encoding/; print \$F[1]; exit' $path
            };
            if ($int_enc) {
                chomp $int_enc;
                $ext = $int_enc;
                ##print STDERR "$0: reset encoding to $ext on $path\n";
            }
        }
        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        ##print STDERR qq(binmode($handle, ":$enc_name")\n);
        binmode($handle, ":$enc_name") || die "binmode to $enc_name failed";
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }
    yuck("set_encoding: $@");
    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z     => "zcat",
        z     => "gzcat",    # for uncompressing
        gz    => "gzcat",
        bz    => "bzcat",
        bz2   => "bzcat",
        bzip  => "bzcat",
        bzip2 => "bzcat",
        lzma  => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s]+ ) \z }x) {
        if (my $prog = $Compress{$1}) {
            # HIP HIP HURRAY! for magic open!!!
            # HIP HIP HURRAY! for magic open!!!
            # HIP HIP HURRAY! for magic open!!!
            return "$prog $path |";
        }
    }
    return $path;
}

END {
    close(STDIN)  || die "couldn't close stdin: $!";
    close(STDOUT) || die "couldn't close stdout: $!";
}

UNITCHECK {
    $SIG{PIPE} = sub { exit };
    $SIG{__WARN__} = sub {
        confess "trapped uncaught warning" unless $^S;
    };
}
```

which, while not necessarily its intended use, did shed some further info.
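For quick checks, a much shorter scan does most of the diagnostic work. This is a sketch of the same counting idea, not a replacement for the full script above: it deliberately reads raw bytes, so in a clean UTF-8 file you see only expected lead/continuation bytes, while a stray 0x99 stands out immediately:

```perl
#!/usr/bin/perl
# highbytes.pl - sketch: count every byte above 0x7F in the given files and
# name the code point that byte value corresponds to, to spot strays.
use strict;
use warnings;
use charnames ();
use Unicode::UCD qw(charblock);

my %seen;
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "open $file: $!";
    while (<$fh>) {
        $seen{$_}++ for grep { ord() > 0x7F } split //;
    }
    close $fh;
}

for my $byte (sort keys %seen) {
    my $name = charnames::viacode(ord $byte)
            // "<unnamed code point in @{[ charblock(ord $byte) ]}>";
    printf "%6d 0x%02X %s\n", $seen{$byte}, ord $byte, $name;
}
```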
It dumped the following info:

```
"\x99" does not map to Unicode at ./word_lets.cgi line 1
```

Well, after much further research, I discovered that particular character is the trademark
sign: byte 0x99 in windows-1252, which should be ™ (U+2122, decimal 8482) in UTF-8.
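That mapping is easy to confirm from Perl with the Encode module (`cp1252` is the name Encode uses for windows-1252):

```perl
# Quick check: byte 0x99 decoded as windows-1252 is U+2122, the (TM) sign.
use strict;
use warnings;
use Encode qw(decode);

my $ch = decode('cp1252', "\x99");
printf "U+%04X\n", ord $ch;    # prints U+2122
```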
Now, I'd just stop there and send perl, find, grep, cat, or awk on a seek-and-replace
mission, and be done with it. But I'm sure this isn't the last of them.
It all wouldn't be such a big deal, except I have over one hundred files to deal with.
Surely I'm not the only one who's had to overcome something like this. I did spend
quite some time reading the perldocs trying to find a solution. While there was much to
be learned regarding the :encoding(UTF-8) and :utf8 layers, the last time I tried to slurp in
a file and modify it within Perl using them, I ended up with MS-DOS/Windows line endings
(CR/LF), and I'm on a BSD Unix machine. :(
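For what it's worth, slurping and rewriting through an explicit `:encoding(UTF-8)` layer shouldn't touch line endings at all, since no `:crlf` translation layer is involved. A sketch of that round trip (the `s///` is just a placeholder edit, and clobbering the file in place assumes you have a backup):

```perl
#!/usr/bin/perl
# fixup.pl - sketch: slurp a UTF-8 file, edit it in memory, write it back.
# The :encoding(UTF-8) layer handles only the encoding; nothing here
# translates "\n", so Unix line endings survive the round trip.
use strict;
use warnings;

my $file = shift or die "usage: $0 file\n";

open my $in, '<:encoding(UTF-8)', $file or die "open $file: $!";
my $text = do { local $/; <$in> };
close $in;

$text =~ s/\x{2122}/(TM)/g;    # placeholder edit: replace every (TM) sign

open my $out, '>:encoding(UTF-8)', $file or die "rewrite $file: $!";
print $out $text;
close $out or die "close $file: $!";
```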
Any, and all help/pointers greatly appreciated.
Thank you for all your consideration.
--chris
```perl
#!/usr/bin/perl -Tw
use perl::always;
my $perl_version = "5.12.4";
print $perl_version;
```
Replies are listed 'Best First'.
Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by choroba (Cardinal) on May 23, 2013 at 06:03 UTC
    by taint (Chaplain) on May 23, 2013 at 06:21 UTC
Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by Anonymous Monk on May 23, 2013 at 06:40 UTC
    by taint (Chaplain) on May 23, 2013 at 14:30 UTC
Re: Can Perl convert ISO-? | WIN-? | MAC-? to UTF-8?
by Khen1950fx (Canon) on May 23, 2013 at 06:04 UTC
    by taint (Chaplain) on May 23, 2013 at 06:26 UTC
    by Anonymous Monk on May 23, 2013 at 06:30 UTC