
ASCII, Unicode, use utf8: My Story of Discovery

by princepawn (Parson)
on Nov 01, 2002 at 16:54 UTC (#209781)

One day ago my boss said that some text files could not be integrated with a search engine because the characters with a cedilla on them were causing it to barf.

When I looked at this file in Emacs, I saw this: \347. I did a bit of reading on how Emacs sees a buffer as a sequence of 8-bit bytes unless you turn on various interpretation modes, and then went on about my business.

Next I saw the offending character in the Windows command window, which somehow displayed the C with a cedilla as a Greek "tau", which has ASCII value 231. And sure enough, when I ran my file through this program:

    use strict;

    my $file = shift or die 'must supply filename';

    open my $fh, $file or die "couldnt open $file: $!";

    $\="\n";

    while (<$fh>) {
        # print $.,$/;
        my $char;
        while (/\G(.)/g) {
            ++$char;
            my $c=$1;
            if ($c =~ /[[:^print:]]/) {
                print "plain_text test failed on row $. with char # $char: <$c>\n" .
                      "Unicode Value: " . unpack('C', $c);
                print "context: " . substr($_, 0, $char+5);
            }
        }
    }

It said that the bad character was 231. But it also flagged a lot of other things in the file, possibly because one "wide" character was being interpreted as several 8-bit chars. Then I read up on the utf8 pragma in the Perl standard docs and put use utf8 at the top of the program. And presto-chango, only the cedilla was detected.
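
The difference between scanning bytes and scanning characters can be reproduced with a small sketch. It assumes the input is UTF-8 and decodes it explicitly with the core Encode module, which is how current Perl versions usually get character semantics (the utf8 pragma now only describes the encoding of the source file itself). The sample string and the helper name non_ascii_count are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "garcon" spelled with a c-cedilla, as the raw UTF-8 bytes found in a file.
my $bytes = "gar\xC3\xA7on";

# Explicit decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

# Count the elements (bytes or characters) outside printable ASCII.
sub non_ascii_count {
    my ($string) = @_;
    my $count = () = $string =~ /[^\x20-\x7E]/g;
    return $count;
}

printf "bytes: length %d, %d flagged\n", length($bytes), non_ascii_count($bytes);
printf "chars: length %d, %d flagged\n", length($chars), non_ascii_count($chars);
```

The undecoded string flags two elements for the single accented letter, while the decoded one flags exactly one.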

What happened was that Perl saw my file as a sequence of Unicode characters instead of as a sequence of 8-bit bytes. So how does one decide which Unicode characters are appropriate for their application? One takes a look at the Unicode Code Charts. After looking at these, I was certain of which Unicode values I wanted to accept, but I was only fairly sure that the Perl POSIX [:print:] character class was equivalent. So, I took the low road (or is that the high road?) and wrote this to determine whether to accept a string of text:

    use utf8;

    my $U;
    while ($column =~ /\G(.)/g) {
        # WRONG! Thanks John M. Dlugosz
        # $U = unpack('C', $1);
        $U = unpack('U', $1);    # Now that's the ticket
        $U < 127 and $U > 31 or return;
    }
    return 1;

Replies are listed 'Best First'.
Re: ASCII, Unicode, use utf8: My Story of Discovery
by John M. Dlugosz (Monsignor) on Nov 01, 2002 at 19:45 UTC
    In your first listing, the regex engine will match on every byte, so if you feed it a UTF-8-encoded file it will report multi-byte sequences as their component byte values. Meanwhile, you are unpacking 'C', which also "only does bytes". So the program is consistent, but the output labeling is wrong: it's not "Unicode Value:", it's "byte value:".
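
As a rough illustration of those component values, the sketch below encodes one code point to UTF-8 and unpacks the result byte by byte with 'C*'. The helper name utf8_byte_values is hypothetical, and it relies on the core Encode module rather than anything from the original listing.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the byte values of the UTF-8 encoding of a single code point.
sub utf8_byte_values {
    my ($code_point) = @_;
    my $encoded = encode('UTF-8', chr $code_point);
    return unpack 'C*', $encoded;
}

print join(' ', utf8_byte_values(0xE7)), "\n";   # c-cedilla: 195 167
print join(' ', utf8_byte_values(0x41)), "\n";   # capital A: 65
```

A byte-oriented scan of a UTF-8 file would therefore report 195 and 167 for the c-cedilla, never 231.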

    On your second listing, the regex is in the scope of use utf8, so the dot will match a multi-byte character as one character. But then you use unpack 'C' again which ignores the fact that $1 might have a multi-byte character in it, and just returns the value of the first byte.

    Now UTF8-encoding is designed to overcome the headaches of past variable-length encoding systems. It's very easy because a character that's represented by a single byte always has the high bit cleared, and every byte that is part of a multi-byte sequence has its high bit set.
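
That high-bit property is easy to check by hand. This is only a sketch: high_bit_flags is a made-up helper that encodes a code point with the core Encode module and reports, for each resulting byte, whether bit 7 is set.

```perl
use strict;
use warnings;
use Encode qw(encode);

# For each byte of the UTF-8 encoding, return 1 if the high bit is set, else 0.
sub high_bit_flags {
    my ($code_point) = @_;
    return map { $_ & 0x80 ? 1 : 0 } unpack 'C*', encode('UTF-8', chr $code_point);
}

print join('', high_bit_flags(0x41)),   "\n";   # A, one byte: 0
print join('', high_bit_flags(0xE7)),   "\n";   # c-cedilla, two bytes: 11
print join('', high_bit_flags(0x20AC)), "\n";   # euro sign, three bytes: 111
```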

    So, when you test for your range of 32..126 inclusive, you are indeed going to test for ASCII graphics characters because (by design) UTF8 is a proper superset of ASCII. You are picking out those bytes that are single-byte characters that also are not control codes (<32) or the DEL character (127).

    The unpack is a roundabout way of doing that. If you just used ord(), you would respect the multibyte nature of what's in $1, and get numbers >255 as applicable. This would work about the same but would not rely on this artifact of UTF8 encoding.
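
A minimal sketch of the ord() approach, assuming the string has already been decoded into characters. The function name is_printable_ascii is invented here, and the 32..126 bounds are the ones from the original test.

```perl
use strict;
use warnings;

# Return 1 if every character is in the printable ASCII range 32..126, else 0.
sub is_printable_ascii {
    my ($text) = @_;
    for my $char (split //, $text) {
        my $code = ord $char;    # full code point, may exceed 255
        return 0 if $code < 32 || $code > 126;
    }
    return 1;
}

print is_printable_ascii("plain text"),     "\n";   # 1
print is_printable_ascii("gar\x{E7}on"),    "\n";   # 0 (c-cedilla is 231)
print is_printable_ascii("smile \x{263A}"), "\n";   # 0 (code point above 255)
```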

    But you can skip the while loop completely!

        use utf8;
        return ! ($column =~ /[^\x20-\x7e]/);

    will also return false if $column contains any character outside of the 32..126 range, and true if it contains only characters in that range.

    Meanwhile, ASCII doesn't have a character 231, since it only goes up to 127. Windows displays an ANSI character in your current code page, which varies based on what country you are in. The command window is using the OEM character set to be compatible with old DOS text programs, which is why it interprets 231 as a different character!
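
The code page effect can be sketched with the core Encode module by decoding the same byte under different encodings. The choice of cp437 as the OEM code page and cp1252 as the ANSI code page is an assumption (typical for a US English Windows setup), and code_point_in is a made-up helper.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a single byte under the named encoding and return its Unicode code point.
sub code_point_in {
    my ($encoding, $byte_value) = @_;
    return ord(decode($encoding, chr($byte_value)));
}

# Byte 231 under an OEM code page, an ANSI code page, and Latin-1.
for my $encoding (qw(cp437 cp1252 iso-8859-1)) {
    printf "%-10s U+%04X\n", $encoding, code_point_in($encoding, 231);
}
```

Under cp437 the byte maps to U+03C4 (Greek small letter tau), while cp1252 and ISO-8859-1 both map it to U+00E7 (c with cedilla).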


Re: ASCII, Unicode, use utf8: My Story of Discovery
by Rich36 (Chaplain) on Nov 01, 2002 at 18:35 UTC

    Oddly enough, I had just started to look for code that would allow me to detect special (non-alphanumeric) characters in ASCII text. The code

        $U = unpack('C', $1);
        $U < 127 and $U > 31 or return;

    was exactly what I needed. Thanks very much for sharing that.

Re: ASCII, Unicode, use utf8: My Story of Discovery
by richardX (Pilgrim) on Nov 04, 2002 at 04:13 UTC
    I have had my log parsing routines crash because of invalid ASCII characters, so I run this code against the logs, which cleans up the bad boys found.

        # loop through the file zapping the bad characters found
        while(<FILE>) {
            $lineBuff = $_;
            # remove upper ascii
            $lineBuff =~ s/([\x7F-\xFF]+)/$delimiter/gm;
            # remove lower ascii
            $lineBuff =~ s/([\x00-\x1F]+)/$delimiter/gm;
            $lineBuff =~ s/\%//gm;
            # send the clean data to the output file
            print OUT "$lineBuff\n";
        }
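
The same cleanup can be wrapped in a function so it is testable on its own. This is a sketch with two deliberate differences from the loop above: it chomps the line first so the trailing newline is not turned into a delimiter, and it uses a single character class, so an adjacent run of low and high bytes collapses into one delimiter. The name clean_line is hypothetical.

```perl
use strict;
use warnings;

# Replace each run of bytes outside 0x20..0x7E with the delimiter and drop '%'.
sub clean_line {
    my ($line, $delimiter) = @_;
    chomp $line;                             # keep the newline out of the substitution
    $line =~ s/[^\x20-\x7E]+/$delimiter/g;   # control bytes, DEL and high bytes
    $line =~ tr/%//d;                        # remove percent signs
    return $line;
}

print clean_line("a\x00\x01b\xFFc%d\n", '|'), "\n";   # a|b|cd
```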


    There are three types of people in this world, those that can count and those that cannot. Anon

Node Type: perlmeditation [id://209781]
Approved by scain
Front-paged by jarich