PerlMonks  

Search & replace of UTF-8 characters ?

by levien (Initiate)
on Feb 25, 2010 at 16:25 UTC (#825333)
levien has asked for the wisdom of the Perl Monks concerning the following question:

I would like to request the wisdom of the Monks on the hairy subject of UTF-8...

To fix some problems with BibTeX and UTF-8 characters, I've made a little Perl script that (among other things) should replace some UTF-8 characters with TeX macros. The problem is that I cannot seem to get the regular expressions to recognise UTF-8 characters.

I use something like:

while (my $line = <>) { $line =~ s/\x{00a9}/\\textcopyright/g; # "" $line =~ s/\x{2010}/\-/g; # "&#8208;" $line =~ s/\x{fffd}/\\,/g; # "&#65533;" $line =~ s/\x{03b4}/\$\\delta\$/g; # "&#948;" $line =~ s/\x{00c5}/\\AA\{\}/g; # "" print $line; }

I run this in Perl 5.10.0 on a 64-bit Ubuntu Jaunty system. The locale is set to en_US.UTF-8, and I checked that the input file is really UTF-8.

When I run the script, however, it seems to replace only the 8-bit characters 0xa9 and 0xc5 (resulting in invalid UTF-8 output), instead of replacing the UTF-8 ones as I intended. I tried adding "use utf8", "-CIO" and/or setting STDIN and STDOUT to ":utf8" using binmode, but it doesn't seem to make a difference.

I'm a bit stuck now. Does anyone know what I'm doing wrong?

Best regards, Levien

Re: Search & replace of UTF-8 characters ?
by ikegami (Pope) on Feb 25, 2010 at 16:42 UTC

    $line is still encoded. A character won't match the UTF-8 encoding of that character unless it's an ASCII character.
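A minimal sketch of that point, assuming the line arrives as raw bytes (as it does from an undecoded filehandle). The variable names are illustrative only; the delta character is one of those from the original question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as read from an undecoded handle:
# the UTF-8 encoding of U+03B4 (GREEK SMALL LETTER DELTA) is CE B4.
my $bytes = "\xCE\xB4 = 0.1\n";

# Against the raw bytes, the character U+03B4 is not found.
my $matched_raw = ($bytes =~ /\x{03b4}/) ? 1 : 0;

# After decoding, the string holds the single character U+03B4
# and the same pattern matches.
my $text = decode('UTF-8', $bytes);
my $matched_decoded = ($text =~ /\x{03b4}/) ? 1 : 0;

printf "raw: %d, decoded: %d\n", $matched_raw, $matched_decoded;
```

This should print "raw: 0, decoded: 1": the pattern is the same in both cases, only the string it is applied to differs.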

    You had the right idea with -C / use open / binmode. The catch is that you're not reading from STDIN, you're reading from ARGV, and those don't work well if at all with ARGV.

    The solution: Don't use ARGV.

    # Map each character to its TeX replacement.
    my %fixes = (
        "\x{00a9}" => '\\textcopyright',
        "\x{2010}" => '-',
        "\x{fffd}" => '\\,',
        "\x{03b4}" => '$\\delta$',
        "\x{00c5}" => '\\AA{}',
    );

    # One alternation that matches any of the keys.
    my ($re) = map qr/$_/, join '|', map quotemeta, keys(%fixes);

    # Do by hand what <> would do, but with a decoding layer on each handle.
    @ARGV = '-' if !@ARGV;
    for my $ARGV (@ARGV) {
        my $fh;
        if ($ARGV eq '-') {
            open($fh, '<&:encoding(UTF-8)', *STDIN)
                or die("Can't dup STDIN: $!\n");
        } else {
            open($fh, '<:encoding(UTF-8)', $ARGV)
                or die("Can't open \"$ARGV\": $!\n");
        }

        for (;;) {
            last if eof($fh);
            defined( my $line = <$fh> )
                or die("Can't read from \"$ARGV\": $!\n");
            $line =~ s/($re)/$fixes{$1}/g;
            print $line;
        }
    }

    Yeah, it sucks. Especially since ARGV normally does that error checking for you.

      I'm asking, not arguing...

      Wouldn't it have worked if the script was called via a command-line pipe? So if it was called as

      ./levians_program.pl < source.bib > source_corrected.bib
      that ought to work, right?

        If you added some means of decoding to STDIN and encoding STDOUT, yes.

        >perl -CSD -we"print chr 0x2660" | perl -CSD -we"printf qq{%X\n}, ord +<STDIN>"
        2660

        -C even works if you read STDIN through ARGV:

        >perl -CSD -we"print chr 0x2660" | perl -CSD -we"printf qq{%X\n}, ord +<>"
        2660

        I don't have time to check the other tools right now.

        Update: Hey! -C DOES work with ARGV. I knew binmode and use open had problems with ARGV, so I took the OP's word for it when he said -C didn't work with it either.

        >perl -CSD -we"print chr 0x2660" > foo
        >perl -CSD -we"printf qq{%X\n}, ord <>" foo
        2660
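For the piped case asked about above, here is a rough sketch of "decode on the way in, encode on the way out". In-memory handles stand in for STDIN and STDOUT so the example runs on its own (an assumption for illustration); in a real filter, binmode(STDIN, ':encoding(UTF-8)') and binmode(STDOUT, ':encoding(UTF-8)') would play the same role. Only two of the original replacements are included, and the sample input is made up.

```perl
use strict;
use warnings;

my %fixes = (
    "\x{00a9}" => '\\textcopyright',
    "\x{03b4}" => '$\\delta$',
);
my $re = join '|', map { quotemeta } keys %fixes;

# Hypothetical input: UTF-8 bytes for "© 2010, δ = 0.1".
my $input  = "\xC2\xA9 2010, \xCE\xB4 = 0.1\n";
my $output = '';

# The layers decode bytes to characters on read and encode them on write.
open(my $in,  '<:encoding(UTF-8)', \$input)  or die "Can't open input: $!\n";
open(my $out, '>:encoding(UTF-8)', \$output) or die "Can't open output: $!\n";

while (my $line = <$in>) {
    $line =~ s/($re)/$fixes{$1}/g;   # operates on characters, not bytes
    print {$out} $line;
}
close($out) or die "Can't close output: $!\n";

print $output;
```

The substitution itself is unchanged from the code earlier in the thread; the only difference is where the decoding layer is attached.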
Re: Search & replace of UTF-8 characters ?
by 7stud (Deacon) on Feb 25, 2010 at 17:19 UTC
    $line is still encoded. A character won't match the UTF-8 encoding of that character unless it's an ASCII character.

    While that may be an accurate statement, trying to decipher what it means is not easy.

    Here is how I would put it: a unicode character is not the same as a unicode character encoded in UTF-8. There are many encodings, and UTF-8 is only one of them. However, there is only one unicode character for the copyright symbol. Simply put, if you want to match UTF-8 characters in a string, then you need to use UTF-8 characters in your substitution--not unicode characters.

    Here is a code example:

    use strict;
    use warnings;
    use 5.010;
    use Encode;

    my $unicode_str = "\x{00a9}";
    my $utf8_str = encode('utf-8', $unicode_str);
    say $utf8_str;   #copyright symbol

    my $line = "$utf8_str hello world";
    $line =~ s/$utf8_str/\\textcopyright/;
    say $line;   #\textcopyright hello world

    #Or you can just start with the UTF-8 character
    #for the copyright symbol:
    $line = "\xC2\xA9 hello world";
    say $line;   #copyright symbol followed by 'hello world'
    $line =~ s/\xC2\xA9/\\textcopyright/;
    say $line;   #\textcopyright hello world

    In my opinion, the easiest way to understand the whole unicode thing is this: a unicode escape sequence is an integer. An 'encoding' converts a unicode integer into a character. An encoding is just a list that looks like this:

    1 => chinese character for the new year
    2 => japanese character for fish
    3 => happy face
    ...
    ...
    60,000 => mongolian character for beef
    ...

    So an encoding takes unicode integers and translates them into characters. Different encodings translate the unicode integers into different characters. UTF-8 is just one encoding, which is very popular.

      While that may be an accurate statement, trying to decipher what it means is not easy

      I didn't want to spend much time confirming something the OP appeared to already know, but thanks for elaborating.

      Update: Although I think your elaboration is flawed.

      a unicode escape sequence is an integer. An 'encoding' converts a unicode integer into a character. An encoding is just a list that looks like this:

      Determining which character a value represents is unrelated to encoding/decoding.

      Decoding from UTF-8:

      ...
      01       => 01   START OF HEADING
      ...
      30       => 30   DIGIT ZERO
      ...
      E2 99 A0 => 2660 BLACK SPADE SUIT
      ...

      Encoding is the reverse operation.

      There is no difference between 2660 and black spade suit. Black spade suit is just a meaning assumed by 2660. Decoding is definitely not the process of going from 2660 to black spade suit as you claim.
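The last row of that table can be checked with the Encode module. This is only a small sketch of the round trip; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{2660}";               # one character: BLACK SPADE SUIT
my $bytes = encode('UTF-8', $char);   # encoding: character -> bytes

# Show the encoded bytes in hex.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

my $back = decode('UTF-8', $bytes);   # decoding: bytes -> character

printf "%d char -> %d bytes (%s) -> U+%04X\n",
    length($char), length($bytes), $hex, ord($back);
```

This should print "1 char -> 3 bytes (E2 99 A0) -> U+2660". Both directions map between a code point and a byte sequence; neither decides what the code point means.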

        Although I think your elaboration is flawed.

        It's definitely not accurate. At the same time, anyone can understand my model, and they should be able to use it to successfully distinguish between unicodes and encodings like utf-8, and convert between them. Or they can read a tutorial on unicode and be completely confused, and not be able to write any code at all.

        Decoding is definitely not the process of going from 2660 to black spade suit as you claim.

        Encoding = convert unicode integer to utf-8 character for output

        Decoding = convert utf-8 character to unicode integer for input

        That simple model will allow any unicode beginner to write a lot of code before having to adjust their mental model. For what it's worth, I've never read a single unicode tutorial that will actually allow you to write code.

Re: Search & replace of UTF-8 characters ?
by levien (Initiate) on Feb 26, 2010 at 00:16 UTC

    Thank you for the answer and the explanations! It works as it should now, and I learned a thing or two about perl and UTF-8. :-)

    Indeed I hadn't realised that the "easy" way of doing a search & replace does not use STDIN but ARGV, and also that it does not consider the input files as being UTF-8 encoded by default...

      and also that it does not consider the input files as being UTF-8 encoded by default...

      Perl has no idea what's in the file. It cannot assume the file's content is text encoded with UTF-8. In fact, it cannot assume the file's content is text at all. Unless you tell Perl otherwise, it gives you the file's contents: bytes.
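A small sketch of what "bytes unless you tell Perl otherwise" looks like in practice. It assumes no default layers are in effect (no `use open`, no -C switch, PERL_UNICODE not set); the scratch file is created just for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the three UTF-8 bytes of U+2660 to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "\xE2\x99\xA0";
close($out) or die "Can't close $path: $!\n";

# No layer given: Perl hands back the file's bytes.
open(my $raw, '<', $path) or die "Can't open $path: $!\n";
my $as_bytes = <$raw>;
close($raw);

# With a decoding layer: the same file yields one character.
open(my $dec, '<:encoding(UTF-8)', $path) or die "Can't open $path: $!\n";
my $as_text = <$dec>;
close($dec);

printf "bytes: %d, characters: %d\n", length($as_bytes), length($as_text);
```

Under those assumptions this should print "bytes: 3, characters: 1".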

        perl does assume it is text, that is why you have to binmode
