PerlMonks  

Re: Match full utf-8 characters

by afoken (Chancellor)
on Apr 29, 2019 at 12:51 UTC ( [id://1233111]=note )


in reply to Match full utf-8 characters

echo -n 'a…b' | perl -pe 's@(.).(.)@$1$2@';

echo may generate a byte stream representing three Unicode characters, but perl reads it as a byte stream, not as Unicode characters. The "…" (U+2026) is three bytes in UTF-8, so the substitution cuts out a single byte, not a character, and you get back garbage. perl also writes out bytes, not Unicode characters.

Tell perl to treat STDIN and STDOUT as Unicode character streams and everything works as expected:

>echo -n 'a…b' | perl -pe 's@(.).(.)@$1$2@'
a▒▒b
>echo -n 'a…b' | perl -CIO -pe 's@(.).(.)@$1$2@'
ab
>perl -v

This is perl 5, version 22, subversion 2 (v5.22.2) built for x86_64-linux-thread-multi

Copyright 1987-2015, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

>

See also -C in perlrun, and the thread any use of 'use locale'?, especially the subthread Re^3: any use of 'use locale'? (source encoding).

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^2: Match full utf-8 characters
by Allasso (Monk) on Apr 29, 2019 at 13:24 UTC
    Thanks for shedding light on this. I'm actually planning to use this in a script, so I see now I need to have:
    binmode(STDIN, ':utf8'); binmode(STDOUT, ':utf8');
    at the top of my script.
      ...Although, I find this works when I collect user input with the construct:
      my $user_input = <STDIN>;
      but does not work when I collect it with:
      my $stdin = new IO::Handle;
      $stdin->fdopen( fileno( STDIN ), "r" ) || die "Cannot open STDIN";
      while ( my $char = $stdin->getc() ) { }
      and I need to use hippo's suggestion below. In each iteration, $char is a byte. Is there a way to coerce $char to be utf8?

        That's a very convoluted way of achieving something that's already done for you:

        $ perl -E'say STDIN->can("getc")'
        $ perl -MIO::Handle -E'say STDIN->can("getc")'
        CODE(0x55dddd02b150)
        I.e. if IO::Handle is loaded, STDIN is already an object resembling IO::Handle and you can call the getc method on it.

        But for the sake of the exercise, you should be able to pass "<:utf8" instead of "r" to fdopen and have getc return Unicode code points again. (untested)
