Match full utf-8 characters

by Allasso (Monk)
on Apr 29, 2019 at 12:16 UTC (#1233108)

Allasso has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to match full UTF-8 characters, e.g. get the following to output ab, with the ellipsis (e280a6) removed:
echo -n 'a…b' | perl -pe 's@(.).(.)@$1$2@';
At the moment it removes only the first byte of the ellipsis. I have been reading up on this, but can't seem to find a combination that works. I've tried:
echo -n 'a…b' | perl -pe 'use utf8; s@(.).(.)@$1$2@';
echo -n 'a…b' | perl -pe 'utf8::encode($_); s@(.).(.)@$1$2@';
echo -n 'a…b' | perl -pe 'utf8::upgrade($_); s@(.).(.)@$1$2@';
echo -n 'a…b' | perl -pe 'use Encode qw(decode encode); $_ = encode("utf-8", $_); s@(.).(.)@$1$2@';
echo -n 'a…b' | perl -pe 'use Encode qw(decode encode); $_ = encode("utf8", $_); s@(.).(.)@$1$2@';
The only thing I've found so far that works is this, but it is deprecated :-/ (and I believe it applies to the whole script):
echo -n 'a…b' | perl -pe 'use encoding 'utf8', Filter => 1; s@(.).(.)@$1$2@';

Replies are listed 'Best First'.
Re: Match full utf-8 characters
by afoken (Canon) on Apr 29, 2019 at 12:51 UTC
    echo -n 'a…b' | perl -pe 's@(.).(.)@$1$2@';

    echo may generate a byte stream representing three Unicode characters, but Perl reads it as a byte stream, not as Unicode characters. So you are cutting out a byte, not a character, and get back garbage. Also, perl writes out bytes, not Unicode characters.

    Tell perl to treat STDIN and STDOUT as Unicode character streams and everything works as expected:

    >echo -n 'a…b' | perl -pe 's@(.).(.)@$1$2@'
    >echo -n 'a…b' | perl -CIO -pe 's@(.).(.)@$1$2@'
    >perl -v
    This is perl 5, version 22, subversion 2 (v5.22.2) built for x86_64-linux-thread-multi
    Copyright 1987-2015, Larry Wall
    Perl may be copied only under the terms of either the Artistic License or the
    GNU General Public License, which may be found in the Perl 5 source kit.
    Complete documentation for Perl, including FAQ lists, should be found on
    this system using "man perl" or "perldoc perl".  If you have access to the
    Internet, point your browser at http://www.perl.org/, the Perl Home Page.

    See also -C in perlrun, and the thread any use of 'use locale'?, especially the subthread Re^3: any use of 'use locale'? (source encoding).
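
The byte-versus-character difference described above can be made visible without a shell pipe. This is only a minimal sketch: the helper name `drop_middle` is made up for illustration, and the input is spelled out as the five bytes that e280a6 implies for 'a…b'.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the thread's substitution to a copy of the input.
sub drop_middle {
    my ($text) = @_;
    $text =~ s@(.).(.)@$1$2@;
    return $text;
}

# The five bytes the shell sends for 'a…b': 61 E2 80 A6 62.
my $bytes = "a\xE2\x80\xA6b";

# Decode a copy in place into three characters: U+0061 U+2026 U+0062.
my $chars = $bytes;
utf8::decode($chars);

# %vX prints the ordinal of every element of the string in hex, joined by dots.
printf "bytes: length %d, result %vX\n", length($bytes), drop_middle($bytes);
printf "chars: length %d, result %vX\n", length($chars), drop_middle($chars);
# bytes: length 5, result 61.80.A6.62
# chars: length 3, result 61.62
```

On the undecoded string the middle dot eats only the E2 byte, which leaves the stray 80 A6 behind; on the decoded string it eats the whole ellipsis.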


    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      Thanks for shedding light on this. I'm actually planning to use this in a script, so I see now I need to have:
      binmode(STDIN, ':utf8'); binmode(STDOUT, ':utf8');
      at the top of my script.
        ...Although, I find this works when I collect user input with the construct:
        my $user_input = <STDIN>;
        but does not work when I collect it with:
        my $stdin = new IO::Handle;
        $stdin->fdopen( fileno( STDIN ), "r" ) || die "Cannot open STDIN";
        while ( my $char = $stdin->getc() ) { }
        and I need to use hippo's suggestion below. In each iteration, $char is a byte. Is there a way to coerce $char to be utf8?
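
One possible answer to the getc question, offered as a sketch rather than a definitive fix: the handle created by fdopen is a new Perl handle, so a layer set on STDIN may not carry over to it, and the layer can be set on the new handle instead. The helper name `read_chars` is hypothetical, and an in-memory handle stands in for STDIN so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: return the characters read one at a time from an
# already-open handle, after asking Perl to decode that handle as UTF-8.
sub read_chars {
    my ($fh) = @_;
    binmode($fh, ':encoding(UTF-8)') or die "Cannot set layer: $!";
    my @chars;
    # defined() keeps the loop going when the character read is "0".
    while (defined(my $char = getc($fh))) {
        push @chars, $char;
    }
    return @chars;
}

# Stand-in for STDIN (an assumption for this sketch): an in-memory handle
# over the five bytes of 'a…b'.
my $bytes = "a\xE2\x80\xA6b";
open(my $fh, '<', \$bytes) or die "Cannot open in-memory handle: $!";
printf "U+%04X\n", ord($_) for read_chars($fh);
```

With an IO::Handle object the same layer can be requested with `$stdin->binmode(':encoding(UTF-8)')` after the fdopen call.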
Re: Match full utf-8 characters
by hippo (Chancellor) on Apr 29, 2019 at 12:51 UTC
    echo -n 'a…b' | perl -pe 'utf8::encode($_); s@(.).(.)@$1$2@';

    Ah, you were so close.

    echo -n 'a…b' | perl -pe 'utf8::decode ($_); s@(.).(.)@$1$2@';
      Thanks! Counter-intuitive to me, but probably because of my inexperience.

        I thought that too when I first came across it. Here's the way I remember it now: Perl has its own internal way of storing the data - don't worry about what it is, just know that it's some special thing. When your code acquires data (from STDIN or another filehandle or a database or a web request) the data has some encoding. To process it you need to decode it from that encoding into the special internal format, so you use decode. Conversely when sending data out of perl (to STDOUT or a database or a web response or ...) you need to encode it back into what the receiving system expects so you use encode. This is why it helps to think of utf-8, utf-16, Latin-1, etc. as encodings (and this is indeed what they are).

        All the magic shortcuts you might come across such as encoding layers are performing these operations automatically for you but really it's just decode and encode under the bonnet.
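
The decode-on-the-way-in, encode-on-the-way-out rule can be spelled out with the Encode module. This is a minimal sketch under the assumption that the incoming data is UTF-8; the helper name `drop_middle_utf8` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes and returns UTF-8 bytes.
sub drop_middle_utf8 {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes coming in -> characters
    $text =~ s@(.).(.)@$1$2@;             # the regex now sees characters
    return encode('UTF-8', $text);        # characters -> bytes going out
}

# The five bytes of 'a…b' go in, the two bytes of 'ab' come out.
print drop_middle_utf8("a\xE2\x80\xA6b"), "\n";
```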

Re: Match full utf-8 characters
by hdb (Monsignor) on Apr 29, 2019 at 12:38 UTC

    This is a workaround, but it should remove all characters with a code point above 255:

    use strict;
    use warnings;
    use utf8;

    my $str = 'a…b';
    $str =~ s/(.)/ord($1)<256?$1:''/ge;
    print "$str\n";
      This works if $str is given as a literal in the script, but not when it is read from STDIN.
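
A likely reason is that `use utf8` only affects literals in the source code, while data read from a handle arrives as undecoded bytes, each of which is below 256. A sketch of the same filter with an explicit decode step follows; the helper name `keep_below_256` is hypothetical, and a literal byte string stands in for a line read from STDIN.

```perl
use strict;
use warnings;

# Hypothetical helper: takes UTF-8 bytes, drops every character whose
# code point is 256 or higher, and returns UTF-8 bytes.
sub keep_below_256 {
    my ($bytes) = @_;
    my $text = $bytes;
    utf8::decode($text);                        # bytes -> characters
    $text =~ s/(.)/ord($1) < 256 ? $1 : ''/ge;  # the filter from the post above
    utf8::encode($text);                        # characters -> bytes
    return $text;
}

# Stand-in for a line read from STDIN: the five bytes of 'a…b'.
print keep_below_256("a\xE2\x80\xA6b"), "\n";
```

A line obtained with `<STDIN>` could be passed to the helper in the same way, provided no decoding layer has already been set on the handle.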
