Re: How to remove other language character from a string

by moritz (Cardinal)
on Nov 26, 2012 at 05:27 UTC

in reply to How to remove other language character from a string

You need to add use utf8; to tell Perl that your source file is encoded in UTF-8. That way, literal strings containing non-ASCII characters are decoded as characters and work the way you want them to.

use strict;
use warnings;
use 5.010;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my $str = "ครัวซองเเซนด์วิชไข่ดาว Croissant Egg Sandwich ครัวซองเเซนด์วิชไข่ดาว";
$str =~ s/[^\p{Latin}\p{Common}]//g;
$str =~ s/^\s+|\s+$//g;
say $str;

Output:
Croissant Egg Sandwich

See also: Character Encodings in Perl.

Update: un-linkified the brackets in the character class, and changed the regex to spare \p{Common} instead of \s from removal.
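To illustrate what the switch from \s to \p{Common} changes, here is a minimal sketch. The sample string and variable names are made up for the illustration and are not from the original question. With \s, only whitespace survives alongside the Latin letters; with \p{Common}, punctuation and digits survive as well.

```perl
use strict;
use warnings;
use 5.010;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample input: Thai text around an English phrase
# that contains punctuation and a digit.
my $input = "ไข่ดาว Egg & Cheese (No. 2)! ไข่ดาว";

# Variant 1: keep only Latin letters and whitespace.
(my $letters_only = $input) =~ s/[^\p{Latin}\s]//g;

# Variant 2: keep Latin letters plus everything in the Common script
# (spaces, punctuation, digits).
(my $with_common = $input) =~ s/[^\p{Latin}\p{Common}]//g;

# Trim leading and trailing whitespace from both results.
s/^\s+|\s+$//g for $letters_only, $with_common;

say $letters_only;   # Egg  Cheese No
say $with_common;    # Egg & Cheese (No. 2)!
```

Which variant is appropriate depends on whether punctuation and digits should be kept in the cleaned-up string.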

Re^2: How to remove other language character from a string
by Anonymous Monk on Nov 26, 2012 at 05:36 UTC
    Thanks moritz, but when I tried this I got the output like this:
    α╕α╕α╕▒α╕α╕α╕α╕α╣α╣α╕α╕α╕α╣α╕α╕┤α╕α╣α╕α╣α╕α╕▓α╕ Croissant Egg Sandwich α╕α╕α╕▒α╕α╕α╕α╕α╣α╣α╕α╕α╕α╣α╕α╕┤α╕α╣α╕α╣α╕α╕▓α╕

      That's because the code wasn't formatted correctly due to missing code tags (which were presumably left out so that the input text would be shown properly). When I first ran moritz's code, I just got the original string back, but when I replaced this:

      $str =~ s/^\p{Latin}\s//g;

      with this:

      $str =~ s/[^\p{Latin}\s]//g;

      it worked.

      EDIT: If you have lots of extra spaces in your output, you could run it through $str =~ s/ {2,}/ /g;, too. Something to keep in mind is that moritz's approach (as is) will remove punctuation.
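
The space-squeezing step mentioned above can be sketched as follows. The input string is a hypothetical result of the removal step, not output taken from this thread.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical result of the removal step, with gaps left behind
# where non-Latin characters were deleted.
my $str = "Croissant   Egg  Sandwich";

# Collapse every run of two or more spaces into a single space.
$str =~ s/ {2,}/ /g;

say $str;   # Croissant Egg Sandwich
```

This only touches literal space characters. If tabs or newlines might also appear in the data, a pattern based on \s would probably be the safer choice.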

        It worked smoothly. Thanks Frozenwithjoy and moritz.


