PerlMonks  

How to remove other language character from a string

by Anonymous Monk
on Nov 26, 2012 at 05:11 UTC (#1005553, perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I want to remove all characters from other languages from my string and keep only the English alphabet.

e.g : ครัวซองเเซนด์วิชไข่ดาว Croissant Egg Sandwich ครัวซองเเซนด์วิชไข่ดาว

Taking this as an example, my code looks like this:

$image ='ครัวซองเ&#3648;ซนด์วิชไข&#3656;ดาว Croissant Egg Sandwich ครัวซองเเซนด&#3660;วิชไข่ดาว';
$image =~s/\p{Thai}//;
print $image;

But the output is the same string; I want only 'Croissant Egg Sandwich' as output. Please help me out with this.

Re: How to remove other language character from a string
by moritz (Cardinal) on Nov 26, 2012 at 05:27 UTC

    You need to use utf8; to tell Perl that your source file is in UTF-8. That way non-ASCII literal strings work the way you want them to.

    use strict;
    use warnings;
    use 5.010;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';
    
    my $str = "ครัวซองเเซนด์วิชไข่ดาว Croissant Egg Sandwich ครัวซองเเซนด์วิชไข่ดาว";
    $str =~ s/[^\p{Latin}\p{Common}]//g;
    $str =~ s/^\s+|\s+$//g;
    say $str;
    __END__
    Croissant Egg Sandwich
    

    See also: Character Encodings in Perl.

    Updated to unlinkify the brackets, and to exclude \p{Common} instead of \s from removal.
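
The same point applies to text that arrives at runtime rather than as a source literal: use utf8 only covers literals in the script itself. A minimal sketch, assuming the input is UTF-8 encoded bytes; the helper name latin_only_bytes is made up for illustration and is not part of the thread:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (an assumption, not from the thread): takes raw
# UTF-8 bytes and returns only the Latin/Common characters, trimmed,
# re-encoded as UTF-8 bytes.
sub latin_only_bytes {
    my ($bytes) = @_;
    my $str = decode('UTF-8', $bytes);      # bytes -> characters
    $str =~ s/[^\p{Latin}\p{Common}]//g;    # same substitution as in the reply above
    $str =~ s/^\s+|\s+$//g;                 # trim leading/trailing whitespace
    return encode('UTF-8', $str);           # characters -> bytes
}

# No "use utf8" here, so this literal is a plain byte string, much like
# data read from a file or socket without an encoding layer.
# The escaped bytes are three Thai letters encoded as UTF-8.
my $raw = "\xE0\xB8\x84\xE0\xB8\xA3 Croissant Egg Sandwich \xE0\xB8\x94";
print latin_only_bytes($raw), "\n";
```

Without the decode step, \p{Latin} and \p{Thai} would be matched against individual bytes rather than Thai characters, which is consistent with the original code leaving the string unchanged.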

      Thanks moritz, but when I tried this I got output like this:
      α╕α╕α╕▒α╕α╕α╕α╕α╣α╣α╕α╕α╕α╣α╕α╕┤α╕α╣α╕α╣α╕α╕▓α╕ Croissant Egg Sandwich α╕α╕α╕▒α╕α╕α╕α╕α╣α╣α╕α╕α╕α╣α╕α╕┤α╕α╣α╕α╣α╕α╕▓α╕

        That's because it wasn't formatted correctly due to missing code tags (which were presumably left out so that the input text would be shown properly). When I first ran moritz's code, I just got the original string, but when I substituted:

        $str =~ s/[^\p{Latin}\s]//g;

        for this:

        $str =~ s/^\p{Latin}\s//g;

        it worked.

        EDIT: If you have lots of extra spaces in your output, you could run it through $str =~ s/ {2,}/ /g;, too. Something to keep in mind is that moritz's approach (as is) will remove punctuation.
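
The two points in the EDIT above (extra spaces, lost punctuation) can be seen together in a short sketch. The helper name latin_words and the sample string are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: keep Latin letters and
# whitespace only, then tidy up the spacing.
sub latin_words {
    my ($str) = @_;
    $str =~ s/[^\p{Latin}\s]//g;   # drop everything except Latin letters and whitespace
    $str =~ s/ {2,}/ /g;           # squeeze runs of spaces into one
    $str =~ s/^\s+|\s+$//g;        # trim both ends
    return $str;
}

# \x{0E04}, \x{0E23} and \x{0E14} are Thai letters
print latin_words("\x{0E04}\x{0E23} Ham, egg & cheese!  \x{0E14}"), "\n";
# prints "Ham egg cheese": the comma, ampersand and exclamation mark are gone
```

If punctuation should survive, the \p{Common} variant of the character class keeps it, at the cost of also keeping digits and symbols.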

Re: How to remove other language character from a string
by grondilu (Pilgrim) on Nov 26, 2012 at 05:50 UTC

    Why not simply use s/[^a-zA-Z ]//gr ?
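
A small sketch of that one-liner, assuming Perl 5.14 or later for the /r flag; the wrapper name ascii_letters and the sample string are made up for illustration:

```perl
use strict;
use warnings;
use 5.014;   # the /r flag on s/// needs Perl 5.14 or later

# Hypothetical wrapper around the one-liner above.
sub ascii_letters {
    my ($str) = @_;
    # /r leaves $str untouched and returns the modified copy
    return $str =~ s/[^a-zA-Z ]//gr;
}

# \x{0E04}, \x{0E23} and \x{0E14} are Thai letters
my $in  = "\x{0E04}\x{0E23} Croissant Egg Sandwich No. 2 \x{0E14}";
my $out = ascii_letters($in);
say "[$out]";   # brackets make the leftover spaces visible
```

Note that this character class also drops digits, punctuation and accented Latin letters such as é, and it leaves the surrounding spaces in place, so a trim step may still be wanted.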

Re: How to remove other language character from a string
by Tux (Monsignor) on Nov 26, 2012 at 07:00 UTC

    If what I read is actually your code, you'll need HTML::Entities and a /g modifier to s///:

    $ cat test.pl
    use 5.14.1;
    use warnings;
    use HTML::Entities;
    my $image;
    $image ='ครัวซองเ&#3648;ซนด์วิชไข&#3656;ดาว Croissant Egg Sandwich ครัวซองเเซนด&#3660;วิชไข่ดาว';
    $image = decode_entities ($image);
    $image =~ s/\p{Thai}//g;
    print $image;
    $ perl test.pl
     Croissant Egg Sandwich $

    Enjoy, Have FUN! H.Merijn
Re: How to remove other language character from a string
by kcott (Abbot) on Nov 26, 2012 at 08:29 UTC

    You could use the transliteration operator (y/// or tr///) - see Quote-Like Operators.

    $ perl -Mstrict -Mwarnings -E '
    my $x = q{ครัวซองเเซนด์วิชไข่ดาว Croissant Egg Sandwich ครัวซองเเซนด์วิชไข่ดาว};
    $x =~ y/ -~//cd;
    say $x;
    '
     Croissant Egg Sandwich 
    

    -- Ken
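
For readers unfamiliar with the flags: c complements the search list (everything that is not in the range from space to tilde) and d deletes what was matched. A minimal sketch, with a hypothetical helper name strip_non_ascii, that also uses the return value of tr, which is the number of characters matched:

```perl
use strict;
use warnings;

# Hypothetical helper: delete every character outside the printable
# ASCII range (space through tilde) and report how many were removed.
sub strip_non_ascii {
    my ($str) = @_;
    # c = complement the search list, d = delete matched characters;
    # tr returns the number of characters it matched
    my $removed = ($str =~ tr/ -~//cd);
    return ($str, $removed);
}

# \x{0E04}, \x{0E23} and \x{0E14} are Thai letters
my ($clean, $n) = strip_non_ascii("\x{0E04}\x{0E23} Croissant Egg Sandwich \x{0E14}");
print "[$clean] ($n removed)\n";
# prints "[ Croissant Egg Sandwich ] (3 removed)"
```

Because the kept range starts at the space character, tabs and newlines are deleted too, and the leading and trailing spaces remain, as in the output above.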

Re: How to remove other language character from a string
by Anonymous Monk on Nov 26, 2012 at 09:54 UTC
    Thanks everyone. All of your support was very helpful, and all of your methods worked, but I chose moritz's and frzeon's suggestion because I found it more straightforward and easier to understand.
