PerlMonks  

How to remove other language character from a string

by Anonymous Monk
on Nov 26, 2012 at 05:11 UTC ( #1005553=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I want to remove all characters from other languages from my sentence and keep only the English alphabet.

e.g : ครัวซองเเซนด์วิชไข่ดาว Croissant Egg Sandwich ครัวซองเเซนด์วิชไข่ดาว

Taking this as an example, my code looks like this:

    $image = 'ครัวซองเ&#3648;ซนด์วิชไข&#3656;ดาว Croissant Egg Sandwich ครัวซองเเซนด&#3660;วิชไข่ดาว';
    $image =~ s/\p{Thai}//;
    print $image;

But the output is the same string. I want only 'Croissant Egg Sandwich' as output. Please help me out with this.

Re: How to remove other language character from a string
by moritz (Cardinal) on Nov 26, 2012 at 05:27 UTC

    You need to use utf8; to tell Perl that your source file is in UTF-8. That way non-ASCII literal strings work the way you want them to.

    use strict;
    use warnings;
    use 5.010;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';
    
    my $str = "ครัวซองเเซนด์วิชไข่ดาว Croissant Egg Sandwich ครัวซองเเซนด์วิชไข่ดาว";
    $str =~ s/[^\p{Latin}\p{Common}]//g;
    $str =~ s/^\s+|\s+$//g;
    say $str;
    __END__
    Croissant Egg Sandwich
    

    See also: Character Encodings in Perl.

    Updated to unlinkify the brackets, and to exclude \p{Common} instead of \s from removal.
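
To see why `use utf8;` matters here, the following is a minimal sketch of the difference between the raw UTF-8 bytes of a string and its decoded characters. It uses the core Encode module to simulate what Perl sees for a non-ASCII literal when `use utf8;` is missing. The variable names are illustrative and not taken from the post above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three Thai characters followed by ASCII text, first as UTF-8 bytes
# (roughly what a literal looks like without "use utf8"), then decoded.
my $bytes = encode('UTF-8', "\x{0E44}\x{0E02}\x{0E48} Egg");
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length $bytes, length $chars;

# \p{Thai} only matches decoded characters, never the individual bytes.
(my $from_bytes = $bytes) =~ s/\p{Thai}//g;
(my $from_chars = $chars) =~ s/\p{Thai}//g;

print "byte string changed: ", ($from_bytes eq $bytes ? "no" : "yes"), "\n";
print "character string is now: '$from_chars'\n";
```

On the byte string the substitution is a no-op, which matches the symptom in the original question.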

      Thanks moritz, but when I tried this I got the output like this:
      α╕α╕α╕▒α╕α╕α╕α╕α╣α╣α╕α╕α╕α╣α╕α╕┤α╕α╣α╕α╣α╕α╕▓α╕ Croissant Egg Sandwich α╕α╕α╕▒α╕α╕α╕α╕α╣α╣α╕α╕α╕α╣α╕α╕┤α╕α╣α╕α╣α╕α╕▓α╕

        That's because the code in the post wasn't formatted correctly due to missing code tags (which were presumably left out so that the input text would be shown properly). When I first ran moritz's code, I just got the original string, but when I substituted:

        $str =~ s/[^\p{Latin}\s]//g;

        for this:

        $str =~ s/^\p{Latin}\s//g;

        it worked.

        EDIT: If you have lots of extra spaces in your output, you could run it through $str =~ s/ {2,}/ /g;, too. Something to keep in mind is that moritz's approach (as is) will remove punctuation.
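
As a rough illustration of the punctuation point, the sketch below runs both character classes from this thread over the same input and then applies the trimming and space-collapsing substitutions mentioned above. The sample string and variable names are made up for the example.

```perl
use strict;
use warnings;
use utf8;

my $input = "ไข่ดาว Egg, Sandwich (2 pcs.)!   ไข่ดาว";

# Keep Latin letters and whitespace only: digits and punctuation go too.
(my $letters_only = $input) =~ s/[^\p{Latin}\s]//g;

# Keep Latin letters plus script-neutral characters
# (digits, punctuation, spaces).
(my $with_common = $input) =~ s/[^\p{Latin}\p{Common}]//g;

for ($letters_only, $with_common) {
    s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    s/ {2,}/ /g;       # collapse runs of spaces
}

print "$letters_only\n";    # Egg Sandwich pcs
print "$with_common\n";     # Egg, Sandwich (2 pcs.)!
```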

Re: How to remove other language character from a string
by grondilu (Pilgrim) on Nov 26, 2012 at 05:50 UTC

    Why not simply use s/[^a-zA-Z ]//gr ?
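
A short sketch of what this suggestion does, assuming Perl 5.14 or later for the `/r` modifier (which returns the modified copy and leaves the original alone). It also shows one trade-off: an ASCII-only class drops accented Latin letters, whereas `\p{Latin}` keeps them. The sample text is invented for the example.

```perl
use strict;
use warnings;
use 5.014;    # the /r modifier needs Perl 5.14 or later
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# \x{E9} and \x{E8} are e-acute and e-grave, written as escapes
# so the example does not depend on how the file was normalized.
my $menu = "ครัวซอง Caf\x{E9} Cr\x{E8}me Croissant";

my $ascii_only = $menu =~ s/[^a-zA-Z ]//gr;      # $menu is untouched
my $any_latin  = $menu =~ s/[^\p{Latin} ]//gr;

say "[$ascii_only]";    # [ Caf Crme Croissant]
say "[$any_latin]";     # keeps the accented letters
```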

Re: How to remove other language character from a string
by Tux (Monsignor) on Nov 26, 2012 at 07:00 UTC

    If what I read is actually your code, you'll need HTML::Entities and a /g modifier to s///:

    $ cat test.pl
    use 5.14.1;
    use warnings;
    use HTML::Entities;

    my $image;
    $image = 'ครัวซองเ&#3648;ซนด์วิชไข&#3656;ดาว Croissant Egg Sandwich ครัวซองเเซนด&#3660;วิชไข่ดาว';
    $image = decode_entities ($image);
    $image =~ s/\p{Thai}//g;
    print $image;
    $ perl test.pl
     Croissant Egg Sandwich $

    Enjoy, Have FUN! H.Merijn
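
If HTML::Entities is not available, the same idea can be sketched for numeric character references only, using a plain substitution. This is a swapped-in technique, not what the post above uses: `decode_numeric_entities` is a hypothetical helper, it does not handle named entities such as `&amp;`, and the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes decimal (&#3652;) and hexadecimal (&#xE14;)
# character references only. HTML::Entities covers far more cases.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr hex $1/ge;
    $text =~ s/&#(\d+);/chr $1/ge;
    return $text;
}

my $image = '&#3652;&#3586;&#3656; Croissant Egg Sandwich &#xE14;&#xE32;&#xE27;';
$image = decode_numeric_entities($image);

$image =~ s/\p{Thai}//g;       # note the /g: remove every Thai character
$image =~ s/^\s+|\s+$//g;      # trim the spaces left behind
print "$image\n";              # Croissant Egg Sandwich
```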
Re: How to remove other language character from a string
by kcott (Abbot) on Nov 26, 2012 at 08:29 UTC

    You could use the transliteration operator (y/// or tr///) - see Quote-Like Operators.

    $ perl -Mstrict -Mwarnings -E '
    my $x = q{ครัวซองเเซนด์วิชไข่ดาว Croissant Egg Sandwich ครัวซองเเซนด์วิชไข่ดาว};
    $x =~ y/ -~//cd;
    say $x;
    '
     Croissant Egg Sandwich 
    

    -- Ken
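
For readers unfamiliar with the `c` and `d` modifiers, here is a small annotated sketch of the same transliteration. It adds `use utf8;` so that the return value counts characters rather than bytes, which is an assumption beyond the one-liner above. The shortened sample string is illustrative.

```perl
use strict;
use warnings;
use utf8;

my $x = "ไข่ดาว Croissant Egg Sandwich ไข่ดาว";

# " -~" is the range from space to tilde, i.e. printable ASCII.
# c: complement the search list (match everything NOT in that range)
# d: delete matched characters, since the replacement list is empty
# tr/// returns the number of characters it matched.
my $removed = ($x =~ tr/ -~//cd);

$x =~ s/^\s+|\s+$//g;    # trim the spaces left at both ends
print "removed $removed characters, left with '$x'\n";
```

Because `tr///` works on fixed character ranges, this approach also deletes accented Latin letters, not just Thai.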

Re: How to remove other language character from a string
by Anonymous Monk on Nov 26, 2012 at 09:54 UTC
    Thanks, everyone. All of your support was very helpful, and all of your methods worked, but I went with moritz and frzeon's suggestion, as I found it more genuine and easier to understand.
