http://www.perlmonks.org?node_id=1226629

makafre has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

My goal is to remove from a given string every character that is not in the GSM 7-bit default alphabet. Basically, that alphabet is a mixture of regular, accented, and Greek letters plus some punctuation. The complete character set is the first table at https://en.wikipedia.org/wiki/GSM_03.38.

The list is:

@£$¥èéùìòÇØøÅå&#916;_&#934;&#915;&#923;&#937;&#928;&#936;&#931;&#920;&#926;ÆæßÄÖÑÜ§¿äöñüàÉ !"#¤%&'()*+,-./:;<=>?¡A-Za-z0-9

Of course this is not working:

 $str =~ s/[^\@£\$¥èéùìòÇØøÅå&#916;_&#934;&#915;&#923;&#937;&#928;&#936;&#931;&#920;&#926;ÆæßÉ !"#¤%\&\'\(\)\*\+\,\-\.\/0-9:;<=>\?¡A-ZÄÖÑÜ§¿a-zäöñüà]//g;

I searched for hours for a way to do this, and I am seeking your advice on which direction to take. (it seems that the site replaced some of the above characters with HTML code, sorry for this)

Thank you

Re: How to replace all non GSM 7-bit default characters?
by Perlbotics (Archbishop) on Dec 02, 2018 at 18:02 UTC

    What's your input character set? Latin-X? UTF-Y?
    The demo section of GSM::Nbit provides a good hint:

    use Encode qw/encode decode/;
    ...
    # We need to encode it first - for details see:
    # http://www.dreamfabric.com/sms/default_alphabet.html
    my $txt0338 = encode("gsm0338", $txt); # <--- look!
    ...
    I haven't checked it, but Encode seems to have solved your problem already, or can at least do most of the heavy lifting for you. See also: Encode::GSM0338
    The dreamfabric link seems to be no longer available, but the Wayback Machine still remembers it.
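
    One way the Encode hint could be turned into a filter is sketched here. It is only a sketch under assumptions: the sub names is_gsm_char and strip_non_gsm_via_encode are made up for illustration, the input is assumed to be already decoded into Perl characters, and each character is tested separately by asking encode() to die (FB_CROAK) when it has no gsm0338 mapping. Note that the gsm0338 encoding may also accept characters from the GSM extension table (such as the euro sign), so this is not necessarily limited to the default alphabet alone.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# Hypothetical helper: true if the single character $ch can be
# encoded as gsm0338, false if encode() dies for it.
sub is_gsm_char {
    my ($ch) = @_;
    my $copy = $ch;    # encode() may modify its argument when CHECK is set
    return eval { encode('gsm0338', $copy, FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: keep only the characters that gsm0338 can encode.
sub strip_non_gsm_via_encode {
    my ($str) = @_;
    return join '', grep { is_gsm_char($_) } split //, $str;
}
```

    Testing one character at a time is slow for long strings, but it avoids relying on how a particular Encode version handles unmappable characters in bulk.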

      Works perfectly ! Thank you !
Re: How to replace all non GSM 7-bit default characters?
by kcott (Archbishop) on Dec 04, 2018 at 07:57 UTC

    G'day makafre,

    Welcome to the Monastery.

    I see ++Perlbotics has provided a solution which you say "Works perfectly".

    It's difficult to see exactly what you did with the s///, due to all the entity references muddying the waters (see below). You could have done this with y///cd (which is typically faster than s///). Using one of my standard aliases:

    $ alias perlu
    alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'

    Here's a short example to show the technique:

    $ perlu 'my $x = "X☺¥☺æZ"; say $x; $x =~ y/X¥æZ//cd; say $x'
    X☺¥☺æZ
    X¥æZ
    

    See perlrun for any command switches that are unfamiliar to you.
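
    Scaled up to the character list from the original question, the same y///cd technique might look like the sketch below. The sub name strip_non_gsm is hypothetical, the source file is assumed to be saved as UTF-8 (hence use utf8), and the input is assumed to be a decoded character string. The list follows the one in the question; the GSM table also contains LF and CR, which could be added as \n and \r if line breaks should survive.

```perl
use strict;
use warnings;
use utf8;    # the character list below is written as literal UTF-8

# Hypothetical helper: delete every character that is NOT in the list.
# c = complement the search list, d = delete the matched characters.
# tr/// does not interpolate, so $ and @ are literal here.
sub strip_non_gsm {
    my ($str) = @_;
    $str =~ tr{@£$¥èéùìòÇØøÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !"#¤%&'()*+,\-./0-9:;<=>?¡A-ZÄÖÑÜ§¿a-zäöñüà}{}cd;
    return $str;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_non_gsm("X☺¥☺æZ"), "\n";    # prints: X¥æZ
```

    Because the tr/// list is fixed at compile time, it has to be spelled out literally like this; it cannot be built from a variable without a string eval.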

    "it seems that the site replaced some of the above characters with HTML code, sorry for this"

    In general, you should put your code, data, and output within <code>...</code> tags: see "Writeup Formatting Tips" for more about that. However, with non-ASCII characters, it's best to use <pre>...</pre> tags, as I have in the example above. I also use <tt>...</tt> tags for displaying such characters in-line (e.g. in a sentence).

    — Ken