PerlMonks  

How to replace all non GSM 7-bit default characters?

by makafre (Initiate)
on Dec 02, 2018 at 16:40 UTC (#1226629=perlquestion)
makafre has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

My goal is to remove from a given string any character that is not in the GSM 7-bit default alphabet. Basically it is a mixture of regular, accented, and Greek letters plus some punctuation. The complete character set is the first table at this URL: https://en.wikipedia.org/wiki/GSM_03.38

The list is:

@$&#916;_&#934;&#915;&#923;&#937;&#928;&#936;&#931;&#920;&#926;Ü§ !"#%&'()*+,-./:;<=>?A-Za-z0-9

Of course this is not working:

 $str =~ s/[^\@\$&#916;_&#934;&#915;&#923;&#937;&#928;&#936;&#931;&#920;&#926; !"#%\&\'\(\)\*\+\,\-\.\/0-9:;<=>\?A-ZÜ§a-z]//g;

I searched for hours for a way to do this, and I am seeking your advice on which direction to take. (it seems that the site replaced some of the above characters with HTML code, sorry for this)

Thank you

Replies are listed 'Best First'.
Re: How to replace all non GSM 7-bit default characters?
by Perlbotics (Chancellor) on Dec 02, 2018 at 18:02 UTC

    What's your input character set? Latin-X? UTF-Y?
    The demo section of GSM::Nbit provides a good hint:

    use Encode qw/encode decode/;
    ...
    # We need to encode it first - for details see:
    # http://www.dreamfabric.com/sms/default_alphabet.html
    my $txt0338 = encode("gsm0338", $txt); # <--- look!
    ...
    I haven't checked it, but Encode seems to have solved your problem already, or can at least do most of the heavy lifting for you. See also: Encode::GSM0338
    The dreamfabric link no longer seems to be available, but the Wayback Machine still remembers it.

      Works perfectly! Thank you!
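
For readers who want to delete the unsupported characters rather than encode the whole string, Perlbotics's hint could be adapted along these lines. This is only a sketch: keep_gsm0338 is a hypothetical helper name that does not appear in the thread, and it assumes that encode() dies on an unmappable character when given Encode::FB_CROAK. Note that Encode's gsm0338 also covers the GSM extension table (for example the euro sign and square brackets), so this keeps slightly more than the basic table alone.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the thread): keep only the characters that
# Encode can map to gsm0338. Each character is tested on its own; FB_CROAK
# makes encode() die on an unmappable one, and eval turns that into false.
sub keep_gsm0338 {
    my ($text) = @_;
    # LEAVE_SRC asks encode() not to modify the string it is given.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return join '', grep {
        my $ch = $_;    # work on a copy of the character
        eval { Encode::encode('gsm0338', $ch, $check); 1 };
    } split //, $text;
}

# The two U+263A characters have no GSM mapping and are dropped.
my $clean = keep_gsm0338("X\x{263A}\x{263A}Z");
```

Testing one character at a time is slow on long strings, but it avoids relying on how a particular Encode version handles fallback callbacks.
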
Re: How to replace all non GSM 7-bit default characters?
by kcott (Chancellor) on Dec 04, 2018 at 07:57 UTC

    G'day makafre,

    Welcome to the Monastery.

    I see ++Perlbotics has provided a solution which you say "Works perfectly".

    It's difficult to see exactly what you did with the s///, due to all the entity references muddying the waters (see below). You could have done this with y///cd (which is typically faster than s///). Using one of my standard aliases:

    $ alias perlu
    alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'

    Here's a short example to show the technique:

    $ perlu 'my $x = "X☺☺Z"; say $x; $x =~ y/XZ//cd; say $x'
    X☺☺Z
    XZ
    

    See perlrun for any command switches that are unfamiliar to you.

    "it seems that the site replaced some of the above characters with HTML code, sorry for this"

    In general, you should put your code, data, and output within <code>...</code> tags: see "Writeup Formatting Tips" for more about that. However, with non-ASCII characters, it's best to use <pre>...</pre> tags, as I have in the example above. I also use <tt>...</tt> tags for displaying such characters in-line (e.g. in a sentence).

    — Ken
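
The original s/[^...]//g idea can also be made to work without any module by spelling the character class as code point escapes, so the encoding of the source file no longer matters. The sketch below is an illustration, not code from the thread: strip_non_gsm7 and the variable names are hypothetical, the list follows the basic character set table of GSM 03.38, and it assumes that LF and CR should be kept while ESC and the extension table are excluded. The input must already be a decoded character string, not raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Allowed characters, written as escapes so the source encoding is irrelevant.
# Assumption: LF and CR are kept; ESC (0x1B) and the extension table are not.
my $GSM7_CLASS = join '',
    '\x{0A}\x{0D}',                             # LF, CR
    '\x{20}-\x{5A}',                            # space, punctuation, digits, @, A-Z
    '_a-z',                                     # underscore, a-z
    '\x{A1}\x{A3}-\x{A5}\x{A7}\x{BF}',          # inverted !, pound, currency, yen, section, inverted ?
    '\x{C4}-\x{C7}\x{C9}\x{D1}\x{D6}\x{D8}',    # accented capitals
    '\x{DC}\x{DF}\x{E0}\x{E4}-\x{E6}',          # U-diaeresis, sharp s, a-grave, ae group
    '\x{E8}\x{E9}\x{EC}\x{F1}\x{F2}',           # e-grave, e-acute, i-grave, n-tilde, o-grave
    '\x{F6}\x{F8}\x{F9}\x{FC}',                 # o-diaeresis, o-slash, u-grave, u-diaeresis
    '\x{393}\x{394}\x{398}\x{39B}\x{39E}',      # Gamma, Delta, Theta, Lambda, Xi
    '\x{3A0}\x{3A3}\x{3A6}\x{3A8}\x{3A9}';      # Pi, Sigma, Phi, Psi, Omega

# Negated class: matches any single character outside the allowed set.
my $NOT_GSM7 = qr/[^${GSM7_CLASS}]/;

# Hypothetical helper: return a copy of the string with every
# character outside the basic character set removed.
sub strip_non_gsm7 {
    my ($str) = @_;
    $str =~ s/$NOT_GSM7//g;
    return $str;
}
```

The same escapes and ranges should also be usable as the search list of y///cd, at the cost of writing the whole list on one line.
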
