Replacing non ascii in string

IanD has asked for the wisdom of the Perl Monks concerning the following question:


Re: Replacing non ascii in string
by Kenosis (Priest) on Jan 30, 2013 at 05:04 UTC

Perhaps Text::Unidecode would be helpful (please forgive the lack of complete code formatting, as doing so eliminates displaying the characters the module decodes):

use strict;
use warnings;
use utf8;
use Text::Unidecode;

my $string = q/‘ and ’ “ and ” and ä/;


print unidecode($string);

Output:

' and ' " and " and a


Re: Replacing non ascii in string
by Athanasius (Archbishop) on Jan 30, 2013 at 04:38 UTC

Hello IanD, and welcome to the Monastery!

Try this:

14:33 >perl -wE "my $s = 'Australia’s ‘Powder Capital’'; $s =~ tr/‘’/'/; say $s;"
Australia's 'Powder Capital'

14:33 >

See tr{}{} in Quote and Quote like Operators.

Update: Likewise,

14:42 >perl -wE "my $t = 'and ... xxx said “This is a fantastic start to the season”'; $t =~ tr/“”/\"/; say $t;"
and ... xxx said "This is a fantastic start to the season"

14:43 >

Or combined into one:

14:46 >perl -wE "my $s = qq[Australia’s ‘Powder Capital’\nand ... xxx said “This is a fantastic start to the season”]; $s =~ tr/‘’“”/''\"\"/; say $s;"
Australia's 'Powder Capital'
and ... xxx said "This is a fantastic start to the season"

14:49 >
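For use in a script file rather than on the command line, the combined transliteration might look like the sketch below. The name straighten_quotes is hypothetical, and the sketch assumes the input is already a decoded character string. Writing the quotes as \x{...} escapes keeps the source file pure ASCII, so the result does not depend on the encoding the script is saved in.

```perl
use strict;
use warnings;

# Hypothetical helper: map curly quotes to their ASCII equivalents.
# Assumes $text is a character string (already decoded), not raw UTF-8 bytes.
sub straighten_quotes {
    my ($text) = @_;
    # U+2018/U+2019 are the single curly quotes, U+201C/U+201D the double ones.
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

my $s = "Australia\x{2019}s \x{2018}Powder Capital\x{2019}";
print straighten_quotes($s), "\n";    # Australia's 'Powder Capital'
```

Each character in the search list is replaced by the character at the same position in the replacement list, which is why the two single quotes and the two double quotes are each written twice.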

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Replacing non ascii in string

by IanD (Initiate) on Jan 31, 2013 at 03:19 UTC

Thanks, this does basically work but for some reason it is returning 3 x ' which I can't see why but it is working.

$data_file =~ tr/‘’/'/;

Even though the input is "Australia’s ‘Powder Capital’"


Re^3: Replacing non ascii in string

by Athanasius (Archbishop) on Jan 31, 2013 at 04:10 UTC

I can’t reproduce this problem. Can you provide a complete but minimal script that demonstrates the behaviour you are seeing?

Please specify input and output precisely, and also show the output from perl -v.

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^4: Replacing non ascii in string

by IanD (Initiate) on Jan 31, 2013 at 05:21 UTC

Re^3: Replacing non ascii in string

by Anonymous Monk on Jan 31, 2013 at 08:41 UTC

Sounds like an encoding problem. Is your string utf-8? (Encode, $decoded_str = decode('utf-8', $str)) Did you use utf8 to get your literals parsed as such?

(You see, those fancy apostrophes are represented as three bytes. If Perl thinks we're still in ascii-land (binary-land), it sees the transliteration as tr/\xe2\x80\x99/'/ -- effectively changing any of those three bytes to an apostrophe.)
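A small sketch of that effect, assuming the input arrives as undecoded UTF-8 bytes. The helper names fix_bytes and fix_chars are made up for the illustration. The first runs tr on the raw bytes; the second decodes with Encode first, so that tr sees a single character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes for "Australia’s": U+2019 is stored as E2 80 99.
my $bytes = "Australia\xE2\x80\x99s";

# Hypothetical helper: tr applied to the raw, undecoded bytes.
sub fix_bytes {
    my ($s) = @_;
    $s =~ tr/\xE2\x80\x99/'/;    # each of the three bytes becomes an apostrophe
    return $s;
}

# Hypothetical helper: decode first, transliterate, then encode again.
sub fix_chars {
    my ($s) = @_;
    my $chars = decode('UTF-8', $s);
    $chars =~ tr/\x{2018}\x{2019}/'/;    # one character in, one character out
    return encode('UTF-8', $chars);
}

print fix_bytes($bytes), "\n";    # Australia'''s
print fix_chars($bytes), "\n";    # Australia's
```

The three apostrophes in the first result would match the "3 x '" described above, if the data is indeed being read as bytes.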


Re^4: Replacing non ascii in string

by IanD (Initiate) on Feb 05, 2013 at 05:54 UTC

Re: Replacing non ascii in string
by vinoth.ree (Monsignor) on Jan 30, 2013 at 04:41 UTC

You can do the job with a Perl one-liner: perl -i.backup -pe 's/[[:^ascii:]]//g;' file

This removes the non-ASCII characters from file in place and keeps the original as file.backup.

Also, $str =~ s/[^!-~\s]//g; In the above, !-~ is a range that matches every character between ! and ~. The range runs from ! to ~ because these are the first and last printable (non-space) characters in the ASCII table. It does not include whitespace, so I added \s as well.
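To compare the two substitutions side by side, here is a rough sketch with each one wrapped in a sub. The names strip_non_ascii and strip_non_printable are hypothetical, and the input is assumed to be a decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every character outside the ASCII range 0x00-0x7F.
sub strip_non_ascii {
    my ($text) = @_;
    $text =~ s/[[:^ascii:]]//g;
    return $text;
}

# Hypothetical helper: keep only ! through ~ plus whatever \s matches.
# Unlike the first, this also removes ASCII control characters such as \x07.
# Note that \s may also match non-ASCII whitespace (e.g. U+00A0) on
# character strings, so such characters could survive.
sub strip_non_printable {
    my ($text) = @_;
    $text =~ s/[^!-~\s]//g;
    return $text;
}

print strip_non_ascii("na\x{EF}ve caf\x{E9}"), "\n";    # nave caf
print strip_non_printable("bell\x07 gone"), "\n";       # bell gone
```

Both simply delete what they match. Neither replaces a curly quote with a straight one.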


Re: Replacing non ascii in string
by Anonymous Monk on Jan 30, 2013 at 09:15 UTC

As down the track I am sure I will want to replace things like ä with a etc as well.

That doesn't sound like a good idea. ä is a letter in some languages, not just "a with diacritic". Writing a name without the "diacritic" could be downright wrong. (Yes, sports broadcasts frequently do it that way, but that doesn't mean you should.) Do you really need to pare all of your text down to 7-bit ASCII?


Re^2: Replacing non ascii in string

by MidLifeXis (Monsignor) on Jan 30, 2013 at 13:24 UTC

While I agree with your general point, it is not an absolute. For some systems I interact with, yes. Things that only support 7-bit ASCII are still out there.

Update: added 'absolute' comment.

--MidLifeXis


Re: Replacing non ascii in string
by naChoZ (Curate) on Jan 30, 2013 at 19:15 UTC

Hardly an elegant solution, but my little script will do exactly what you want.

#!/usr/bin/perl -n

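use strict;
use warnings;
use charnames ();
# "use encoding" is deprecated and unsupported on newer Perls;
# this sets UTF-8 on STDIN/STDOUT instead.
use open qw(:std :encoding(UTF-8));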

$|++;

my $chars = {
    'HYPHEN'                      => '-',    #  \x{2010}
    'SOFT HYPHEN'                 => '-',    # \x{00AD}
    'MINUS SIGN'                  => '-',    #  \x{2212}
    'FIGURE DASH'                 => '-',    #  \x{2012}
    'ACUTE ACCENT'                => "'",    #  \x{00B4}
    'GRAVE ACCENT'                => "'",    #  \x{0060}
    'LEFT SINGLE QUOTATION MARK'  => "'",    #  \x{2018}
    'RIGHT SINGLE QUOTATION MARK' => "'",    #  \x{2019}
    'LEFT DOUBLE QUOTATION MARK'  => '"',    #  \x{201C}
    'RIGHT DOUBLE QUOTATION MARK' => '"',    #  \x{201D}
    'BOX DRAWINGS LIGHT VERTICAL' => '|',    #  \x{2502}
    'MULTIPLICATION SIGN'         => '*',    #  \x{00D7}
    'BACKSPACE'                   => '',     #  \x{0008}
    'DELETE'                      => '',     #  \x{007F}
};


# If the first character is an equal sign, skip it and
# display the identity of each remaining character.
#
if (/^=/) {
    chomp;
    for my $index ( 1 .. length($_) - 1 ) {

        my $char = substr( $_, $index, 1 );

        print $char . " "
                    . ord( $char )
                    . " "
                    . sprintf( "\\%03o", ord($char) )
                    . " "
                    . sprintf( "\\x{%04X}", ord($char) )
                    . " = '"
                    . charnames::viacode( ord($char) )
                    . "'\n" ;

    }

} else {

    for my $cname ( keys %$chars ) {

        my $char = chr( charnames::vianame($cname) );
        s/\Q$char\E/$chars->{$cname}/g;

    }

    print;

}

Run this and it sits there waiting for you to paste something into your terminal. For each line of input it receives, it looks up every Unicode character name in the $chars hashref and replaces each occurrence of that character with the mapped value.

If you need to identify any unicode characters because you want to add to the hashref of replacement characters, press the = key first and then paste to your terminal.

--
Andy


