PerlMonks  

Replacing non ascii in string

by IanD (Initiate)
on Jan 30, 2013 at 04:26 UTC

IanD has asked for the wisdom of the Perl Monks concerning the following question:

OK I have searched and searched here and can't find a solution for this.

I have a string that contains text copied and pasted from Word or some Mac application, and it contains non-ASCII apostrophes and quotes.

e.g. (all in one long string):
Australia’s ‘Powder Capital’ and ... xxx said “This is a fantastic start to the season”
I want to replace all the ‘ and ’ with the ASCII ' and the “ and ” with the ASCII "

I can get rid of them with this:
$data_string =~s/[^[:ascii:]]//g;

But I want to replace them, not remove them, and I can't for the life of me work out the right regex to do this.

Also, what reference do you use for the lookup? I have been using this:
http://www.ascii.cl/htmlcodes.htm

Which could be my problem. As down the track I am sure I will want to replace things like ä with a etc as well.

Thanks.

Replies are listed 'Best First'.
Re: Replacing non ascii in string
by Kenosis (Priest) on Jan 30, 2013 at 05:04 UTC

    Perhaps Text::Unidecode would be helpful (please forgive the lack of complete code formatting, as doing so eliminates displaying the characters the module decodes):

    use strict; use warnings; use utf8; use Text::Unidecode;

    my $string = q/‘ and ’ “ and ” and ä/;

    print unidecode($string);

    Output:

    ' and ' " and " and a

Re: Replacing non ascii in string
by Athanasius (Archbishop) on Jan 30, 2013 at 04:38 UTC

    Hello IanD, and welcome to the Monastery!

    Try this:

    14:33 >perl -wE "my $s = 'Australia’s ‘Powder Capital’'; $s =~ tr/‘’/'/; say $s;"
    Australia's 'Powder Capital'

    14:33 >

    See tr{}{} in Quote and Quote-like Operators.

    Update: Likewise,

    14:42 >perl -wE "my $t = 'and ... xxx said “This is a fantastic start to the season”'; $t =~ tr/“”/\"/; say $t;"
    and ... xxx said "This is a fantastic start to the season"

    14:43 >

    Or combined into one:

    14:46 >perl -wE "my $s = qq[Australia’s ‘Powder Capital’\nand ... xxx said “This is a fantastic start to the season”]; $s =~ tr/‘’“”/''\"\"/; say $s;"
    Australia's 'Powder Capital'
    and ... xxx said "This is a fantastic start to the season"

    14:49 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Thanks, this basically works, but for some reason it returns three ' for every quote and I can't see why.

      i.e.
      $data_file =~ tr/‘’/'/;

      returns
      Australia'''s '''Powder Capital'''

      Even though the input is "Australia’s ‘Powder Capital’"

        I can’t reproduce this problem. Can you provide a complete but minimal script that demonstrates the behaviour you are seeing?

        Please specify input and output precisely, and also show the output from perl -v.

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        Sounds like an encoding problem. Is your string UTF-8, and has it been decoded? (Encode, $decoded_str = decode('utf-8', $str)) Did you use utf8 to get your literals parsed as such?

        (You see, those fancy apostrophes are represented as three bytes. If Perl thinks we're still in ascii-land (binary-land), it sees the transliteration as tr/\xe2\x80\x99/'/ -- effectively changing any of those three bytes to an apostrophe.)
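The byte-versus-character difference described above can be reproduced with a small sketch. It assumes the input arrives as raw UTF-8 bytes; the variable names are illustrative only and do not come from the original poster's script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "Australia’s": U+2019 is encoded as E2 80 99.
my $bytes = "Australia\xE2\x80\x99s";

# Byte-wise transliteration: each of the three bytes becomes an apostrophe.
(my $wrong = $bytes) =~ tr/\xE2\x80\x99/'/;

# Character-wise: decode first, then map the single code points.
my $chars = decode('UTF-8', $bytes);
(my $right = $chars) =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;

print "$wrong\n";   # Australia'''s
print "$right\n";   # Australia's
```

Decoding on input (and encoding again on output if non-ASCII characters remain) keeps tr/// working on whole characters rather than on their individual bytes.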

Re: Replacing non ascii in string
by vinoth.ree (Monsignor) on Jan 30, 2013 at 04:41 UTC

    This will remove the non-ASCII characters and also create a backup of the original file in file.backup.

    You can do the job with a Perl one-liner:   perl -i.backup -pe 's/[[:^ascii:]]//g;' file

    Also, $str =~ s/[^!-~\s]//g; In the above, !-~ is a range that matches every character from ! to ~, which are the first and last printable, non-space characters in the ASCII table. The range does not include whitespace, so I added \s as well.
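As a rough comparison of the two patterns above, the following sketch runs both over a made-up sample string. It assumes the string has already been decoded to characters; the variable names are hypothetical.

```perl
use strict;
use warnings;

# Sample with a tab, curly quotes (U+2018/U+2019) and a BEL control character.
my $input = "tab\there \x{2018}quoted\x{2019} bell\x07";

# Removes only code points above 127; ASCII control characters survive.
(my $ascii_only = $input) =~ s/[[:^ascii:]]//g;

# Keeps printable ASCII plus whitespace; control characters are dropped too.
(my $printable = $input) =~ s/[^!-~\s]//g;

printf "%d -> %d and %d characters\n",
    length $input, length $ascii_only, length $printable;
```

On decoded strings \s can also match non-ASCII whitespace such as U+00A0, so the /a modifier (Perl 5.14 and later) may be worth adding if the result must be strictly ASCII.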

Re: Replacing non ascii in string
by Anonymous Monk on Jan 30, 2013 at 09:15 UTC

    As down the track I am sure I will want to replace things like ä with a etc as well.

    That doesn't sound like a good idea. ä is a letter in some languages, not just "a with diacritic". Writing a name without the "diacritic" could be downright wrong. (Yes, sports broadcasts frequently do it that way, but that doesn't mean you should.) Do you really need to pare all of your text down to 7-bit ASCII?

      While I agree with your general point, it is not an absolute. For some systems I interact with, yes. Things that only support 7-bit ASCII are still out there.

      Update: added 'absolute' comment.

      --MidLifeXis
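For the cases where a 7-bit target really is required, one common approach (not proposed in this thread, so treat it as an assumption) is to decompose each character with the core Unicode::Normalize module and then drop the combining marks. The helper name strip_marks is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: remove diacritics by decomposing, then deleting marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "ä" becomes "a" + U+0308 COMBINING DIAERESIS
    $decomposed =~ s/\p{Mn}//g;    # drop nonspacing marks
    return $decomposed;
}

print strip_marks("J\x{E4}rvinen"), "\n";   # Jarvinen
```

This only handles letters that decompose into a base letter plus a mark; characters such as ø or ß are left untouched, and the caveat above about changing the spelling of names still applies.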

Re: Replacing non ascii in string
by naChoZ (Curate) on Jan 30, 2013 at 19:15 UTC

    Hardly an elegant solution, but my little script will do exactly what you want.

    #!/usr/bin/perl -n
    use strict;
    use warnings;
    use charnames ();
    # "use encoding" is deprecated; this sets UTF-8 on STDIN/STDOUT instead.
    use open qw(:std :encoding(UTF-8));

    $|++;

    my $chars = {
        'HYPHEN'                      => '-',    # \x{2010}
        'SOFT HYPHEN'                 => '-',    # \x{00AD}
        'MINUS SIGN'                  => '-',    # \x{2212}
        'FIGURE DASH'                 => '-',    # \x{2012}
        'ACUTE ACCENT'                => "'",    # \x{00B4}
        'GRAVE ACCENT'                => "'",    # \x{0060}
        'LEFT SINGLE QUOTATION MARK'  => "'",    # \x{2018}
        'RIGHT SINGLE QUOTATION MARK' => "'",    # \x{2019}
        'LEFT DOUBLE QUOTATION MARK'  => '"',    # \x{201C}
        'RIGHT DOUBLE QUOTATION MARK' => '"',    # \x{201D}
        'BOX DRAWINGS LIGHT VERTICAL' => '|',    # \x{2502}
        'MULTIPLICATION SIGN'         => '*',    # \x{00D7}
        'BACKSPACE'                   => '',     # \x{0008}
        'DELETE'                      => '',     # \x{007F}
    };

    # If the first character is an equal sign, skip it and
    # display the identity of each remaining character.
    #
    if (/^=/) {
        chomp;
        for my $index ( 1 .. length($_) - 1 ) {
            my $char = substr( $_, $index, 1 );
            print $char . " "
                . ord($char) . " "
                . sprintf( "\\%03o", ord($char) ) . " "
                . sprintf( "\\x{%04X}", ord($char) )
                . " = '" . charnames::viacode( ord($char) ) . "'\n";
        }
    }
    else {
        # Replace every listed character with its ASCII stand-in.
        for my $cname ( keys %$chars ) {
            my $char = chr( charnames::vianame($cname) );
            s/\Q$char\E/$chars->{$cname}/g;
        }
        print;
    }

    Run this and it sits there waiting for you to paste something into your terminal. Once it receives some input, it looks up each character in the $chars hashref by its Unicode name and replaces it with the corresponding value.

    If you need to identify any Unicode characters because you want to add to the hashref of replacement characters, press the = key first and then paste into your terminal.

    --
    Andy
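The same name-based lookup can also be packaged as a function that scans each string once. This is only a sketch under assumptions: the table is a small subset of the one above, the helper name asciify is hypothetical, and the input is assumed to be already decoded to characters.

```perl
use strict;
use warnings;
use charnames ();

# Subset of the replacement table, keyed by Unicode character name.
my %by_name = (
    'LEFT SINGLE QUOTATION MARK'  => "'",
    'RIGHT SINGLE QUOTATION MARK' => "'",
    'LEFT DOUBLE QUOTATION MARK'  => '"',
    'RIGHT DOUBLE QUOTATION MARK' => '"',
    'MINUS SIGN'                  => '-',
);

# Resolve each name to its character once, up front.
my %by_char =
    map { ( chr( charnames::vianame($_) ) => $by_name{$_} ) } keys %by_name;

# One character class covering every key.
my $class = join '', map { quotemeta } keys %by_char;
my $re    = qr/([$class])/;

# Hypothetical helper: replace every listed character in one pass.
sub asciify {
    my ($text) = @_;
    $text =~ s/$re/$by_char{$1}/g;
    return $text;
}

print asciify("\x{201C}Powder Capital\x{201D} \x{2212} Australia\x{2019}s"), "\n";
```

Building the character class once avoids running a separate substitution per table entry on every line of input.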
