http://www.perlmonks.org?node_id=481164

juo has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am looking for a way to replace any special character symbols with something else, but I found that I even have trouble typing some of them in my text editor; some symbols can exist in Excel, but if I copy/paste them they get changed into something else. The characters could be anything, from any language also. For example:

% : Per

¡Ü : st

¡À : +/-

¦¸ : Ohm

¶þ : 2

¼« : not bad

¹Ü : good

Is there a way to get, for each symbol/character, a reference code that can be used in scripting, and then look in a file for that code and replace it with something else? I remember that each character is linked to a table and that those table references can be used. I also notice that when I paste them into my edit screen they look good, but they get distorted when previewed.

Replies are listed 'Best First'.
Re: replacing special characters in file
by Joost (Canon) on Aug 05, 2005 at 10:28 UTC
    Perl used to allow only ASCII in identifiers; in recent perls you can "use utf8" to indicate that you want to be able to use Unicode identifiers.

    String literals in perl can be in any encoding, but how they will look in your editor is dependent on the editor (and possibly on markings in the file). Most modern editors support UTF-8 unicode encoding, and I'd guess that most win32 editors support the win32 specific encodings too.

    Input & output can be in many encodings, but in most cases, you need to specify what encoding you're going to use.

    See perluniintro, binmode, utf8 and Encode.
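
    A minimal sketch of that Encode-based workflow (the 'cp1252' encoding here is an assumption -- substitute whatever encoding your editor or Excel actually produced):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assume the raw input bytes are CP1252 (a guess -- check your source).
# decode() turns bytes into Perl's internal character strings, so regex
# work happens on characters; encode() turns them back into bytes.
my $bytes = "r\xe9sum\xe9";               # "resume" with accented e's, as bytes
my $chars = decode('cp1252', $bytes);     # now a 6-character string
my $utf8  = encode('UTF-8', $chars);      # 8 bytes of UTF-8 output
```

    Once the data is decoded, any of the substitution recipes in the other replies operate on real characters rather than raw bytes.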

      I think Joost and polypompholyx are on the right track for the answer to this question, but it sounds like the OP is wrestling with character-set challenges beyond just Perl, based on the description of "cutting/pasting" and "previewing". In essence, one needs Excel, the text editor, and the console (if that's where things are being printed) to all be "speaking" the same character encoding (and to have a font that supports it, too) -- or at least one needs to take into account the differences between all these variables. Even if one gets the encodings right in Perl, the results could still look strange when viewed in some other program (console or editor). The OP will need to do more research into the Unicode/character-encoding support of all the other links in the chain.

      -xdg

      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: replacing special characters in file
by polypompholyx (Chaplain) on Aug 05, 2005 at 10:38 UTC

    It depends on exactly what you want to do, but if you just want to strip out certain characters, you can do:

    s/[^\w]//g; # strip everything but 'word' characters

    s/[^[:ascii:]]//g; # strip everything but ASCII characters

    If you want to specifically substitute certain character (sequences), you can do this using hex escapes in the regex, if you can't type them directly in your text editor:

    s/\x{00A1}\x{00DC}/st/g; # replace upside-down-bang capital-u-umlaut with 'st'.

    You can look up the (Unicode) hex values for capital-u-umlaut and friends in Unibook.

    Bear in mind that the text you are editing may not be encoded in Unicode, and that even if it is, some characters may display differently in a terminal (particularly a DOS box) compared to how they will in a text file. Welcome to the inconsistent mess of character encoding standards.
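
    Putting those two recipes together in a runnable sketch (the sample string is made up for illustration):

```perl
use strict;
use warnings;

# A character string containing some non-ASCII characters.
my $s = "caf\x{e9} \x{00A1}\x{00DC}";

# Recipe 1: strip everything but ASCII characters.
(my $ascii_only = $s) =~ s/[^[:ascii:]]//g;

# Recipe 2: replace a specific character sequence, addressed by
# its Unicode code points rather than typed literally.
$s =~ s/\x{00A1}\x{00DC}/st/g;
```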

      I have no idea if perlmonks prefers replying to the original thread or starting a new one when a certain amount of time has passed since the last reply, but I guess I'll find out soon enough. I too had odd characters like ‡ and †, which were removed flawlessly, thanks to...

       s/[^[:ascii:]]//g

      ... but it also removed every * in the file as well, even though * is an ASCII character. Any reason why?

Re: replacing special characters in file
by blazar (Canon) on Aug 05, 2005 at 10:28 UTC
    If you just want to remove them or, say, replace them with underscores, this may be just as simple as
    s/[^[:print:]]/_/g;
    otherwise if you want to replace each one of them with a string of your choice, you could build up a suitable hash and then
    s/[^[:print:]]/$hash{$&}/g; # Some people dislike $&
    It's not entirely clear to me whether you already have such a list or you're searching for one. If you have one for only a limited number of chars, you may want to
    s/[^[:print:]]/$hash{$&} || '_'/ge;
    # or
    s/[^[:print:]]/exists $hash{$&} ? $hash{$&} : '_'/ge;
    Or else you may want to use some URI escaping package e.g. URI::Escape.

    Update: reading Joost's reply, I realize that I may have completely misunderstood your question.
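
    The hash-based replacement above can be sketched concretely. The mapping table here is hypothetical (and assumes Latin-1 bytes); the point is only the shape of the technique -- build a character class from the hash keys and substitute through the hash:

```perl
use strict;
use warnings;

# Hypothetical mapping table -- keys are the raw bytes to replace
# (assumed Latin-1 here; adjust to whatever your data really uses).
my %subst = (
    "\xB1" => '+/-',    # plus-minus sign
    "\xA9" => '(c)',    # copyright sign
);

# Build a character class from the keys, then substitute via the hash.
my $class = join '', map { quotemeta } keys %subst;
my $in    = "temp \xB1 2, \xA9 2005";
(my $out  = $in) =~ s/([$class])/$subst{$1}/g;
```

    Using a capture group and $1 instead of $& sidesteps the $& performance concern mentioned in the comment.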

Re: replacing special characters in file
by anonymized user 468275 (Curate) on Aug 05, 2005 at 11:54 UTC
    If you want to transpose them into ASCII tokens that can be translated back as necessary, here is a simple script demonstrating one possible schema:
    open my $fh, '<', $file or die "Can't open $file: $!";
    read $fh, my $data, -s $file;
    my $transposed = '';
    for ( my $l = length($data), my $i = 0; $i < $l; $i++ ) {
        my $chr   = substr( $data, $i, 1 );
        my $ascii = ord($chr);
        if ( $chr eq "\n" ) {
            $transposed .= $chr;
            next;
        }
        if ( $ascii < 32 or $ascii > 126 ) {
            # e.g. ctrl-C becomes '\003'
            $transposed .= '\\' . Lzro3($ascii);
        }
        else {
            # escape backslash to make it easier to translate back again
            $chr eq '\\' and $chr .= '\\';
            $transposed .= $chr;
        }
    }

    sub Lzro3 {
        my $n = shift;
        return '00' . $n if $n < 10;
        return '0'  . $n if $n < 100;
        return $n;
    }
    And to translate back again, just look for the backslash which will always be followed by either the three digit ascii code or a backslash.
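
    A sketch of that reverse translation, assuming the transposed text contains only '\' followed by either three digits or another '\' (as the forward pass guarantees):

```perl
use strict;
use warnings;

# Undo the escaping: '\\' -> '\', '\NNN' -> chr(NNN).
sub untranspose {
    my $t = shift;
    $t =~ s/\\(\\|\d{3})/$1 eq "\\" ? "\\" : chr($1)/ge;
    return $t;
}
```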

    One world, one people

Re: replacing special characters in file
by graff (Chancellor) on Aug 05, 2005 at 22:21 UTC
    Here is a simpler variant of the approach suggested by anonymized user 468275 above. It will help you to diagnose the non-ASCII character content of a given file (byte values between 128 and 255), and give you a simple way to copy/paste the numeric references for (strings of) single-byte characters, so you can specify replacements for them. The following script is a simple stdin-stdout filter -- run it like this: "chr-filter.pl orig.file > viewable.file"
    #!/usr/bin/perl
    while (<>) {
        s/([\x80-\xff])/sprintf "\\x{%02x}", ord($1)/eg;
        print;
    }
    So, in the output, you'll see things like "\x{e8}" if the input contained an iso-8859-1 encoded version of "è", and so on.

    If the input data you're working with happens to be utf8-encoded, then it will be better to use "binmode( $fh, ':utf8' )" on the file handle before reading the data; then you can just treat the contents as Unicode characters (see perldoc perlunicode).

    Ideally, you'll be able to tell from the context around a given (string of) "\x{HH}" symbol(s) what sort of thing you want to replace it with.
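
    For the utf8 case, the decoding layer can be applied right at open time. A sketch, using an in-memory "file" to stand in for your real input file:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build some UTF-8 bytes and open them through a decoding layer,
# so reads yield characters rather than raw bytes.
my $utf8_bytes = encode('UTF-8', "caf\x{e9}\n");
open my $fh, '<:encoding(UTF-8)', \$utf8_bytes or die $!;
my $line = <$fh>;    # 5 characters including the newline, not 6 bytes
chomp $line;
```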

      Thank you! :-)
Re: replacing special characters in file
by rupesh (Hermit) on Aug 08, 2005 at 04:11 UTC

    juo,
    The way I see it, you need to output some printable characters, so that you can (or should be able to) get the original "special" characters for verification.

    In such cases, you should look at MIME::Base64, which will encode any character to printable ones based on a 65-character set. To get back the original string, all you need to do is decode it. I was having a similar issue a few days ago, and ikegami suggested the same. Hope it works for you as well.
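
    A minimal round-trip sketch with MIME::Base64 (the sample bytes are made up):

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

my $raw       = "caf\xe9 \xb1 2";           # bytes including non-ASCII
my $printable = encode_base64($raw, '');    # '' suppresses the trailing newline
my $back      = decode_base64($printable);  # round-trips exactly
```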

    Cheers,
    Rupesh.