http://www.perlmonks.org?node_id=909862

ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I was wondering if there is a way to remove an é symbol (or any other special character) from a file using a one-liner similar to

perl -i -pe 'tr/\015//d' yourprogram.pl

which I have used to remove the special character ^r. Also, is there a list of numbers and their corresponding special characters somewhere, so that you can edit any special character you might come across? Sorry if this is a dumb question; I'm pretty new. Any help would be greatly appreciated! Thanks

Re: Special Characters list
by planetscape (Chancellor) on Jun 15, 2011 at 21:39 UTC

      OK, I looked up the code, but the following doesn't seem to be working:

      perl -i -pe 'tr/\xe9/FIX/d' test.txt

      or

      perl -i -pe 'tr/\00233/FIX/d' test.txt

      What am I doing wrong? Thanks so much for the help, you guys are great.

        tr/\xe9/FIX/d is the same as tr/\xE9/F/ and converts chr(0xE9) to "F".

        tr/\00233/FIX/d is the same as tr/3\002/IF/ and converts chr(002) to "F" and "3" to "I".

        What am I doing wrong?

        What are you trying to do?

        You previously indicated that you wanted to remove "é" characters that were encoded as E9.

        perl -i -pe 'tr/\xE9//d' yourprogram.pl

        But the "FIX" in the new code seems to indicate you want something else.

Re: Special Characters list
by ikegami (Patriarch) on Jun 15, 2011 at 21:35 UTC

    Wikipedia has charts for many encodings. Your "é" was probably encoded using iso-8859-1, iso-8859-15 or Windows-1252.

    If the text had been decoded, then you'd be dealing with Unicode codepoints. You can find the numbers of Unicode codepoints (in hex) on the Unicode code charts.

    Another example, Unicode codepoint 20AC: "€":

    iso-8859-15    A4
    Windows-1252   80
    UTF-8          E2 82 AC
    UTF-16le       AC 20 (at an even offset)
Re: how to remove é
by 7stud (Deacon) on Jun 15, 2011 at 23:11 UTC

    You've opened a can of worms. You need to read up on Unicode. The bottom line is that you need to know the "encoding" of any data you read from a file. An "encoding" tells perl how many bytes each integer in your file occupies. Remember that computers store characters as integers.

    Here's an example. Suppose these bytes are in your file:

    0000 0001 0000 1000

    If you tell perl that your file is encoded in such a way that the first integer occupies 1 byte, then perl will read the following for the first integer:

    0000 0001

    which is equivalent to 1 in decimal. However, if you tell perl that your file is encoded in such a way that the first integer occupies 2 bytes, then perl will read the following for the first integer:

    0000 0001 0000 1000

    which is equivalent to 8 + 256 = 264 in decimal. So depending on what encoding you specify, perl will read in a different integer (and again, remember that the integers are just codes for characters).

    By the way, \015 is not the special character ^r (up arrow+r). \015 is the octal syntax for the decimal integer 13, which is the ASCII code for a carriage return. The fact that you tried to remove them from a file is very suspect. Please explain why you were doing that.

      perl -i -pe 'tr/\015/\n/d'

      It's a quick fix for getting rid of Excel or other Windows markup when reading files in a Unix or Linux environment. This one just converts the carriage returns, as you said.