Re: Removing Non-Ascii chars from text file
by citromatik (Curate) on Jun 07, 2007 at 12:42 UTC
|
perl -i.bk -pe 's/[^[:ascii:]]//g;' file
This removes all non-ASCII characters from your file and saves the original content in file.bk.
citromatik
| [reply] [d/l] |
|
You've made my day :)
busy with csv parsing and couldn't get rid of " € 1.000 " from one of the fields
| [reply] |
|
Sounds more like wrong encoding, though.
| [reply] |
|
This works. I just don't want the backup file. How do I do that?
| [reply] |
Re: Removing Non-Ascii chars from text file
by zentara (Archbishop) on Jun 07, 2007 at 12:22 UTC
|
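#!/usr/bin/perl
use strict;
use warnings;

my $s = '';
$s .= chr for 1..255;      # build a string holding characters 1 through 255
print $s,"\n\n";
$s =~ tr/\x20-\x7f//cd;    # delete (d) the complement (c) of 0x20-0x7f; this also drops \n and \t
print $s,"\n\n";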
| [reply] [d/l] |
|
^\x20-\x7E
This is not all of ASCII, only its printable subset. The full ASCII range is:
^\x00-\x7F
Otherwise it will strip newlines and other control characters that are part of the ASCII table!
| [reply] |
|
Correct. ASCII "includes definitions for 128 characters: 33 are non-printing control characters... and 95 printable characters..."
See this scanned copy of the original "American Standard Code for Information Interchange (ASCII)" from 1963, the 5th page in particular. This definition is also enshrined in Internet RFC 20.
| [reply] |
|
^\x20-\x7E
This is not ASCII
Sure it is, 32 through 126 (precisely, all the characters that aren't 32 through 126).
| [reply] |
Re: Removing Non-Ascii chars from text file
by rsriram (Hermit) on Jun 07, 2007 at 12:36 UTC
|
Try this,
$str =~ s/[^!-~\s]//g;
In the above, !-~ is a range that matches every character from ! to ~. The range runs from ! to ~ because these are the first and last printable, non-space characters in the ASCII table (codes 33 and 126; Alt+033 and Alt+126 on Windows). As this range does not include whitespace, \s is added separately. \t represents only a tab character, whereas \s is shorthand for a whole character class that matches any whitespace character, including space, tab, newline and carriage return.
Or simply, $str =~ s/[^[:ascii:]]//g; (this variant also keeps ASCII control characters).
| [reply] [d/l] [select] |
|
Cool. This worked for me. Thanks.
| [reply] |
Re: Removing Non-Ascii chars from text file
by bart (Canon) on Jun 08, 2007 at 10:07 UTC
|
This text contains non-ascii characters that I have to remove before I can run it through some text-mining software.
You don't expect to have to handle any accented characters? Those aren't "Ascii".
| [reply] |
Re: Removing Non-Ascii chars from text file
by Anonymous Monk on Jun 08, 2007 at 02:26 UTC
|
perl -pe '($c,$d)=(32,126); s/(.)/(ord($^N)>$c-1 and ord($^N)<$d+1)?$^N:""/ge;'
(There are much better solutions available if you don't want to specify a range, but something like this is what I gathered from your post.)
| [reply] [d/l] |