http://www.perlmonks.org?node_id=619792

MonkPaul has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a text file that contains text stripped from a PDF document. This text contains non-ascii characters that I have to remove before I can run it through some text-mining software.

I have looked at the ord function as a way to remove the characters whose values fall outside the basic ASCII table, but I am not sure how to apply it across the whole text file. I thought of parsing each line, then looking at each letter/non-letter in turn. I have also looked at previous threads on text cleaning, but those only cover stripping out letters and desired content, not non-ASCII characters.

Does anybody have any recommendations for removing these chars?

many thanks,
MonkPaul

Replies are listed 'Best First'.
Re: Removing Non-Ascii chars from text file
by citromatik (Curate) on Jun 07, 2007 at 12:42 UTC

    You can do the job with a perl one-liner:

    perl -i.bk -pe 's/[^[:ascii:]]//g;' file

    This will remove all non-ASCII characters from your file, keeping a copy of the original content in file.bk

    citromatik

      You also need to be aware of what the encoding of the file is and what encoding Perl defaults to on your system. For a UTF-8 file on Windows, I found that I needed to add "use open qw(:std :utf8);" before the "s///" command so that Perl would expect the input to be UTF-8.

      You've made my day :) busy with csv parsing and couldn't get rid of " € 1.000 " from one of the fields
        Sounds more like wrong encoding, though.
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      This works. I just don't want the backup file. How do I do that?

        See perlrun on what the switches do. For your case especially relevant is the -i switch, which takes an optional parameter.

Re: Removing Non-Ascii chars from text file
by zentara (Archbishop) on Jun 07, 2007 at 12:22 UTC
    Look at how this works.
    #!/usr/bin/perl
    $s .= chr for 1..255;
    print $s,"\n\n";
    $s =~ tr/\x20-\x7f//cd;
    print $s,"\n\n";

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
      ^\x20-\x7E is not ASCII. This is the real ASCII range: ^\x00-\x7F. Otherwise it will trim out newlines and other special characters that are part of the ASCII table!

        Correct. ASCII "includes definitions for 128 characters: 33 are non-printing control characters... and 95 printable characters..."
        See this scanned copy of the original "American Standard Code for Information Interchange (ASCII)" from 1963, the 5th page in particular. This definition is also enshrined in Internet RFC 20.

        ^\x20-\x7E This is not ASCII

        Sure it is: 32 through 126 (the negated class matches precisely the characters that aren't 32 through 126).

Re: Removing Non-Ascii chars from text file
by rsriram (Hermit) on Jun 07, 2007 at 12:36 UTC

    Try this,

    $str =~ s/[^!-~\s]//g;

    In the above, !-~ is a range that matches every character from ! to ~. These are the first and last printable, non-space characters in the ASCII table (codes 33 and 126; Alt+033 and Alt+126 on Windows). Because this range does not include whitespace, \s is added separately. \t represents only a tab character, whereas \s is shorthand for a whole character class that matches any whitespace character, including space, tab, newline, carriage return and form feed.

    Or simply, $str =~ s/[^[:ascii:]]//g;

      Cool. This worked for me. Thanks.
Re: Removing Non-Ascii chars from text file
by bart (Canon) on Jun 08, 2007 at 10:07 UTC
    This text contains non-ascii characters that I have to remove before I can run it through some text-mining software.
    You don't expect to have to handle any accented characters? Those aren't "Ascii".
Re: Removing Non-Ascii chars from text file
by Anonymous Monk on Jun 08, 2007 at 02:26 UTC
    perl -pe '($c,$d)=(32,126); s/(.)/(ord($^N)>$c-1 and ord($^N)<$d+1)?$^N:""/ge;'
    (There are much better solutions available if you don't want to specify a range, but something like this is what I gathered from your post.)