PerlMonks  

Removing Non-Ascii chars from text file

by MonkPaul (Friar)
on Jun 07, 2007 at 12:09 UTC ( #619792=perlquestion )
MonkPaul has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a text file that contains text stripped from a PDF document. This text contains non-ascii characters that I have to remove before I can run it through some text-mining software.

I have looked at the ord function as a way to remove characters whose values fall outside the basic ASCII table, but I am not sure how to apply it to the whole text file. I thought of reading each line and then examining each character in turn. I have also looked at previous threads on text cleaning, but these only cover stripping out letters and desired content, not non-ASCII characters.

Does anybody have any recommendations for removing these chars?

many thanks,
MonkPaul

Re: Removing Non-Ascii chars from text file
by zentara (Archbishop) on Jun 07, 2007 at 12:22 UTC
    Look at how this works.
    #!/usr/bin/perl
    $s .= chr for 1..255;
    print $s,"\n\n";
    $s =~ tr/\x20-\x7f//cd;
    print $s,"\n\n";

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
      ^\x20-\x7E does not cover all of ASCII; the full ASCII range is ^\x00-\x7F. Otherwise it will trim out newlines and other control characters that are part of the ASCII table!

        ^\x20-\x7E This is not ASCII

        Sure it is: 32 through 126 are the printable ASCII characters (the negated class matches precisely all the characters that aren't 32 through 126).

        Correct. ASCII "includes definitions for 128 characters: 33 are non-printing control characters... and 95 printable characters..."
        See this scanned copy of the original "American Standard Code for Information Interchange (ASCII)" from 1963, the 5th page in particular. This definition is also enshrined in Internet RFC 20.
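
        To make the difference between the two ranges concrete, here is a minimal sketch. The sample string is made up for illustration; it mixes printable ASCII, a tab, a newline, and two Latin-1 characters.

        ```perl
        use strict;
        use warnings;

        # Illustrative sample: "cafe" and "naive" with accents, a tab, a newline.
        my $text = "caf\x{e9}\tna\x{ef}ve\n";

        # Keep only printable ASCII (0x20-0x7E): tab and newline are deleted too.
        (my $printable = $text) =~ tr/\x20-\x7E//cd;

        # Keep all of 7-bit ASCII (0x00-0x7F): tab and newline survive.
        (my $ascii = $text) =~ tr/\x00-\x7F//cd;

        print "printable only: [$printable]\n";
        print "full ASCII:     [$ascii]\n";
        ```

        Which range is right depends on whether the downstream tool needs line breaks and tabs preserved.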

Re: Removing Non-Ascii chars from text file
by rsriram (Hermit) on Jun 07, 2007 at 12:36 UTC

    Try this,

    $str =~ s/[^!-~\s]//g;

    In the above, !-~ is a range which matches all characters between ! and ~. The range runs from ! to ~ because these are the first and last printable, non-space characters in the ASCII table (codes 33 and 126; Alt+033 and Alt+126 in Windows). As this range does not include whitespace, \s is included separately. \t represents just the tab character, whereas \s is shorthand for a whole character class that matches any whitespace character, including space, tab, newline, carriage return and form feed. Note that on decoded Unicode strings \s can also match non-ASCII whitespace such as the no-break space, which this pattern would then keep.

    Or simply, $str =~ s/[^[:ascii:]]//g;

      Cool. This worked for me. Thanks.
Re: Removing Non-Ascii chars from text file
by citromatik (Curate) on Jun 07, 2007 at 12:42 UTC

    You can do the job with a perl one-liner:

    perl -i.bk -pe 's/[^[:ascii:]]//g;' file

    This removes every non-ASCII character from your file and keeps a copy of the original content in file.bk

    citromatik
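
    If a script is preferable to a one-liner, the same idea could be sketched roughly as below. The names strip_non_ascii and clean_file and the file names in the usage comment are hypothetical, not from this thread; the sketch assumes the input is read as raw bytes and writes to a separate output file instead of editing in place.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: delete every byte outside 0x00-0x7F.
    sub strip_non_ascii {
        my ($line) = @_;
        $line =~ s/[^[:ascii:]]//g;
        return $line;
    }

    # Hypothetical helper: copy $in_path to $out_path, cleaning line by line.
    sub clean_file {
        my ($in_path, $out_path) = @_;
        open my $in,  '<:raw', $in_path  or die "Cannot read $in_path: $!";
        open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
        print {$out} strip_non_ascii($_) while <$in>;
        close $out or die "Cannot close $out_path: $!";
        close $in;
        return;
    }

    # Usage (placeholder file names):
    # clean_file('input.txt', 'cleaned.txt');
    ```

    Because the file is read as bytes, every byte of a multi-byte UTF-8 character is 0x80 or above, so the whole character is removed.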

Re: Removing Non-Ascii chars from text file
by Anonymous Monk on Jun 08, 2007 at 02:26 UTC
    perl -pe '($c,$d)=(32,126); s/(.)/(ord($^N)>$c-1 and ord($^N)<$d+1)?$^N:""/ge;'
    (There are much better solutions available if you don't want to specify a range, but something like this is what I gathered from your post.)
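
    For readers who find the one-liner dense, a more spelled-out version of the same ord-based range check might look like this. keep_range is a hypothetical name; the sketch keeps newlines unconditionally to mirror the one-liner, where "." never matches "\n".

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: keep characters whose code lies in [$min, $max].
    # Newlines are always kept, matching the behaviour of the one-liner.
    sub keep_range {
        my ($text, $min, $max) = @_;
        return join '', grep {
            my $code = ord $_;
            $_ eq "\n" or ($code >= $min and $code <= $max);
        } split //, $text;
    }

    # Illustrative input: the tab and the accented character are dropped.
    print keep_range("tab\there, caf\x{e9}\n", 32, 126);
    ```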
Re: Removing Non-Ascii chars from text file
by bart (Canon) on Jun 08, 2007 at 10:07 UTC
    This text contains non-ascii characters that I have to remove before I can run it through some text-mining software.
    You don't expect to have to handle any accented characters? Those aren't "Ascii".
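
    If accented letters should be folded to their base letters instead of being deleted, one possible approach (not proposed in this thread) is to decompose the text with the core Unicode::Normalize module and drop the combining marks before stripping what remains. fold_to_ascii is a hypothetical name, and the sketch assumes the input has already been decoded into a character string.

    ```perl
    use strict;
    use warnings;
    use Unicode::Normalize qw(NFD);

    # Hypothetical helper: fold accented letters to their base letter,
    # then drop whatever non-ASCII characters remain.
    sub fold_to_ascii {
        my ($text) = @_;              # assumed to be a decoded character string
        $text = NFD($text);           # e-acute becomes "e" followed by U+0301
        $text =~ s/\p{Mn}//g;         # remove the combining marks
        $text =~ s/[^\x00-\x7F]//g;   # remove anything still outside ASCII
        return $text;
    }

    # Illustrative input written with escapes to keep this source ASCII-only.
    print fold_to_ascii("r\x{e9}sum\x{e9} na\x{ef}ve"), "\n";
    ```

    Characters with no canonical decomposition, such as the German sharp s, are still removed by the last substitution.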
