PerlMonks  

testing if a string is ascii

by Anonymous Monk
on Sep 04, 2006 at 05:13 UTC ( #571003=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I want to be able to verify that a string contains only ascii characters, not ISO-8859-1, or UTF-8, or anything else. Is a regex like this sufficient?
$str !~ /[\x80-\xff]/;
I know that is good enough to eliminate ISO-8859-1 but I'm not sure of other encodings. Should I be using some module's function? If so, which one?

Re: testing if a string is ascii (^)
by tye (Cardinal) on Sep 04, 2006 at 06:05 UTC

    No, that isn't enough. Go the other direction, negating the list of things that are what you want:

    $str !~ /[^\0-\x7f]/

    - tye        
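A minimal sketch of why the direction of the test matters, assuming the string holds decoded characters rather than raw bytes. The helper names `no_high_bytes` and `only_ascii` are made up for this illustration. A character above \xff slips past the original test because it is not in the \x80-\xff range, while the negated class catches it:

```perl
use strict;
use warnings;

# Hypothetical helper names, chosen for this illustration only.
sub no_high_bytes { my ($s) = @_; return $s !~ /[\x80-\xff]/ }    # test from the question
sub only_ascii    { my ($s) = @_; return $s !~ /[^\0-\x7f]/ }     # negated-class test

my %samples = (
    'plain ASCII'  => "hello",
    'Latin-1 char' => "b\xe1",
    'wide char'    => "smile \x{263A}",
);

for my $name ( sort keys %samples ) {
    my $s = $samples{$name};
    printf "%-12s question=%-9s negated=%s\n", $name,
        ( no_high_bytes($s) ? "ascii" : "not ascii" ),
        ( only_ascii($s)    ? "ascii" : "not ascii" );
}
```

If the string is still a sequence of undecoded UTF-8 bytes, every non-ASCII byte is at least \x80, so the two tests should agree in that case.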

      Or, $str !~ /[^[:ascii:]]/
      Actually, I think the better way would be to return true on the first non-ASCII character, and treat "true" as the condition to be avoided:
      sub nonascii {
          local $_ = shift;
          /[^\0-\x7f]/;    # or /[^[:ascii:]]/
      }
      for my $test ( split( " ", "foo bar b\xe1 baz" ) ) {
          next if nonascii($test);
          print "$test\n";
      }
      (update: this is just a stylistic difference relative to  $str !~ /[^\0-\x7f]/ -- to avoid the cognitive challenge of the double-negative)
Re: testing if a string is ascii
by rsriram (Hermit) on Sep 04, 2006 at 06:22 UTC

    Hi, Assuming your string is in $str, try this:

    if ($str =~ /[^!-~\s]/g){print "Non-ASCII character found"}

    This checks for any character that is not an ASCII character.

      There are more characters in ASCII below space than just \t, \r and \n.

      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      Hi, Assuming your string is in $str, try this:

      if ($str =~ /[^!-~\s]/g){print "Non-ASCII character found"}

      This checks for any character that is not an ASCII character.

      Hi,
      Can you explain what this statement means?
      I don't understand what [^!-~\s] does.
      Regards, Enzo

        Hi Enzo, Here is what the regex means:

        ^ is the negation operator when it is the first character inside a character class [...]: the class then matches any character that is not listed.

        !-~ is a range that matches every character from ! to ~. These are the first and last printable, non-space characters in the ASCII table (decimal 33 for ! and 126 for ~). Because this range does not include whitespace, \s is added separately. \s is a shorthand character class that matches any whitespace character: space, tab (\t), newline, carriage return and form feed.

        The complete statement means: "the test is true if the string contains any character that is neither in the ASCII range ! to ~ nor whitespace."

        Sriram
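A small sketch of where this character class and a strict ASCII test can disagree. The helper name `flagged` is hypothetical. ASCII control characters outside \s are reported as non-ASCII, and, assuming the string is a decoded character string, \s also accepts Unicode whitespace that is not ASCII at all:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex discussed above.
# Returns 1 when the regex reports a "non-ASCII" character, else 0.
sub flagged { my ($s) = @_; return $s =~ /[^!-~\s]/ ? 1 : 0 }

print flagged("plain text\n"),    "\n";    # printable ASCII plus whitespace: not flagged
print flagged("bell\x07"),        "\n";    # BEL is ASCII, yet it is flagged
print flagged("em\x{2003}space"), "\n";    # U+2003 is not ASCII, yet \s lets it through
```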

Re: testing if a string is ascii
by davido (Archbishop) on Sep 04, 2006 at 07:11 UTC

    How about using transliteration? Transliteration is marginally quicker than pattern matching. Here's a solution that will trigger the error condition if the string contains anything outside of \0-\x7f range. It works by using the /c modifier on a tr/// transliteration operator. Refresher course: the /c modifier complements the search list, and if no "replace" list is specified, one that exactly matches the search list is generated behind the scenes. That has the effect of leaving the original string untouched, only counting characters that match the criteria (or in this case, counting the ones that match the complement to the criteria specified, thanks to /c)

    for my $str ( "\x7f", "asdf", "asdf\x8f", "\x8f" ) {
        print "$str contains ",
            ( $str =~ tr/\0-\x7f//c ) ? "non-" : "only ",
            "ascii.\n";
    }

    A lot of code there is just testing framework. The engine at work is this:

    tr/\0-\x7f//c

    If that tests positive, you've got non-ascii characters in your string. Details about tr/// can be found in perlop.


    Dave
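To make the "only counting" behaviour concrete, here is a short sketch with a made-up sample string. With an empty replacement list and no /d modifier, tr/// returns the number of matching characters and leaves the string as it was:

```perl
use strict;
use warnings;

my $str    = "na\xefve caf\xe9";    # sample string with two Latin-1 characters
my $before = $str;

# /c complements the search list, so this counts characters outside \0-\x7f.
my $count = ( $str =~ tr/\0-\x7f//c );

print "non-ASCII characters: $count\n";
print "string unchanged\n" if $str eq $before;
```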

      How about using transliteration? Transliteration is marginally quicker than pattern matching.

      I think your dogma got hit by some carma. How do you think using tr/// to hit every single character in a string will be faster than m// (without /g) that can immediately stop when it finds the first problem character?

      - tye        
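One way to check the speed claim rather than argue about it is the core Benchmark module. The sketch below uses an assumed worst case for tr///: the offending character is first, followed by a long run of ASCII. The sub names and the test string are invented for this example, and the numbers will vary by machine and Perl version:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: one non-ASCII character, then a million ASCII ones.
my $str = "\xe1" . ( "a" x 1_000_000 );

sub by_regex { return $str =~ /[^\0-\x7f]/ ? 1 : 0 }             # may stop at the first hit
sub by_tr    { return ( $str =~ tr/\0-\x7f//c ) ? 1 : 0 }        # counts over the whole string

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, { 'm//' => \&by_regex, 'tr///' => \&by_tr } );
```

For a string that is entirely ASCII, both approaches have to look at every character, so that case is worth benchmarking separately before drawing conclusions.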

Re: testing if a string is ascii
by harryf (Beadle) on Sep 04, 2006 at 09:31 UTC

    One question - what are you going to use the string for?

    If it's going to be used in XML / HTML, you should also check that there are no ASCII control codes other than \r, \n and \t, e.g.

    if ( $str =~ /[^\x09\x0A\x0D\x20-\x7E]/g ){ print "Contains invalid characters"; }

    See HOWTO Avoid Being Called a Bozo When Producing XML.
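A brief sketch contrasting the plain ASCII test with the stricter XML-oriented one above. The helper names are hypothetical. Control characters such as NUL and form feed are ASCII, but the stricter class rejects them:

```perl
use strict;
use warnings;

# Hypothetical helper names for this illustration.
sub is_ascii          { my ($s) = @_; return $s !~ /[^\0-\x7f]/ ? 1 : 0 }
sub is_xml_safe_ascii { my ($s) = @_; return $s !~ /[^\x09\x0A\x0D\x20-\x7E]/ ? 1 : 0 }

for my $s ( "line one\r\n\tline two", "form feed\x0C", "nul\x00" ) {
    # Show non-printable characters as \xNN escapes so the output stays readable.
    ( my $shown = $s ) =~ s/([^\x20-\x7E])/sprintf '\\x%02X', ord $1/ge;
    printf "%-30s ascii=%d xml_safe=%d\n", $shown, is_ascii($s), is_xml_safe_ascii($s);
}
```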

Node Type: perlquestion [id://571003]
Approved by BrowserUk
Front-paged by monkfan