PerlMonks  

Quick way to convert to ASCII

by kettle (Beadle)
on Jul 26, 2006 at 02:35 UTC (#563687=perlquestion)

kettle has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am looking for a quick and easy way to convert UTF-8 or LATIN-1 characters to their closest ASCII equivalent. Thus an accented 'e' should be mapped to a 'regular no frills' ASCII 'e', and similarly an 'A' with a tilde over it should be mapped to a standard uppercase 'A'. I can use individual hex codes and map characters with a host of regexes, but this seems like overkill. Any clever thoughts would be appreciated!

Replies are listed 'Best First'.
Re: Quick way to convert to ASCII
by blokhead (Monsignor) on Jul 26, 2006 at 04:14 UTC
    Text::Unidecode looks like it does exactly that. It's pure Perl, but since it's essentially a giant lookup table for all of Unicode, it's not small (748k).
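A minimal usage sketch, assuming Text::Unidecode has been installed from CPAN (it is not a core module). `unidecode` is the function the module provides; the `to_ascii` wrapper and the sample string are made up for illustration, and the module is loaded at run time only so that a missing install gives a readable message:

```perl
use strict;
use warnings;
use utf8;    # this file contains literal non-ASCII characters

# Text::Unidecode comes from CPAN, so load it at run time and
# remember whether that worked instead of dying at compile time.
my $have_unidecode = eval { require Text::Unidecode; 1 };

# Hypothetical wrapper: returns an ASCII approximation of $text.
sub to_ascii {
    my ($text) = @_;
    die "Text::Unidecode is not installed\n" unless $have_unidecode;
    return scalar Text::Unidecode::unidecode($text);
}

if ($have_unidecode) {
    print to_ascii("déjà vu, drôles d'œufs"), "\n";
}
else {
    print "Install Text::Unidecode from CPAN to run this example\n";
}
```

The input has to be a decoded character string (not raw UTF-8 bytes), which is why the sketch uses `use utf8` for its literal.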

    blokhead

      It gets the ligature right and has a great motto :) :

      MOTTO

      The Text::Unidecode motto is:
      It's better than nothing!

      ...in both meanings: 1) seeing the output of unidecode(...) is better than just having all font-unavailable Unicode characters replaced with "?"'s, or rendered as gibberish; and 2) it's the worst, i.e., there's nothing that Text::Unidecode's algorithm is better than.

      DWIM is Perl's answer to Gödel
Re: Quick way to convert to ASCII
by GrandFather (Saint) on Jul 26, 2006 at 03:10 UTC

    At the end of the day there has to be a lookup. That can be fairly quick using the transliteration operator (tr///):

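    use warnings;
    use strict;
    use utf8;                             # the source contains literal accented characters
    binmode STDOUT, ':encoding(UTF-8)';   # unmapped characters may remain in the output

    my $str = <<'STR';
    Les naïfs ægithales hâtifs pondant à Noël où il gèle sont sûrs d'être déçus et de voir leurs drôles d'œufs abîmés
    STR

    # Accented forms of each base letter
    my %xlateL = (
        a => 'àâ', c => 'ç', e => 'éèêë', i => 'îï', o => 'ô', u => 'ùû'
        #...
    );

    my %xlateU;
    $xlateU{uc $_} = uc ($xlateL{$_}) for keys %xlateL;   # Generate the upper case versions

    # tr/// does not interpolate variables, hence the string eval
    eval "\$str =~ tr/$xlateL{$_}/$_/;" for keys %xlateL;
    eval "\$str =~ tr/$xlateU{$_}/$_/;" for keys %xlateU;

    print $str;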

    Prints:

    Les naifs ægithales hatifs pondant a Noel ou il gele sont surs d'etre decus et de voir leurs droles d'œufs abimes

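    Note that the ligatures (æ and œ) cause a little grief, however: tr/// maps one character to one character, so it cannot turn œ into "oe". Using a regex rather than the translation, with a separate set of tables, is probably the fix for that.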

    This would make a good CPAN module when you've got it done. :)
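One way the regex-plus-table idea could look, as a rough sketch rather than a finished module. The `%to_ascii` table and the `strip_accents` name are hypothetical, and the table only covers the characters in the sample sentence. Because the lookup is per character and the replacement is an ordinary string, the ligatures can expand to two letters:

```perl
use strict;
use warnings;
use utf8;                           # literal accented characters below
use feature 'unicode_strings';      # make uc() use Unicode rules throughout
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical, deliberately incomplete table: one entry per character,
# and the replacement may be longer than one character.
my %to_ascii = (
    'à' => 'a', 'â' => 'a', 'ç' => 'c',
    'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
    'î' => 'i', 'ï' => 'i', 'ô' => 'o',
    'ù' => 'u', 'û' => 'u',
    'æ' => 'ae', 'œ' => 'oe',       # the ligatures that tr/// cannot expand
);

# Derive the upper case entries, e.g. 'Œ' => 'Oe'.
for my $lower (keys %to_ascii) {
    $to_ascii{ uc $lower } = ucfirst $to_ascii{$lower};
}

# Replace every non-ASCII character that has a table entry;
# anything without an entry is left untouched.
sub strip_accents {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/exists $to_ascii{$1} ? $to_ascii{$1} : $1/ge;
    return $text;
}

print strip_accents(
    "Les naïfs ægithales hâtifs pondant à Noël où il gèle sont sûrs "
  . "d'être déçus et de voir leurs drôles d'œufs abîmés"
), "\n";
```

A single substitution pass also avoids the string eval, since a hash lookup inside `s///e` needs no interpolation tricks.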


    DWIM is Perl's answer to Gödel
Re: Quick way to convert to ASCII
by ikegami (Patriarch) on Jul 26, 2006 at 03:04 UTC

      I notice Text::StripAccents at least (I didn't find Text::Unaccent using ppm) suffers from the ligature problem. No great surprise that something written to handle accents doesn't handle ligatures, but somewhat disappointing.
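To illustrate why accents and ligatures are separate problems, here is a sketch of a different, table-free way to strip accents using the core Unicode::Normalize module. This is not a claim about how Text::StripAccents or Text::Unaccent work internally, and `strip_marks` is a made-up name. Decomposition splits an accented letter into a base letter plus a combining mark, but œ has no decomposition, so it passes through unchanged:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);     # ships with Perl

# Decompose, then delete the combining marks that fall out.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);    # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{Mn}//g;     # drop the nonspacing marks
    return $decomposed;
}

my $accented = "d\x{e9}\x{e7}us";   # "déçus"
my $ligature = "\x{153}ufs";        # "œufs"

print strip_marks($accented), "\n";
print strip_marks($ligature) eq $ligature
    ? "ligature unchanged\n"
    : "ligature changed\n";
```

Ligatures therefore still need an explicit lookup of some kind, whatever handles the accents.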


      DWIM is Perl's answer to Gödel
Re: Quick way to convert to ASCII
by Thelonius (Priest) on Jul 26, 2006 at 12:31 UTC
    I happened across a table for this just yesterday (with Greek and Cyrillic transliterations, too), so here's some Perl from that table:
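    # %asciiize is the lookup hash built from that table (not shown here):
    # non-ASCII character => ASCII replacement

    # in-place
    sub asciiize {
        $_[0] =~ s/([^\0-\x7f])/exists($asciiize{$1})?$asciiize{$1}:"?"/eg;
        return $_[0];
    }

    # returns new
    sub giveascii {
        asciiize(my $x = shift);
    }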
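A short usage sketch of the two subs, with a three-entry stand-in for the real table (the entries are assumptions for illustration, not the table referred to above). It shows the difference between the in-place and the copying call, and the "?" fallback for characters the table does not know:

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for the full table.
my %asciiize = ( 'é' => 'e', 'ß' => 'ss', 'œ' => 'oe' );

# in-place: $_[0] is an alias for the caller's variable
sub asciiize {
    $_[0] =~ s/([^\0-\x7f])/exists($asciiize{$1})?$asciiize{$1}:"?"/eg;
    return $_[0];
}

# returns new: works on a private copy of the argument
sub giveascii {
    asciiize(my $x = shift);
}

my $word = 'cœur';
my $copy = giveascii($word);    # $word is left alone
print "$word -> $copy\n";

asciiize($word);                # $word itself is rewritten
print "$word\n";
```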

    Edited by planetscape - added readmore tags
