Re^2: Removing Foreign Characters

by existem (Sexton)
on Jan 27, 2005 at 16:49 UTC (#425621)

in reply to Re: Removing Foreign Characters
in thread Removing Foreign Characters

This is exactly what I'm trying to do. At the moment I just have funny characters, when all I want is English... Perhaps I will have to just translate them manually as well; all the other Encode stuff seems very confusing!

Replies are listed 'Best First'.
Re^3: Removing Foreign Characters
by graff (Chancellor) on Jan 28, 2005 at 06:17 UTC
    Here's a little script I cooked up not long ago to "deaccent" letters -- you need Perl version 5.8.0 or later to run it, and it assumes that your input text (from STDIN or file(s) named on the command line) is in utf-8:
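        #!/usr/bin/perl -CDS
        # Note: perl 5.10.1 and later also want -CDS repeated on the command line.
        use strict;
        require 5.008;

        # Pull the Latin letter entries out of perl's own Unicode name table
        # (an internal file shipped with perl 5.8; its name and format may
        # differ in other versions).
        my @charnames = grep /\tLATIN \S+ LETTER/, split( /^/, do 'unicore/Nam' );

        # For each base letter, build a string of all its accented variants.
        my %accents;
        for my $c ( split //, qq/AEIOUCNYaeioucny/ ) {
            my $case = ( $c eq lc $c ) ? 'SMALL' : 'CAPITAL';
            $accents{$c} = join( '', map { chr hex( substr $_, 0, 4 ) }
                                 grep /\tLATIN $case LETTER \U$c WITH/, @charnames );
        }

        # now use each element of %accents as a character class:
        while (<>) {
            for my $c ( keys %accents ) {
                s/[$accents{$c}]/$c/g;
            }
            print;
        }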
    If your original text is not utf8, well, you have to know what the encoding really is; then you can either find a way to convert to utf8 (e.g. there's an "iconv" tool on many systems, or you can use the Encode module in perl, which isn't that tough, really), OR you can hard-code all those conversions by hand instead of using the script shown above.
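As a rough illustration of the Encode route mentioned above, here is a minimal sketch that converts a byte string to utf8. It assumes the source encoding is ISO-8859-1; that choice, the helper name `latin1_to_utf8`, and the sample string are illustrative assumptions, not something stated in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the input is ISO-8859-1 (Latin-1). Substitute the real
# encoding name of your data here.
my $source_encoding = 'iso-8859-1';

# Hypothetical helper: raw bytes in the source encoding -> UTF-8 bytes.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode( $source_encoding, $bytes );   # bytes -> characters
    return encode( 'UTF-8', $chars );                 # characters -> UTF-8 bytes
}

# "caf\xE9" is the Latin-1 byte string for "cafe" with an acute on the e.
my $utf8_bytes = latin1_to_utf8("caf\xE9");

# Show the result as hex so the terminal's own encoding does not matter.
print join( ' ', map { sprintf '%02X', ord } split //, $utf8_bytes ), "\n";
# prints: 63 61 66 C3 A9
```

Once the data is utf8, a script like the one above can read it directly.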

    Based on one of your replies, you would be happy with converting the accented characters to symbolic entity references (&aacute; and so on). I think your hard-coded hash is as good a solution as any for that, so long as the encoding you used to write the Perl code matches the encoding of your text data.

Re^3: Removing Foreign Characters
by g0n (Priest) on Jan 27, 2005 at 17:17 UTC
    The other solutions address character encoding issues: different byte sequences can represent the same character, depending on the encoding.

    For example: e acute might be one binary sequence in latin1, and a different binary sequence in UTF8 (and is, in fact).
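To make the example concrete, this small sketch prints the bytes that one and the same character, e acute (U+00E9), becomes in each encoding. The helper name `hex_bytes` is made up for the illustration; `encode` is the standard Encode function.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = "\x{E9}";    # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# Hypothetical helper: encode a character string and list its bytes in hex.
sub hex_bytes {
    my ( $encoding, $chars ) = @_;
    my $bytes = encode( $encoding, $chars );
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

print "latin1: ", hex_bytes( 'iso-8859-1', $e_acute ), "\n";    # latin1: E9
print "UTF-8:  ", hex_bytes( 'UTF-8',      $e_acute ), "\n";    # UTF-8:  C3 A9
```

One byte in latin1, two bytes in UTF-8, but still the same character.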

    The problem with what you are trying to do is that it is not a translation between different representations of the same character (which is what people immediately think of): you want to translate one character (e acute) into a totally different one (e, no acute).

    I have some code to do this, but sadly not with me. I could post or mail it at the weekend.
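In the meantime, one possible way to do the character-to-character translation described above (e acute to plain e) is sketched below. It is not the code referred to in this post, and it is a different technique from graff's script: it uses Unicode decomposition via the core Unicode::Normalize module, then deletes the combining marks. The function name `strip_accents` is an illustrative choice. It assumes the text has already been decoded into Perl characters, and letters with no decomposition (o with stroke, sharp s and the like) pass through unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: drop accents by decomposing, then removing marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);    # e acute -> "e" + U+0301 COMBINING ACUTE
    $decomposed =~ s/\p{Mn}//g;     # delete all nonspacing combining marks
    return $decomposed;
}

# Sample input written with escapes so the source file encoding is irrelevant.
print strip_accents("\x{E9}t\x{E9} \x{C0} la carte"), "\n";
# prints: ete A la carte
```

The result is plain ASCII for this input, so it can be printed without setting an output layer.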



      Thanks for the help guys, I think I've kind of hacked this one ;) Here's what I've done.

      This is actually PHP, I did it on the front end, rather than at the point of loading into the database.

      $trans = array(
          "À" => "&Agrave;", "à" => "&agrave;", "Á" => "&Aacute;", "á" => "&aacute;",
          "Ã" => "&Atilde;",
          "Ì" => "&Igrave;", "ì" => "&igrave;", "Í" => "&Iacute;", "í" => "&iacute;",
          "Î" => "&Icirc;",  "î" => "&icirc;",
          "Ò" => "&Ograve;", "ò" => "&ograve;", "Ó" => "&Oacute;", "ó" => "&oacute;",
          "Ô" => "&Ocirc;",  "ô" => "&ocirc;",
          "é" => "&eacute;", "è" => "&egrave;", "È" => "&Egrave;",
          "Ù" => "&Ugrave;", "ù" => "&ugrave;", "Ú" => "&Uacute;", "ú" => "&uacute;",
          "Û" => "&Ucirc;",  "û" => "&ucirc;"
          // two further pairs from the original post are unreadable and omitted
      );
      // strtr() replaces every array key found in the string with its value
      $name = strtr($row["name"], $trans);

      I'm off to the pub now to think about it a bit more ;)

      Quick poll of opinion:

      I've had a requirement to do this a couple of times, and evidently existem has now too. Is it worth modularising this for different encoding schemes as Text::StripAccent or some such name?

