PerlMonks  

Playing with extended chars

by deibyz (Hermit)
on Sep 27, 2004 at 10:53 UTC ( [id://394114]=perlquestion )

deibyz has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I'm using Text::Query::Advanced to let the user search a number of documents. Most of these documents are written in Spanish (yes, I'm Spanish, that's the reason for my bad English ;)), and they contain "funny" characters such as áéíóú. A search for "camion" should match the word "camión" as well as "camion", so I'm trying to figure out a simple way to get rid of those characters.

A simple substitution may work:

s/á/a/g; s/é/e/g; ... s/Ú/u/g;

But that would be too slow, as it would make a lot of passes through the string (possibly a long one), and I would have to watch out for more characters in the future (â, ä, à, ...).
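One way to avoid listing every accented letter is Unicode decomposition. The following is a minimal sketch, assuming the text has already been decoded into Perl characters; `strip_accents` is a hypothetical helper name, and Unicode::Normalize has shipped with perl since 5.8.0.

```perl
use strict;
use warnings;
use utf8;                            # assumption: this source file is saved as UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split each accented letter into its base letter plus
# combining marks (NFD), then delete the combining marks in a single pass.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;     # \p{Mn} matches nonspacing (combining) marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents('camión'), "\n"; # prints "camion"
```

Note that this also turns "ñ" into "n" and "ç" into "c", which may or may not be wanted for a Spanish search.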

I've tried the tr/áéíóú/aeiou/ solution, but as "á" is a two-byte character, it doesn't work.

I've read perluniintro and perlunicode, but I've not found anything that can help me.

Any ideas are welcome.

Thanks in advance,

deibyz

Edited: Title changed.

Replies are listed 'Best First'.
Re: Playing with "funny" chars
by Eyck (Priest) on Sep 27, 2004 at 12:19 UTC

    I would just use /[aąAĄ]/ instead of just /a/ in patterns that you're using.

    Besides that, have you tried use locale?

    And you should remember to set LC_CTYPE or LC_ALL beforehand...

    Also, the tr question could be solved with extended regexps: match a class of chars and, for the replacement, call a routine that returns the correct char. This solves the multiple-passes problem.
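A minimal sketch of that single-pass idea, using a lookup hash rather than a routine. The names `%plain` and `fold_accents` are made up for illustration, the table is deliberately incomplete, and the input is assumed to be a decoded character string.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Illustrative (incomplete) table; extend it as new characters turn up.
my %plain = (
    'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
    'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U',
);

# Build one character class from the keys so the string is scanned only once.
my $class    = join '', keys %plain;
my $accented = qr/[$class]/;

sub fold_accents {
    my ($text) = @_;
    $text =~ s/($accented)/$plain{$1}/g;   # look up each matched character
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print fold_accents('camión'), "\n";        # prints "camion"
```

Adding a character later means adding one hash entry; the pattern is rebuilt from the keys.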

Re: Playing with "funny" chars
by cog (Parson) on Sep 27, 2004 at 15:09 UTC
    I use this and it works:

    y[áàãâäÁÀÃÂÄéèêëÉÈÊËíìîïÍÌÎÏóòõôöÓÒÔÕÖúùûüÚÙÛÜçÇ] [aaaaaAAAAAeeeeEEEEiiiiIIIIoooooOOOOOuuuuUUUUcC]
      Could you please tell me some details about configuration, platform, etc...

      #!/usr/local/bin/perl
      use strict;
      use warnings;

      $a = 'áéíóú';
      $a =~ tr{áéíóú} {aeiou}s;
      print $a;

      __OUTPUT__
      aeaoauauau
      I'm using perl5.8.5 on RHAS (perl 5.8.0 comes with the distro, but it had problems with unicode).

      Thanks

        I'm using perl v5.8.4 on Red Hat Linux 9.0

        Here's my complete script:

        #!/usr/bin/perl -pw
        use strict;
        y[áàãâäÁÀÃÂÄéèêëÉÈÊËíìîïÍÌÎÏóòõôöÓÒÔÕÖúùûüÚÙÛÜçÇ] [aaaaaAAAAAeeeeEEEEiiiiIIIIoooooOOOOOuuuuUUUUcC]

        Your script produces the correct output on my machine ("aeiou")...

Re: Playing with "funny" chars
by mischief (Hermit) on Sep 27, 2004 at 12:57 UTC
      (oops, I wanted to reply to the first post but clicked here by accident ;) ).

      My recommendation is to use perl 5.8.0 or more recent and look at perldoc Encode, perldoc open, and perldoc -f open. If tr doesn't work because you have the characters encoded in two bytes, you can do

      $s = decode_utf8($s);

      That will convert the string into the internal representation where characters are characters and you don't have to worry about how many bytes they need for encoding.
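A minimal sketch of that approach, assuming the input arrives as UTF-8 bytes and the script itself is saved as UTF-8. Both halves matter: `decode_utf8` turns the data into characters, and `use utf8` does the same for the literals inside tr///.

```perl
use strict;
use warnings;
use utf8;                                # literals in tr/// below are characters
use Encode qw(decode_utf8 encode_utf8);

# Bytes as they might arrive from a UTF-8 file or form field
# (built with encode_utf8 here only to keep the example self-contained).
my $bytes = encode_utf8('camión');

my $chars = decode_utf8($bytes);         # "ó" is now one character, not two bytes
(my $folded = $chars) =~ tr/áéíóú/aeiou/;

print length($bytes), " bytes, ", length($chars), " chars: $folded\n";
# prints "7 bytes, 6 chars: camion"
```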

        I think the problem is not in the string (I'm using perl5.8.5, because 5.8.0 had some bugs on RedHat), but in the tr operator itself.

        The first attempt works like this:

        perl -e '$_="áéíóú";tr/áéíóú/aeiou/;print'
        aeaoauauau
        It seems that "á" is treated as two characters, maybe "´" and "a", and each one gets a different matching char ("a" and "e").

        BTW, the values returned by the encode and decode functions make me think that the string is well formed, and that it is tr// that is doing the wrong thing. Am I totally lost?
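The "aeaoauauau" output can be reproduced by spelling out the bytes, which suggests tr is behaving consistently and is simply being handed bytes instead of characters. A sketch, assuming the one-liner and its input were both UTF-8 and no `use utf8` was in effect; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# "áéíóú" as raw UTF-8 bytes: every letter is 0xC3 followed by a second byte.
my $bytes = "\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA";

# Without "use utf8", tr/áéíóú/aeiou/ sees ten bytes in its search list:
#   C3->a, A1->e, (C3 again, ignored), A9->o, then AD, B3, BA -> u,
# because the last replacement char is repeated to pad the shorter list.
(my $wrong = $bytes) =~ tr/\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA/aeiou/;
print "$wrong\n";    # prints "aeaoauauau"

# With characters on both sides, each letter is a single code point.
my $chars = decode_utf8($bytes);
(my $right = $chars) =~ tr/\x{E1}\x{E9}\x{ED}\x{F3}\x{FA}/aeiou/;
print "$right\n";    # prints "aeiou"
```

So the guess above is close: "á" is seen as two units, but they are the bytes 0xC3 and 0xA1 rather than "´" and "a".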

Re: Playing with extended chars
by chanio (Priest) on Sep 27, 2004 at 23:14 UTC
    Maybe it is something outside perl. Have you configured the locale variables on your system? Try set and look at your locale settings: LC_ALL, etc...

    It happened to me (also Spanish) and it was a mess to work out whether the problem was in perl or in my local environment variables. Please check all your default variables before trying perl's.

    use POSIX qw(strftime setlocale LC_ALL LC_CTYPE);
    my ($loc) = POSIX::setlocale( &POSIX::LC_ALL, 'es_ES.ISO8859-1' );
    my ($now_string) = strftime "%a %b %e %H:%M:%S %Y", localtime;
    my ($fecBita) = strftime "%Y-%m-%d %H:%M:%S", localtime;
    Your code is Ok!

    .{\('v')/}
    _`(___)' __________________________
    Wherever I lay my KNOPPIX disk, a new FREE LINUX nation could be established.

Node Type: perlquestion [id://394114]
Approved by Corion
Front-paged by Arunbear