Re: Normalizing diacritics in (regex) search

by hippo (Archbishop)
on Nov 24, 2025 at 12:22 UTC


in reply to Normalizing diacritics in (regex) search

Have you tried Text::Unidecode?


🦛

Re^2: Normalizing diacritics in (regex) search
by Corion (Patriarch) on Nov 24, 2025 at 12:53 UTC

    I'm also very fond of Text::Unidecode, but it does slightly more than strip diacritics. It also transliterates some non-Latin scripts into Latin, and it transliterates German umlauts to their German equivalents, like ä to ae.

    But for a quick first stab, using Text::Unidecode does 90% of what one wants.

Re^2: Normalizing diacritics in (regex) search
by LanX (Saint) on Nov 25, 2025 at 04:10 UTC
    As Corion said, it does a lot more. Probably too much for my use case.

    And it's implemented by having many translation tables which are (manually?) maintained by the author. The last version is from 2016.

    And I'd rather use Unicode properties directly, so that I always stay up to date.
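As a minimal sketch of what "using Unicode properties directly" could look like: decompose the text with the core module Unicode::Normalize and delete everything matching \p{Mn} (nonspacing marks). The helper name strip_marks is hypothetical and only serves this illustration.

```perl
use strict;
use warnings;
use utf8;                         # the string literals below contain non-ASCII characters
use Unicode::Normalize qw(NFD);   # core module

# Hypothetical helper: decompose, then drop the combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "à" becomes "a" followed by U+0300
    $decomposed =~ s/\p{Mn}+//g;   # remove the nonspacing marks
    return $decomposed;
}

print strip_marks("à á ä å"), "\n";   # prints "a a a a"
```

This only covers letters that have a canonical decomposition. Characters such as ø, æ or ß have none, so they pass through unchanged and would still need their own handling.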

    Last but not least, it doesn't give me equivalence classes for specific Latin characters, just a single function, unidecode, that "flattens" all input to Latin characters where possible.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      Last but not least, it doesn't give me equivalence classes for specific Latin characters, just a single function, unidecode, that "flattens" all input to Latin characters where possible.

      Sorry, in that case I have misunderstood your requirements as I took it that this "flattening" is what you were after when you said "Of course I could do the normalization manually and map à á ä å ... -> a and so on." - never mind.


      🦛

        No! No need to apologize, I was asking for input.

        You just asked if I tried that module and I wanted to share my insights.*

        The unidecode mapping à á ä å ... -> a would force me to normalize all search data.

        The reverse mapping a -> à á ä å lets me adjust the search term instead, by replacing every a with a character class like [àáäå], and so on for the other letters.

        Both approaches have their pros and cons; I prefer to have the choice. :)
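The reverse direction could be sketched roughly as follows. This is an illustration under stated assumptions, not a finished solution: the scanned code point ranges (U+00C0..U+024F and U+1E00..U+1EFF) are an arbitrary choice, and the names %variants and tolerant_regex are hypothetical. The idea is to derive the equivalence classes from the Unicode decomposition data instead of maintaining them by hand.

```perl
use strict;
use warnings;
use utf8;                         # the sample text below contains non-ASCII characters
use Unicode::Normalize qw(NFD);   # core module

# Map each base letter to the precomposed characters that decompose to it.
# The scanned ranges are an assumption for this sketch; widen them as needed.
my %variants;
for my $cp (0x00C0 .. 0x024F, 0x1E00 .. 0x1EFF) {
    my $char = chr $cp;
    # keep only "one base letter followed by one or more combining marks"
    next unless NFD($char) =~ /\A(\p{L})\p{Mn}+\z/;
    push @{ $variants{$1} }, $char;
}

# Hypothetical helper: turn a literal search term into a diacritic-tolerant regex.
sub tolerant_regex {
    my ($term) = @_;
    my $pattern = join '', map {
        my $class = join '', $_, @{ $variants{$_} // [] };
        # only letters have variants, so the class needs no escaping
        length($class) > 1 ? "[$class]" : quotemeta($_);
    } split //, $term;
    return qr/$pattern/;
}

my $re = tolerant_regex('pate');
print "match\n" if 'pâté' =~ $re;
```

With this the search data stays untouched and only the search term is expanded. The sketch assumes the searched text uses precomposed characters (NFC); decomposed input would have to be normalized first.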

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

        *) reworded
