
Re^4: keeping diacritical marks in a string

by Foxpond Hollow (Sexton)
on Oct 09, 2009 at 06:51 UTC (#800172)

in reply to Re^3: keeping diacritical marks in a string
in thread keeping diacritical marks in a string

Thank you! You were right. One of the things I forgot to mention was that I was doing a substitution later to remove any punctuation characters, and it was:


I didn't realize that accented letters didn't count in the \w match; I figured they were still alphanumeric. Well, that's kind of annoying. I was using that to normalize the string and remove anything like commas and semicolons. Now I have to make a list of all the characters I want to remove, instead of being able to just specify the ones I want to keep. Oh well, at least we've found the problem. Still, is there some way to strip the accent from accented letters and keep the base letter? That would seem a better solution than listing out everything that is not a letter, number, space, or letter with an accent.

Replies are listed 'Best First'.
Re^5: keeping diacritical marks in a string
by FalseVinylShrub (Chaplain) on Oct 09, 2009 at 10:42 UTC

    OK. That's good.

    Actually, I don't have a copy of Perl to verify but I think you "should" be able to get that substitution to work as you intended.

    I think the fact that it doesn't work indicates that perl is not treating the string as UTF-8. Basically, there is a flag stored in the scalar's internal data structure. If it's not set, the string is treated as just a sequence of bytes, and character operations like uc() and your substitution work in ASCII mode (actually ISO-8859-1, I've just learned; see below).
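
    A minimal sketch of that difference, assuming the input arrives as UTF-8 encoded bytes. The substitution used here is a guess at a typical punctuation-stripping regex, not necessarily the one from the original script, and strip_punctuation is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# UTF-8 encoded bytes for "cafe" with an acute accent on the "e", written
# with escapes so the example does not depend on this file's encoding.
my $bytes = "caf\xC3\xA9";

# The same text decoded into a character string (4 characters).
my $chars = decode_utf8($bytes);

# Hypothetical punctuation-stripping substitution (an assumption; the
# regex from the thread is not shown).
sub strip_punctuation {
    my ($text) = @_;
    $text =~ s/[^\w\s]//g;    # drop everything that is not \w or whitespace
    return $text;
}

my $from_bytes = strip_punctuation($bytes);
my $from_chars = strip_punctuation($chars);

# Print lengths only, to avoid depending on the terminal's encoding.
printf "bytes: %d -> %d\n", length($bytes), length($from_bytes);
printf "chars: %d -> %d\n", length($chars), length($from_chars);
```

    On the byte string, the two bytes of the accented letter are not matched by \w and get removed along with the punctuation. On the decoded string, the accented letter is a single character that \w matches, so it survives.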

    There are a number of ways to get the string to be treated as utf-8, and I'm not sure which ones are "correct" in this situation. But try doing this, after you get the $HTML and before you start doing operations on it:

        use Encode;
        # ...
        $HTML = decode_utf8($HTML);

    You can also use a more brute-force approach:


    I think the decode method is preferred, but perhaps someone else will correct / confirm.

    It is a complex topic, but the following documents are a good place to start:

    I hope this solves your problem. Keep us posted...


      Unfortunately, both suggested methods seem to change the accented characters into question marks. Which (somewhat ironically) just raises more questions.

        Oh dear. I was sure that was going to work.

        Are you sure that it's changing them to question marks? That is, are you sure it's not just that your terminal can't display the Unicode characters? Sorry for stating the obvious, but you need to view the output in a browser or something else that can display Unicode, or check the hex values to see if they're correct.
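
        A rough sketch of how the hex values could be checked. The helper name code_points is made up for illustration, and the sample string stands in for the real $HTML. The binmode line is an assumption about the cause: if STDOUT has no encoding layer, a decoded accented letter may go out as a single Latin-1 byte, which a UTF-8 terminal could show as a question mark.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: list the code point of every character in a string.
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# Sample data standing in for the real input: UTF-8 bytes for "cafe"
# with an acute accent on the final letter.
my $bytes = "caf\xC3\xA9";
my $chars = decode_utf8($bytes);

# Encode characters as UTF-8 on output (assumes a UTF-8 terminal).
binmode(STDOUT, ':encoding(UTF-8)');

print code_points($bytes), "\n";   # two separate values for the accented letter
print code_points($chars), "\n";   # one value (U+00E9) for the accented letter
print $chars, "\n";
```

        If the decoded string shows a single U+00E9 where the accented letter should be, the decode step worked and the question marks most likely come from how the output is written or displayed.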

        Alternatively, you could work around the problem by doing the input cleanup in a different way, but I think your original approach was correct and it should work if we can get perl to treat the strings as unicode.
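
        One different way to do the cleanup, which also touches on the earlier question about removing the accent and keeping the letter, is to decompose each accented letter into a base letter plus a combining mark and then delete the marks. This is only a sketch: it assumes the string has already been decoded into characters, and strip_accents and normalize_text are hypothetical names.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: remove accents but keep the base letters.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);    # e.g. U+00E9 becomes "e" + U+0301
    $decomposed =~ s/\p{Mn}//g;     # delete the combining (nonspacing) marks
    return $decomposed;
}

# Hypothetical helper: strip accents, then keep only letters, digits
# and whitespace by listing what to keep, not what to remove.
sub normalize_text {
    my ($text) = @_;
    $text = strip_accents($text);
    $text =~ s/[^\p{L}\p{N}\s]//g;
    return $text;
}

# Sample text with accented letters written as escapes.
my $sample = "Caf\x{E9}, na\x{EF}ve; d\x{E9}j\x{E0} vu!";
print normalize_text($sample), "\n";
```

        Note that this discards information: letters that differ only by an accent become identical, which may or may not be acceptable for the data in question.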

        The fact that it now does something different makes me think that it might be working, and that it's exposing another problem somewhere.

        I might not be online much for a couple of days (I'm actually in an internet cafe in Vientiane, Laos...) but I would suggest drawing attention to this thread in the chatterbox at a busy time to get someone else to look at it. It seems to have gone quiet. Or perhaps it would be justified to start a new thread with your problem more narrowed down.

        Good luck...
