
Re^2: keeping diacritical marks in a string

by Foxpond Hollow (Sexton)
on Oct 09, 2009 at 02:07 UTC (#800144)

in reply to Re: keeping diacritical marks in a string
in thread keeping diacritical marks in a string

I don't know whether the site alerts you when a post you've commented on is updated, so in case it doesn't: I've updated the post with the info you asked for. Note that the second update has the correct URL; please ignore the URL in the first update.

Re^3: keeping diacritical marks in a string
by FalseVinylShrub (Chaplain) on Oct 09, 2009 at 06:06 UTC


    I can't see an obvious source of the problem. I think you need to dump out the result of the request before any processing and establish exactly where the special characters are being lost: is the text coming out of LWP correctly, is it the regex, could it be the MARC:: module, etc.

    As graff said it shouldn't be losing these characters, but there are a number of places where things can go wrong.

    It's all a bit complicated and I can't think of a good guide to it at the moment. On the other hand, I've never heard of Perl completely stripping special characters because of an encoding problem - normally, you would get a multi-byte utf-8 character treated as 2 or 3 characters if the encoding is not set correctly. So I suspect an error in some code somewhere - could it be that something is validating input and stripping out characters it doesn't think are "safe"...?
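    As a rough illustration of the "treated as 2 or 3 characters" symptom, here is a minimal sketch, assuming the response body arrives as raw UTF-8 bytes. The sample string and the hexdump helper are made up for the example; decode_utf8 comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# "café" as raw UTF-8 bytes, the way it might arrive in an undecoded response body.
my $bytes = "caf\xC3\xA9";

# The same text decoded into Perl characters.
my $chars = decode_utf8($bytes);

# Hypothetical helper: show every element of a string as hex, so that
# split or missing characters are easy to spot at each processing step.
sub hexdump {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

printf "undecoded: length %d [%s]\n", length($bytes), hexdump($bytes);
printf "decoded:   length %d [%s]\n", length($chars), hexdump($chars);
# undecoded: length 5 [63 61 66 C3 A9]
# decoded:   length 4 [63 61 66 E9]
```

    Calling something like hexdump after each step (fetch, regex, MARC:: handling) should show where the bytes or characters actually go missing.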

    Sorry I can't be of more help. Try to narrow it down to where they disappear and it will be solved eventually.

      Thank you! You were right. One of the things I forgot to mention was that I was doing a substitution later to remove any punctuation characters, and it was:


      I didn't realize that accented letters didn't count in the \w match; I figured they were still alphanumeric. Well, that's kind of annoying. I was using that to normalize the string and remove anything like commas and semicolons. Now I have to make a list of all the characters I want to remove, instead of being able to just specify the ones I want to keep. Oh well, at least we've found the problem. Still, is there some way to convert accented letters so that the accent is removed and the letter is kept? That would seem a better solution than listing out everything that is not a letter, number, space, or letter with an accent.
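      One possible way to do the accent removal asked about above, as a sketch only: decompose the text with the core Unicode::Normalize module and drop the combining marks. The strip_accents helper and the sample title are hypothetical, and because the original substitution is not shown in this post, s/[^\w\s]//g is just a stand-in for it. The helper assumes the string has already been decoded into characters.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: expects a decoded character string, not raw bytes.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "é" becomes "e" followed by U+0301 (combining acute)
    $decomposed =~ s/\p{Mn}//g;    # remove the nonspacing combining marks
    return $decomposed;
}

my $title = "Caf\x{E9} na\x{EF}ve, d\x{E9}j\x{E0} vu;";
my $plain = strip_accents($title);

# Stand-in for the punctuation-removing substitution: keep word characters and whitespace.
$plain =~ s/[^\w\s]//g;

print "$plain\n";   # Cafe naive deja vu
```

      Letters that have no base-plus-accent decomposition (ø, ß and æ, for example) pass through unchanged, so this is not a complete transliteration.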

        OK. That's good.

        Actually, I don't have a copy of Perl handy to verify this, but I think you "should" be able to get that substitution to work as you intended.

        I think the fact that it doesn't work indicates that perl is not treating the string as utf-8. Basically, there is a flag stored internally with each string. If it's not set, the string is treated as a plain sequence of bytes, and character operations like uc() and your substitution will work in ASCII mode (actually ISO-8859-1, I've just learned; see below).

        There are a number of ways to get the string to be treated as utf-8, and I'm not sure which ones are "correct" in this situation. But try doing this, after you get the $HTML and before you start doing operations on it:

        use Encode;
        # ...
        $HTML = decode_utf8($HTML);

        You can also use a more brute-force approach:


        I think the decode method is preferred, but perhaps someone else will correct / confirm.
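        To make the difference concrete, here is a small sketch of the decode approach. It assumes the lost substitution was something like s/[^\w\s]//g, which is a guess, and the sample string and the strip_punct helper are invented for the example. It also assumes a plain script without "use locale" or the unicode_strings feature.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Stand-in for the punctuation-removing substitution (an assumption):
# drop everything that is neither a word character nor whitespace.
sub strip_punct {
    my ($s) = @_;
    $s =~ s/[^\w\s]//g;
    return $s;
}

my $raw     = "r\xC3\xA9sum\xC3\xA9, 2009;";   # raw UTF-8 bytes, not yet decoded
my $decoded = decode_utf8($raw);               # character string

# On the byte string, 0xC3 and 0xA9 do not match \w, so the accented letters vanish.
my $from_bytes = strip_punct($raw);

# On the decoded string, U+00E9 is a letter under Unicode rules, so it survives.
my $from_chars = strip_punct($decoded);

binmode STDOUT, ':encoding(UTF-8)';
print "without decode: $from_bytes\n";   # rsum 2009
print "with decode:    $from_chars\n";   # résumé 2009
```

        Note the output layer: once the data is decoded, it has to be encoded again on the way out, or perl will warn about wide characters or print Latin-1 bytes.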

        It is a complex topic, but the following documents are a good place to start:

        I hope this solves your problem. Keep us posted...

