
Re^3: keeping diacritical marks in a string

by FalseVinylShrub (Chaplain)
on Oct 09, 2009 at 06:06 UTC (id://800166)


in reply to Re^2: keeping diacritical marks in a string
in thread keeping diacritical marks in a string

Hmm

I can't see an obvious source of the problem. I think you need to dump the result of the request before any processing and find out exactly where the special characters are being lost: is the text coming out of LWP correctly, is it the regex, could it be the MARC:: module, and so on.
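One way to do that kind of dump is to print the string as hex code points at each stage, so that a terminal's own encoding cannot hide what is really there. This is only a sketch: `dump_string` is a hypothetical helper, not something from the thread, and the sample value stands in for the real response body.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): render a string as hex code
# points and report whether Perl's internal UTF-8 flag is set.
sub dump_string {
    my ($label, $str) = @_;
    my $hex  = join ' ', map { sprintf('%02X', ord($_)) } split //, $str;
    my $flag = utf8::is_utf8($str) ? 'on' : 'off';
    return "$label: [$hex] (utf8 flag $flag)";
}

# Stand-in for the raw response body: "caf" followed by e-acute as UTF-8 bytes.
my $HTML = "caf\xC3\xA9";
print dump_string('raw response', $HTML), "\n";
```

Calling the helper again after each processing step (the regex, the MARC:: handling, and so on) should show the step at which the bytes for the accented letters vanish.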

As graff said, it shouldn't be losing these characters, but there are a number of places where things can go wrong.

It's all a bit complicated, and I can't think of a good guide to it at the moment. On the other hand, I've never heard of Perl completely stripping special characters because of an encoding problem. Normally, a multi-byte UTF-8 character would be treated as 2 or 3 separate characters if the encoding is not set correctly. So I suspect an error in some code somewhere: could it be that something is validating input and stripping out characters it doesn't think are "safe"?
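The "2 or 3 characters" effect is easy to see in isolation. A minimal sketch, assuming the input really is UTF-8 (the sample bytes are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";                 # e-acute encoded as two UTF-8 bytes
my $chars = decode('UTF-8', $bytes);    # the same text as one character

# Undecoded, Perl sees two separate "characters"; decoded, it sees one.
printf "undecoded: %d characters\n", length($bytes);
printf "decoded:   %d character, U+%04X\n", length($chars), ord($chars);
```

So a wrong encoding setting on its own would normally garble or multiply characters rather than delete them.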

Sorry I can't be of more help. Try to narrow it down to where they disappear and it will be solved eventually.

Re^4: keeping diacritical marks in a string
by Foxpond Hollow (Sexton) on Oct 09, 2009 at 06:51 UTC
    Thank you! You were right. One thing I forgot to mention was that I was doing a substitution later to remove any punctuation characters:

    s/[^\w\s]//g


    I didn't realize that accented letters don't count as \w matches; I figured they were still alphanumeric. That's kind of annoying. I was using the substitution to normalize the string and remove things like commas and semicolons. Now I have to list all the characters I want to remove, instead of just specifying the ones I want to keep. Oh well, at least we've found the problem.

    Still, is there some way to strip the accent from an accented letter and keep the base letter? That would seem a better solution than listing everything that is not a letter, number, space, or accented letter.
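On that last question, one common approach (not proposed in the thread itself) is to decompose the text with Unicode::Normalize and then delete the combining marks. The sketch below assumes the input is UTF-8; `strip_accents` and the sample title are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split each accented letter into base letter plus
# combining mark (NFD), then remove the combining marks (\p{Mn}).
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

# Made-up sample input, written as UTF-8 bytes and decoded to characters.
my $title = decode('UTF-8', "Caf\xC3\xA9, cr\xC3\xA8me; na\xC3\xAFve");

my $plain = strip_accents($title);
$plain =~ s/[^\w\s]//g;    # the original punctuation-stripping substitution
print "$plain\n";          # prints "Cafe creme naive"
```

This only affects letters that have a canonical decomposition; letters such as "ø" or "ł" are single code points with no combining mark and pass through unchanged.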

      OK. That's good.

      Actually, I don't have a copy of Perl to hand to verify this, but I think you "should" be able to get that substitution to work as you intended.

      I think the fact that it doesn't work indicates that Perl is not treating the string as UTF-8. Basically, there is a flag stored in the string's internal data structure. If it's not set, the string is treated as a plain sequence of bytes, and character operations like uc() and your substitution will work in ASCII mode (actually, ISO-8859-1, I've just learned - see below).
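The difference is visible if the same substitution is run on the raw bytes and on a decoded copy. This is a sketch with a made-up sample string; it also assumes the script does not enable `use feature 'unicode_strings'` (or `use v5.12` and later), which changes how byte strings are matched.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The substitution from the thread, wrapped in a hypothetical helper.
sub strip_punct {
    my ($s) = @_;
    $s =~ s/[^\w\s]//g;
    return $s;
}

my $bytes = "caf\xC3\xA9, th\xC3\xA9;";     # UTF-8 bytes, not yet decoded
my $chars = decode('UTF-8', $bytes);        # character string

my $from_bytes = strip_punct($bytes);       # high bytes are not \w: accents lost
my $from_chars = strip_punct($chars);       # e-acute is \w: accents kept

print "from bytes: $from_bytes\n";
print "from chars: ", encode('UTF-8', $from_chars), "\n";
```

On the byte string, both bytes of each accented letter fail the \w test and are deleted along with the punctuation, which matches the symptom described above.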

      There are a number of ways to get the string to be treated as UTF-8, and I'm not sure which ones are "correct" in this situation. But try doing this after you get the $HTML and before you start doing operations on it:

      use Encode;
      # ...
      $HTML = decode_utf8($HTML);

      You can also use a more brute-force approach:

      utf8::upgrade($HTML);

      I think the decode method is preferred, but perhaps someone else will correct / confirm.
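For what it's worth, the two calls do different things, which a short comparison shows. This is a sketch on a made-up byte string, assuming the data is UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw = "caf\xC3\xA9";             # "caf" + e-acute as UTF-8 bytes

my $upgraded = $raw;
utf8::upgrade($upgraded);            # sets the flag; each byte stays its own character

my $decoded = decode('UTF-8', $raw); # interprets the bytes as UTF-8 text

printf "upgraded: length %d, last char U+%04X\n",
    length($upgraded), ord(substr($upgraded, -1));
printf "decoded:  length %d, last char U+%04X\n",
    length($decoded), ord(substr($decoded, -1));
```

utf8::upgrade only changes the internal representation, so the two bytes of the accented letter remain two characters (U+00C3 and U+00A9). Decoding is what turns them into the single character U+00E9.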

      It is a complex topic, but the following documents are a good place to start:

      I hope this solves your problem. Keep us posted...

      FVS

        Unfortunately, both suggested methods seem to change the accented characters into question marks, which (somewhat ironically) just raises more questions.
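One possible explanation, which the thread does not confirm, is that the page is not UTF-8 at all. If the bytes are ISO-8859-1, decoding them as UTF-8 produces the replacement character U+FFFD, which many terminals display as a question mark. A sketch under that assumption, with made-up sample bytes:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption for illustration: the word "ete" with two e-acutes,
# encoded as ISO-8859-1 (one byte per accented letter).
my $latin1_bytes = "\xE9t\xE9";

my $wrong = decode('UTF-8', $latin1_bytes);       # malformed input becomes U+FFFD
my $right = decode('ISO-8859-1', $latin1_bytes);  # yields U+00E9

printf "decoded as UTF-8:      first char U+%04X\n", ord(substr($wrong, 0, 1));
printf "decoded as ISO-8859-1: first char U+%04X\n", ord(substr($right, 0, 1));
```

If the hex dump of the raw response shows a single byte such as E9 where an accented letter should be, decoding with the charset the server actually declares would be the thing to try. Question marks can also come from printing decoded text to a filehandle with no encoding layer, so the output side is worth checking as well.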
