UTF-8 problems again

kepler has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I wrote a script that successfully downloads some texts from the web. However, some of the results look like "Mich%E8le+Mercier" where I expect "Michèle+Mercier", and others look like "MichÃ¨le" where I expect "Michèle". How can I decode/encode the text in Perl so that it appears correctly in an HTML page? (In this case it should appear as "Michèle Mercier" and "Michèle".)

Best regards,

Kepler

Re: UTF-8 problems again by haj (Vicar) on Feb 02, 2019 at 23:25 UTC
Hello Kepler,

this depends on how your script downloads texts from the web. If you are using LWP::UserAgent, then this module can take care of the UTF-8 issues: the response object you get from a successful request has a method `decoded_content`, which returns a string decoded according to the encoding advertised by the server.

If you don't use LWP::UserAgent's capability to decode the HTML content, then you need to do it yourself. Web pages tell you the encoding either in their HTTP headers or (especially in HTML 5) in the body. Check the Content-Type header or the meta element defining the charset, and if it is UTF-8, use either Encode or `utf8::decode` to decode it. Note that `utf8::decode` is available without loading any module, and that it modifies its argument in place.

A string like `"Mich%E8le+Mercier"` most probably occurs in a link: `%E8` is a URL-encoded 'è'. The user agent doesn't decode this for you; that's up to the parser you are using to analyze the URL. Note that `%E8` is not the UTF-8 encoding of 'è' (that would be `%C3%A8`); it is the ISO-8859-1 byte, which might be just fine if the link target expects ISO-8859-1 encoded links.
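
To make the two cases from the question concrete, here is a minimal sketch using only core Perl. The helper name `percent_decode` is made up for illustration, and the sketch assumes that the link really is ISO-8859-1 underneath and that the "MichÃ¨le" string is simply UTF-8 bytes that were never decoded; check what the server actually advertises before relying on either assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of any module): undo URL/form encoding.
# The result is a string of raw bytes, not yet decoded characters.
sub percent_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                               # '+' stands for a space
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %E8 -> byte 0xE8
    return $s;
}

# Case 1: percent-encoded link text, assumed to be ISO-8859-1 bytes.
my $bytes = percent_decode('Mich%E8le+Mercier');
my $name  = decode('ISO-8859-1', $bytes);        # now a character string

# Case 2: UTF-8 bytes that were never decoded. Viewed as Latin-1,
# the two bytes C3 A8 show up as the "Ã¨" from the question.
my $fixed = "Mich\xC3\xA8le";
utf8::decode($fixed)                             # in place, false on failure
    or die "input was not valid UTF-8\n";

# Encode once, on output.
binmode(STDOUT, ':encoding(UTF-8)');
print "$name\n$fixed\n";
```

The pattern is to decode bytes into characters as early as possible and to encode again only when printing. If the link turned out to be UTF-8 instead (`%C3%A8`), the same helper would apply with `decode('UTF-8', $bytes)`.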
Re: UTF-8 problems again by tangent (Parson) on Feb 03, 2019 at 01:25 UTC
"How can I decode/encode the text in Perl so that it appears correctly in an HTML page?"

I often find it is enough to use the HTML::Entities module, which will encode your text for display in an HTML page:

    use HTML::Entities;
    # decode the text first in case some of it is already encoded
    decode_entities($text);
    # encode the text
    encode_entities($text);
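
As a small illustration of why the decode step comes first, the following sketch runs the same two calls on a sample string. It assumes HTML::Entities is installed (it is a CPAN module, not core) and that the input is already a decoded character string; the sample text and variable names are made up.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# Sample input (illustrative): a character string in which the
# ampersand is already entity-encoded but the 'è' is not.
my $text = "Mich\x{E8}le &amp; Mercier";

# With a return value in use, both functions return a modified copy
# and leave their argument untouched.
my $plain = decode_entities($text);    # entities turned back into characters
my $html  = encode_entities($plain);   # unsafe and non-ASCII chars as entities

print "$html\n";                       # Mich&egrave;le &amp; Mercier
```

Without the decode step, the existing `&amp;` would be encoded a second time and become `&amp;amp;`. The output is pure ASCII, so it displays correctly whatever charset the page declares.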
