Hello Kepler,
this depends on how your script downloads texts from the web. If you are using LWP::UserAgent, then this module can take care of the UTF-8 issues: the response object you get from a successful request has a method decoded_content, which returns the body as a character string, decoded according to the encoding advertised by the server.
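A minimal sketch of that approach, assuming LWP::UserAgent is installed. The helper name text_from_response and the URL in the comment are made up for illustration:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: turn an HTTP::Response into a character string.
sub text_from_response {
    my ($response) = @_;
    die "Request failed: " . $response->status_line . "\n"
        unless $response->is_success;

    # decoded_content undoes any Content-Encoding and decodes the charset;
    # it returns undef if the decoding fails.
    my $text = $response->decoded_content;
    die "Could not decode the response body\n" unless defined $text;
    return $text;
}

# Typical use (placeholder URL, not run here):
#   my $ua   = LWP::UserAgent->new(timeout => 10);
#   my $text = text_from_response($ua->get('http://www.example.com/'));
```

The returned string holds characters, not bytes, so remember to set an output layer (for example binmode STDOUT, ':encoding(UTF-8)') before printing it.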
If you don't use LWP::UA's capability to decode the HTML content, then you need to do it yourself: web pages tell you their encoding either in the HTTP headers or (especially in HTML5) in the body. Check the Content-Type header or the meta element defining the charset, and if it is UTF-8, use either Encode or utf8::decode to decode the content. Note that utf8::decode is available without loading any module, and that it modifies its argument in place.
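Both variants side by side, as a small sketch. The byte string here is a hard-coded stand-in for a downloaded body that you already know to be UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for raw bytes fetched from the web: "Michèle" in UTF-8.
my $bytes = "Mich\xC3\xA8le";

# Encode::decode returns a new character string.
my $chars = decode('UTF-8', $bytes);

# utf8::decode converts its argument in place and returns false
# if the bytes are not well-formed UTF-8.
my $copy = $bytes;
utf8::decode($copy) or die "Not valid UTF-8\n";

printf "%d bytes, %d characters\n", length($bytes), length($chars);
# prints "8 bytes, 7 characters"
```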
A string like "Mich%E8le+Mercier" most probably occurs in a link: '%E8' is a URL-encoded 'è'. The user agent doesn't decode this for you; that's up to the parser you are using to analyze the URL. Note that %E8 is the ISO-8859-1 byte for 'è', not its UTF-8 encoding (which would be %C3%A8). That might be just fine if the link target expects ISO-8859-1 encoded links.
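If you need the readable name from such a link, something along these lines should work. This is only a sketch: it assumes URI::Escape is installed and that the link really is ISO-8859-1 encoded, and the helper name decode_latin1_component is made up:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(decode);

# Hypothetical helper: decode one form-encoded, ISO-8859-1 URL component.
sub decode_latin1_component {
    my ($component) = @_;
    $component =~ tr/+/ /;                   # '+' stands for a space in query strings
    my $octets = uri_unescape($component);   # %E8 becomes the single byte 0xE8
    return decode('ISO-8859-1', $octets);    # bytes to characters
}

my $name = decode_latin1_component('Mich%E8le+Mercier');

binmode STDOUT, ':encoding(UTF-8)';
print "$name\n";
```

For a link that is UTF-8 encoded instead (%C3%A8), pass 'UTF-8' to decode rather than 'ISO-8859-1'.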
How can I decode/code the text in Perl in order to correctly appear in an html page?
I often find it is enough to use the HTML::Entities module, which will encode your text for display in an HTML page:
use HTML::Entities;
# decode the text first in case some of it is already encoded
decode_entities($text);
# encode the text
encode_entities($text);