http://www.perlmonks.org?node_id=1229299

kepler has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I made a script to download some texts from the web (with success). However, I have some results of the type "Mich%E8le+Mercier" that should be "Michèle+Mercier" and others like "Michèle" that should appear like "Michèle". How can I decode/code the text in Perl in order to correctly appear in an html page? (in this case should appear "Michèle Mercier" and "Michèle").

Best regards,

Kepler

Re: UTF-8 problems again
by haj (Vicar) on Feb 02, 2019 at 23:25 UTC

    Hello Kepler,

    this depends on how your script downloads texts from the web. If you are using LWP::UserAgent, then this module can take care of the UTF-8 issues: the response object you get from a successful request by your user agent has a method decoded_content, which delivers strings decoded according to the text's encoding as advertised by the server.
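A minimal sketch of the difference between content and decoded_content. To keep it runnable without a network connection, it builds an HTTP::Response by hand (the same class LWP::UserAgent returns from a request); the status, header, and body used here are made-up sample values, not taken from Kepler's script.

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in for the response a user agent would hand back for a UTF-8 page.
# The body holds raw bytes: "è" is the two bytes C3 A8 in UTF-8.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<p>Mich\xC3\xA8le</p>",
);

my $raw  = $res->content;          # undecoded bytes, as sent by the server
my $text = $res->decoded_content;  # characters, decoded per the advertised charset

binmode(STDOUT, ':encoding(UTF-8)');
print length($raw), " bytes, ", length($text), " characters\n";
print "$text\n";
```

With a real request, the same two methods would be called on the object returned by the user agent's get method.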

    If you don't use LWP::UA's capability to decode the HTML content, then you need to do it yourself: Web pages inform you, either in their HTTP headers, or (especially in HTML 5) in the body, about the encoding. Check for the Content-Type header or the meta-element defining the charset, and if it is UTF-8, then use either Encode or utf8::decode to decode it. Note that utf8::decode is available without using any module, and that it modifies its parameter in-place.
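The two manual options mentioned above could look roughly like this. It is only a sketch: the byte string is a made-up sample, and it assumes you have already established that the page is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they would arrive from a UTF-8 page: "è" is the two bytes C3 A8.
my $bytes = "Mich\xC3\xA8le";

# Option 1: Encode::decode returns a new character string.
my $chars = decode('UTF-8', $bytes);

# Option 2: utf8::decode works on its argument in place, so copy first
# if the original bytes are still needed. It returns false on malformed input.
my $in_place = $bytes;
utf8::decode($in_place) or die "input is not well-formed UTF-8";

binmode(STDOUT, ':encoding(UTF-8)');
print length($bytes), " bytes, ", length($chars), " characters\n";  # 8 bytes, 7 characters
print "$chars\n";
```

Setting an encoding layer on the output handle, as done here with binmode, matters just as much as decoding the input: without it the decoded string may be written out in the wrong encoding.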

    A string like "Mich%E8le+Mercier" most probably occurs in a link: '%E8' is a URL-encoded 'è'. The user agent doesn't decode this for you; that's up to the parser you are using to analyze the URL. Note that %E8 is not the UTF-8 encoding of 'è' (that would be %C3%A8) but the ISO-8859-1 one, which might be just fine if the link target expects ISO-8859-1 encoded links.
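As an illustration, here is one way to turn such a string into readable text using only core modules. The helper name url_decode is made up for this sketch, and treating '+' as a space assumes the string comes from a query string or form value. The URI::Escape module offers uri_unescape for the percent-decoding step if you prefer not to write the regex yourself.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo the URL-encoding, then decode the resulting
# bytes using the charset the link was written in.
sub url_decode {
    my ($escaped, $charset) = @_;
    (my $bytes = $escaped) =~ tr/+/ /;               # '+' stands for a space
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX becomes one byte
    return decode($charset, $bytes);
}

my $latin1 = url_decode('Mich%E8le+Mercier', 'ISO-8859-1');
my $utf8   = url_decode('Mich%C3%A8le+Mercier', 'UTF-8');

binmode(STDOUT, ':encoding(UTF-8)');
print "$latin1\n";   # prints: Michèle Mercier
print "same text either way\n" if $latin1 eq $utf8;
```

The charset has to be known or guessed from context; nothing in the escaped string itself says whether it is ISO-8859-1 or UTF-8.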

Re: UTF-8 problems again
by tangent (Parson) on Feb 03, 2019 at 01:25 UTC
    How can I decode/code the text in Perl in order to correctly appear in an html page?
    I often find it is enough to use the HTML::Entities module, which will encode your text for display in an HTML page:
        use HTML::Entities;

        # decode the text first in case some of it is already encoded
        decode_entities($text);

        # encode the text
        encode_entities($text);