UTF-8 problems again

by kepler (Scribe)
on Feb 02, 2019 at 19:39 UTC

kepler has asked for the wisdom of the Perl Monks concerning the following question:


I wrote a script that successfully downloads some texts from the web. However, some results look like "Mich%E8le+Mercier" when they should be "Michèle+Mercier", and others look like "Michèle" when they should appear as "Michèle". How can I decode/encode the text in Perl so that it appears correctly in an HTML page? (In this case it should appear as "Michèle Mercier" and "Michèle".)

Re: UTF-8 problems again
on Feb 02, 2019 at 23:25 UTC

    Hello Kepler,

    this depends on how your script downloads texts from the web. If you are using LWP::UserAgent, the module can take care of the UTF-8 issues for you: the response object you get from a successful request has a method decoded_content, which returns the body as a string decoded according to the text's encoding as advertised by the server.
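
    A minimal sketch of that approach, assuming LWP::UserAgent is installed. The helper names fetch_text and text_of and the 10-second timeout are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: turn a response object into a character string.
# decoded_content undoes any Content-Encoding and then decodes the bytes
# using the charset the server advertised.
sub text_of {
    my ($response) = @_;
    my $text = $response->decoded_content;
    die "Could not decode response body\n" unless defined $text;
    return $text;
}

# Hypothetical helper: fetch a URL and return its decoded text.
sub fetch_text {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(timeout => 10);   # timeout is an arbitrary choice
    my $response = $ua->get($url);
    die "Request failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return text_of($response);
}
```

    Calling fetch_text with the URL of the page you are downloading should then give you a string of characters rather than raw bytes, provided the server reports the charset correctly.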

    If you don't use LWP::UA's capability to decode the HTML content, then you need to do it yourself: Web pages inform you, either in their HTTP headers, or (especially in HTML 5) in the body, about the encoding. Check for the Content-Type header or the meta-element defining the charset, and if it is UTF-8, then use either Encode or utf8::decode to decode it. Note that utf8::decode is available without using any module, and that it modifies its parameter in-place.
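
    Both options can be sketched as follows, using only core modules. The byte string is a stand-in for whatever your script downloaded, and the sketch assumes you have already established that the page is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for downloaded bytes: "Michèle" in UTF-8, where è is C3 A8.
my $bytes = "Mich\xC3\xA8le";

# Option 1: Encode::decode returns a new character string.
# FB_CROAK makes it die on malformed input; LEAVE_SRC keeps $bytes intact.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Option 2: utf8::decode works in place and returns false on malformed input.
my $in_place = $bytes;
utf8::decode($in_place) or die "Input was not valid UTF-8\n";

printf "%d bytes became %d characters\n", length $bytes, length $chars;
# prints "8 bytes became 7 characters"
```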

    A string like "Mich%E8le+Mercier" most probably occurs in a link: '%E8' is a URL-encoded 'è'. The user agent doesn't decode this for you; that's up to the parser you are using to analyze the URL. Note that %E8 is not the UTF-8 encoding for 'è' (in UTF-8 it would be %C3%A8) but the ISO-8859-1 one, which might be just fine if the link target expects ISO-8859-1 encoded links.
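
    As a rough illustration, here is a hand-rolled decoder for such a value using only core modules. The sub name decode_form_value is invented for this example, and the ISO-8859-1 default is an assumption based on the %E8 above. URI::Escape from CPAN offers uri_unescape for the percent-decoding step if you would rather not write the regex yourself.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo form/URL encoding, then turn bytes into characters.
sub decode_form_value {
    my ($value, $charset) = @_;
    $charset //= 'ISO-8859-1';                      # assumed default, see above
    $value =~ tr/+/ /;                              # '+' stands for a space
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %E8 becomes the byte 0xE8
    return decode($charset, $value);                # bytes to characters
}

my $name = decode_form_value('Mich%E8le+Mercier');

binmode STDOUT, ':encoding(UTF-8)';                 # encode characters on output
print "$name\n";                                    # prints "Michèle Mercier"
```

    If the link turns out to use UTF-8 escapes instead, pass 'UTF-8' as the second argument.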

Re: UTF-8 problems again
on Feb 03, 2019 at 01:25 UTC
    How can I decode/code the text in Perl in order to correctly appear in an html page?
    I often find it is enough to use the HTML::Entities module, which will encode your text for display in an HTML page:
    use HTML::Entities;
    # decode the text first in case some of it is already encoded
    decode_entities($text);
    # encode the text
    encode_entities($text);

