Percent encoding of URIs with UTF-8 characters

by reisinge (Friar)
on Mar 21, 2013 at 13:02 UTC (#1024748=perlquestion)
reisinge has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a list of UTF-8 encoded names in a file:

$ file -i names.txt
names.txt: text/plain charset=utf-8
$ cat names.txt
Ján Slota
Peter Kažimír
Alojz Hlina
František Mikloško
Ján Počiatek

I want to check whether these names are associated with any Slovak companies by running this script against the Business register (there is no API, AFAIK). The problem is that I don't get the expected results for all names: the script works only for the third name (line) in the file. I suspect the URI encoding done by URI::Encode (lines 26 and 27 in the script). For example, for the second name in the file I get: +ENO=Peter&SID=0&T=f0&R=on
whereas the portal expects (I got this by filling in the form on the portal): +f0&R=on
I have read that I shouldn't even need URI::Encode most of the time. I tried without it, and with URI::Escape instead, without success. Can you show me the way? Thanks.

Excellence is an art won by training and habituation: we do not act rightly because we have virtue or excellence, but we rather have these because we have acted rightly. -- Will Durant

Replies are listed 'Best First'.
Re: Percent encoding of URIs with UTF-8 characters
by daxim (Chaplain) on Mar 21, 2013 at 13:27 UTC
    The URI module does not help you because the portal is not conforming to the standard RFC 3986, which clearly states to use UTF-8 encoding, not Windows-1250. So let's piece together the URI manually.

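    use utf8;                          # the name literals below are UTF-8
    use URI::Escape qw(uri_escape);
    use Encode qw(encode);
    for my $name (
        'Ján Slota',
        'Peter Kažimír',
        'Alojz Hlina',
        'František Mikloško',
        'Ján Počiatek',
    ) {
        my ($pr, $meno) = split ' ', $name;
        # Prints the two escaped values; substitute the portal's query URL
        # as the format string, with one %s per value.
        printf "%s %s\n",
            map { uri_escape encode('Windows-1250', $_) } $meno, $pr;
    }

    edit: Windows-1250, not -1252. choroba++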

      Just a small note. In the above example, the strings come from the source code, so the "encode" function is used to convert them to CP1250. In the OP's script, the strings come from an external file and are already "octet sequences", so instead of "encode" one should use "from_to". Note that from_to converts its argument in place and returns the length of the result, not the string, so it must be called before uri_escape rather than nested inside it:

      Encode::from_to($_, "UTF-8", "CP1250");
      uri_escape($_);

      Of course, an alternative would be to specify "encoding" for the file, but the current version of the script does not do it.
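
The "encoding for the file" alternative could look roughly like the sketch below. It assumes the names.txt file from the question; the helper name escaped_names is made up for illustration and is not part of the OP's script. The :encoding(UTF-8) layer decodes the file into character strings while reading, so plain "encode" works again and "from_to" is not needed.

```perl
use strict;
use warnings;
use Encode qw(encode);
use URI::Escape qw(uri_escape);

# Hypothetical helper: read decoded lines from a filehandle and return,
# per line, an array ref of the words re-encoded as cp1250 and
# percent-escaped.
sub escaped_names {
    my ($fh) = @_;
    my @result;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;      # skip empty lines
        push @result,
            [ map { uri_escape(encode('cp1250', $_)) } split ' ', $line ];
    }
    return @result;
}

# The layer turns the UTF-8 octets in the file into Perl characters.
if (open my $fh, '<:encoding(UTF-8)', 'names.txt') {
    print join(' ', @$_), "\n" for escaped_names($fh);
    close $fh;
}
else {
    warn "Cannot open names.txt: $!\n";
}
```

The escaped words still have to be put into the portal's query URL, which is not shown here.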

      Hi daxim,

      I just stumbled on the following. You said: not conforming to the standard RFC 3986 which clearly states to use UTF-8 encoding...

      I'm a bit surprised about the "is not conforming" part. Can you direct me to the relevant part of the RFC?

      Best regards

Re: Percent encoding of URIs with UTF-8 characters
by choroba (Chancellor) on Mar 21, 2013 at 13:28 UTC
    The portal does not use UTF-8, it uses cp1250.
    $ perl -E 'print "Ka\x9eim\xedr"' | iconv -f cp1250 -t utf8
    Kažimír
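
To see why the charset matters for the query string, here is a small sketch that percent-encodes the same name both ways. The helper name escape_as is made up for illustration; the byte values for cp1250 match the ones in the one-liner above.

```perl
use strict;
use warnings;
use utf8;                          # the string literal below is UTF-8 in this source
use Encode qw(encode);
use URI::Escape qw(uri_escape);

# Hypothetical helper: encode $text in $charset, then percent-escape the bytes.
sub escape_as {
    my ($charset, $text) = @_;
    return uri_escape(encode($charset, $text));
}

my $name      = 'Kažimír';
my $as_utf8   = escape_as('UTF-8',  $name);   # two bytes each for ž and í
my $as_cp1250 = escape_as('cp1250', $name);   # one byte each for ž and í

print "UTF-8:  $as_utf8\n";        # Ka%C5%BEim%C3%ADr
print "cp1250: $as_cp1250\n";      # Ka%9Eim%EDr
```

A portal that decodes the query as cp1250 will only match the second form, which would also explain why the pure-ASCII name in the file works either way.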
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ