PerlMonks  

Percent encoding of URIs with UTF-8 characters

by j0se (Pilgrim)
on Mar 21, 2013 at 13:02 UTC
j0se has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a list of UTF-8 encoded names in a file:

$ file -i names.txt
names.txt: text/plain charset=utf-8
$ cat names.txt
Ján Slota
Peter Kažimír
Alojz Hlina
František Mikloško
Ján Počiatek

I want to check whether these names are associated with any Slovak companies by running this script against the Business Register (there is no API, AFAIK). The problem is that I'm not getting the expected results for all names: the script works only for the third name (line) in the file. I guess the cause is the URI encoding done by URI::Encode (lines 26 and 27 in the script). For example, for the second name in the file I get:

http://www.orsr.sk/hladaj_osoba.asp?PR=Ka%C3%85%C2%BEim%C3%83%C2%ADr&MENO=Peter&SID=0&T=f0&R=on
and the portal is expecting (I get this by filling in the form on the portal):
http://www.orsr.sk/hladaj_osoba.asp?PR=Ka%9Eim%EDr&MENO=Peter&SID=0&T=f0&R=on
I read I shouldn't even need to use URI::Encode most of the time. I have tried without it and with URI::Escape - without success. Can you show me the way? Thanks.

Excellence is an art won by training and habituation: we do not act rightly because we have virtue or excellence, but we rather have these because we have acted rightly. -- Will Durant

Re: Percent encoding of URIs with UTF-8 characters
by daxim (Chaplain) on Mar 21, 2013 at 13:27 UTC
    The URI module does not help you because orsr.sk is not conforming to the standard RFC 3986 which clearly states to use UTF-8 encoding, not Windows-1250. So let's piece together the URI manually.

    use utf8;
    use URI::Escape qw(uri_escape);
    use Encode qw(encode);
    
    for my $name (
        'Ján Slota',
        'Peter Kažimír',
        'Alojz Hlina',
        'František Mikloško',
        'Ján Počiatek',
    ) {
        # Each entry is "given-name surname": MENO is the given name, PR the surname.
        my ($meno, $pr) = split ' ', $name;
        printf "http://www.orsr.sk/hladaj_osoba.asp?PR=%s&MENO=%s&SID=0&R=on\n",
            map { uri_escape encode('Windows-1250', $_) } $pr, $meno;
    }
    
    __END__
    http://www.orsr.sk/hladaj_osoba.asp?PR=Slota&MENO=J%E1n&SID=0&R=on
    http://www.orsr.sk/hladaj_osoba.asp?PR=Ka%9Eim%EDr&MENO=Peter&SID=0&R=on
    http://www.orsr.sk/hladaj_osoba.asp?PR=Hlina&MENO=Alojz&SID=0&R=on
    http://www.orsr.sk/hladaj_osoba.asp?PR=Miklo%9Ako&MENO=Franti%9Aek&SID=0&R=on
    http://www.orsr.sk/hladaj_osoba.asp?PR=Po%E8iatek&MENO=J%E1n&SID=0&R=on
    
    edit: Windows-1250, not -1252. choroba++

      Just a small note. In the above example the strings come from the source code, so the "encode" function is used to convert them to CP1250. In the OP's script the strings come from an external file and are already octet sequences, so instead of "encode" one should use "from_to". Note that "from_to" converts its argument in place and returns the length of the result, not the converted string, so it cannot be passed straight to uri_escape:

      Encode::from_to($_, "UTF-8", "CP1250");   # converts $_ in place
      uri_escape($_);
      Of course, an alternative would be to specify an "encoding" layer for the file, but the current version of the script does not do that.
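
The "encoding for the file" alternative mentioned above could look roughly like the sketch below. It is only an illustration: the helper name orsr_url is made up here, the T=f0 parameter is copied from the URL in the question, and names.txt is assumed to hold one "given-name surname" pair per line. Because the input layer decodes UTF-8 on read, plain "encode" is enough and "from_to" is not needed.

```perl
use strict;
use warnings;
use Encode qw(encode);
use URI::Escape qw(uri_escape);

# Hypothetical helper: build the query URL for one "given-name surname"
# line that is already a decoded character string (not raw UTF-8 octets).
sub orsr_url {
    my ($line) = @_;
    my ($meno, $pr) = split ' ', $line;    # given name, surname
    # Convert to cp1250 octets first, then percent-encode those octets.
    my ($pr_q, $meno_q) = map { uri_escape(encode('cp1250', $_)) } $pr, $meno;
    return "http://www.orsr.sk/hladaj_osoba.asp?PR=$pr_q&MENO=$meno_q&SID=0&T=f0&R=on";
}

# The :encoding(UTF-8) layer decodes each line while reading.
# If names.txt is missing, the loop is simply skipped.
if (open my $fh, '<:encoding(UTF-8)', 'names.txt') {
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;         # skip blank lines
        print orsr_url($line), "\n";
    }
    close $fh;
}
```

The split assumes exactly two whitespace-separated parts per name; names with more parts would need different handling.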

      Hi daxim,

      just stumbled on the following. You said:

      ...is not conforming to the standard RFC 3986 which clearly states to use UTF-8 encoding...

      I'm a bit surprised about the "is not conforming" part. Can you direct me to the relevant part of the RFC?

      Best regards
      McA

Re: Percent encoding of URIs with UTF-8 characters
by choroba (Abbot) on Mar 21, 2013 at 13:28 UTC
    The portal does not use UTF-8, it uses cp1250.
    $ perl -E 'print "Ka\x9eim\xedr"' | iconv -f cp1250 -t utf8
    Kažimír
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
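
The same check can be done inside Perl with Encode instead of iconv. The sketch below also reproduces the doubled escapes from the question; treating the UTF-8 octets as Latin-1 characters and encoding them to UTF-8 a second time is an assumption about what happened in the OP's script, not something confirmed in the thread, and the two helper names are invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use URI::Escape qw(uri_escape);

# Hypothetical helper: interpret the portal's octets as cp1250.
sub portal_name {
    my ($octets) = @_;
    return decode('cp1250', $octets);
}

# Hypothetical helper: simulate double encoding. The character string is
# encoded to UTF-8, those octets are misread as Latin-1 characters, and
# the result is encoded to UTF-8 again before percent-escaping.
sub double_encoded {
    my ($chars) = @_;
    my $octets  = encode('UTF-8', $chars);
    my $misread = decode('ISO-8859-1', $octets);
    return uri_escape(encode('UTF-8', $misread));
}

my $name = portal_name("Ka\x9eim\xedr");
print encode('UTF-8', $name), "\n";     # the surname as UTF-8 octets
print double_encoded($name), "\n";      # compare with PR= in the question
```

If the second line matches the PR= value the script produced, the input was most likely still raw UTF-8 octets when it reached the escaping step.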
