Re^4: Is there some universal Unicode+UTF8 switch?

thread drift is allowed by daxim (Curate) on Sep 02, 2019 at 12:19 UTC
> if it's ok to continue in the same thread then I will continue here

Thread drift is allowed. For good netiquette, also change the title in the reply form.
Proper Unicode handling in Perl by VK (Novice) on Sep 02, 2019 at 14:00 UTC
> Thread drift is allowed. For good netiquette, also change the title in the reply form.

OK then. First of all, I am not a staff developer of Wikipedia, just one of the volunteer editors. We needed a script for a set of users who want to be notified about upcoming internal elections. It acts like a daemon: every 24 hours it checks a certain place and sends a notification if there is something new. tools.wmflabs.org offers the language of your choice (Perl, PHP, Python, C#, you name it) in the latest stable versions. I don't like Python, have no idea about C#, and remember something about Perl, so I went with Perl.

This is to make it clear that the list=allusers query has nothing to do with the actual task. It is only there to show the exact data format to query and to expect. The full MediaWiki API help is here: https://ru.wikipedia.org/w/api.php?action=help&uselang=en

Now... The script has to be able to handle Unicode/UTF-8 literals in the code, so I needed `use utf8;`. It also has to output them in HTML, so I needed `binmode STDOUT, ':utf8';`. It also has to receive JSON, decode it, slice it, and do string comparison, replacement and everything else, all with Cyrillic in the data.
I dropped all the (en|de)coding calls that this thread called unnecessary and came to:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Request::Common;
    use HTTP::Cookies;
    use JSON;

    my $browser = LWP::UserAgent->new;

    # they ask to use a descriptive user-agent - not LWP defaults
    # w:ru:User:Bot_of_the_Seven = https://ru.wikipedia.org/wiki/Участник:Bot_of_the_Seven
    $browser->agent('w:ru:User:Bot_of_the_Seven (LWP like Gecko) We come in peace');

    # I need cookies exchange enabled for auth
    # it doesn't matter here, but to give the full LWP picture:
    $browser->cookie_jar({});

    # very few queries can be done by GET - most of MediaWiki requires POST
    # so I do POST all around rather than remember where GET is allowed or not:
    my $response = $browser->request(POST 'https://ru.wikipedia.org/w/api.php',
        {
            'format'        => 'json',
            'formatversion' => 2,
            'errorformat'   => 'bc',
            'action'        => 'query',
            'list'          => 'allusers',
            'auactiveusers' => 1,
            'aulimit'       => 10,
            'aufrom'        => 'Б'
        }
    );

    my $data = decode_json($response->content);

    my $test_scalar = $data->{query}->{allusers}[0]->{name};
    my @test_array  = @{$data->{query}->{allusers}}[0..2];

    display_html($test_array[1]->{name});

    sub display_html {
        my @html = (
            '<!DOCTYPE html>',
            '<html>',
            '<head>',
            '<meta charset="UTF-8">',
            '<title>Мой тест</title>',
            '</head>',
            '<body>',
            shift // 'Статус — ОК', # soft OR: 0 and empty string accepted
            '</body>',
            '</html>'
        );

        # to avoid "wide character" warnings:
        binmode STDOUT, ':utf8';

        print "Content-Type: text/html; charset=utf-8\n\n";
        print join("\n", @html);
    }

Is there anything that might go badly wrong concerning Cyrillic in Unicode/UTF-8?
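For readers following along, the byte/character boundary that the script relies on can be checked offline. This is only a minimal sketch under a few assumptions: the API body arrives as UTF-8 encoded bytes (`$response->content` returns the raw, undecoded body), the core module JSON::PP stands in for JSON, and the payload with the name Борис is made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # string literals below are UTF-8 in the source
use Encode qw(encode);
use JSON::PP qw(decode_json);   # core module; assumption: stands in for JSON

# Simulate the raw HTTP body: UTF-8 bytes, not characters.
# The payload is invented for illustration only.
my $bytes = encode('UTF-8', '{"query":{"allusers":[{"name":"Борис"}]}}');

# decode_json expects UTF-8 bytes and returns Perl character strings.
my $data = decode_json($bytes);
my $name = $data->{query}{allusers}[0]{name};

# Encode on the way out, once, at the output boundary.
binmode STDOUT, ':encoding(UTF-8)';
print length($name), "\n";                          # counts characters, not bytes
print $name eq 'Борис' ? "match\n" : "no match\n";  # compares with a utf8 literal
```

If the decoded name compares equal to a literal written under `use utf8;`, then string comparison, slicing and replacement all operate on characters, which is what the script needs. For input that is already a character string rather than bytes, `JSON::PP->new->decode` (without the `utf8` option) would be the counterpart.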
Re: Proper Unicode handling in Perl by haj (Vicar) on Sep 02, 2019 at 15:13 UTC
Nice progress! You don't even need the Encode module :)

This is a pretty straightforward way to deal with Unicode and UTF-8. The remaining mentions of UTF-8 in your code are all justified:

- `use utf8;` tells Perl that your source code comes with UTF-8 encoded literals.
- `binmode STDOUT, ':utf8';` makes Perl spit out the strings in `@html` properly UTF-8 encoded. You can encode any Unicode character in UTF-8, so no problems here.
- `Content-Type: text/html; charset=utf-8` tells the browser that it has to handle the byte stream as UTF-8 and decode the characters accordingly.

There are two caveats:

- Obviously, you need to save your source code UTF-8 encoded.
- You must check whether the JSON data might, in some circumstances, contain characters which have a special meaning in HTML, in particular `<` and `&`. This has nothing to do with Unicode, though.

I'm adding the relevant stuff to your `sub display_html`:

    sub display_html {
        use HTML::Entities;
        my $html_encoded = encode_entities(shift, '<>&"');
        my @html = (
            '<!DOCTYPE html>',
            '<html>',
            '<head>',
            '<meta charset="UTF-8">',
            '<title>Мой тест</title>',
            '</head>',
            '<body>',
            $html_encoded // 'Статус — ОК', # soft OR: 0 and empty string accepted
            '</body>',
            '</html>'
        );

        # to avoid "wide character" warnings:
        binmode STDOUT, ':utf8';

        print "Content-Type: text/html; charset=utf-8\n\n";
        print join("\n", @html);
    }
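As a small standalone check of the escaping step, here is a hedged sketch. The helper name `body_text`, the sample strings and the fallback text are hypothetical and not part of the posted script. It applies the fallback before escaping, so it does not depend on how `encode_entities` treats an undefined argument, and it restricts the escaped set to `<>&"` so that Cyrillic passes through untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use HTML::Entities qw(encode_entities);

# Hypothetical helper: choose the fallback first, then escape only the
# characters that are special in HTML. 0 and '' are kept as real values.
sub body_text {
    my ($text) = @_;
    $text //= 'Статус: ОК';              # fallback text, adapted for this sketch
    return encode_entities($text, '<>&"');
}

binmode STDOUT, ':encoding(UTF-8)';
print body_text('Иван <b>&</b> "Ко"'), "\n";   # markup characters become entities
print body_text(undef), "\n";                  # fallback is used
print body_text(0), "\n";                      # 0 is kept, not replaced
```

The second argument matters here: called without it, `encode_entities` also converts non-ASCII characters to entities, which is harmless in a browser but makes the HTML source harder to read.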
Re^2: Proper Unicode handling in Perl by Aldebaran (Curate) on Sep 04, 2019 at 20:14 UTC
Re^3: Proper Unicode handling in Perl by haj (Vicar) on Sep 05, 2019 at 02:23 UTC
Re^3: Proper Unicode handling in Perl by VK (Novice) on Sep 06, 2019 at 15:01 UTC


	PerlMonks