PerlMonks  

Re^4: Is there some universal Unicode+UTF8 switch?

by VK (Novice)
on Sep 02, 2019 at 11:21 UTC (#11105429)


in reply to Re^3: Is there some universal Unicode+UTF8 switch?
in thread Is there some universal Unicode+UTF8 switch?

> I am convinced that you already know "enough to be dangerous", but not enough to arrive at the correctly modelled solution that most other Perl programmers would implement.
I have to agree with that. The last time I programmed something extensively, entirely by myself, and in Perl was one week before the capitulation in the Browser War, November 1998. I am actually surprised that I was able to write a working program in 2 days, 20 years later. It is amazing how much stuff can be kept at the back of the mind... I'm fully fluent in JavaScript, though.
I'll break the code into minimal test cases to check all the advice and corrections given.
This is rather off-topic for the initial question "Is there some universal Unicode+UTF8 switch?", but if it's OK to continue in the same thread then I will continue here.

thread drift is allowed
by daxim (Curate) on Sep 02, 2019 at 12:19 UTC
    if it's ok to continue in the same thread then I will continue here
    Thread drift is allowed. For good netiquette, also change the title in the reply form.

      >Thread drift is allowed. For good netiquette, also change the title in the reply form.
      OK then. First of all, I am not a staff developer of Wikipedia, just one of the volunteer editors. We needed a script for a set of users who want to get notifications about upcoming internal elections. It acts like a daemon: every 24 hours it checks a certain place and sends a notification if there is something new.
      tools.wmflabs.org gives you anything of your choice (Perl, PHP, Python, C#, you name it) in the latest stable versions. I don't like Python, have no idea about C#, and remember something about Perl, so I went with Perl.

      This is to make it clear that the list=allusers query has nothing to do with the actual task. It is only to show the exact data format to query and to expect. The full MediaWiki API help is here: https://ru.wikipedia.org/w/api.php?action=help&uselang=en

      Now... The script has to be able to handle Unicode/UTF-8/whatever literals in the code, so I needed use utf8; It also has to output them in HTML, so I needed binmode STDOUT, ':utf8';
      It also has to receive JSON, decode it, slice it, do string comparison/replacement and all the other things, all with Cyrillic in them. I dropped all the (en|de)coding calls that were called unnecessary in this thread, and so came to:

      #!/usr/bin/perl
      
      use strict;
      use warnings;
      
      use utf8;
      use Encode;
      
      use LWP::UserAgent;
      use HTTP::Request::Common;
      use HTTP::Cookies;
      
      use JSON;
      
      my $browser = LWP::UserAgent->new;
      
      # they ask for a descriptive user-agent, not the LWP default
      # w:ru:User:Bot_of_the_Seven = https://ru.wikipedia.org/wiki/Участник:Bot_of_the_Seven
      $browser->agent('w:ru:User:Bot_of_the_Seven (LWP like Gecko) We come in peace');
      
      # I need cookie exchange enabled for auth
      # it doesn't matter here, but it is shown to give the full LWP picture:
      $browser->cookie_jar({});
      
      # only very few queries can be done by GET; most of MediaWiki requires POST,
      # so I POST everywhere rather than remember where GET is allowed or not:
      my $response = $browser->request(POST 'https://ru.wikipedia.org/w/api.php',
              {
                  'format' => 'json',
                  'formatversion' => 2,
                  'errorformat' => 'bc',
                      
                  'action' => 'query',
                  'list' => 'allusers',
                  'auactiveusers' => 1,
                  'aulimit' => 10,
                  'aufrom' => 'Б'
              }
          );
      
      my $data = decode_json($response->content);
      
      my $test_scalar = $data->{query}->{allusers}[0]->{name};
      
      my @test_array = @{$data->{query}->{allusers}}[0..2];
      
      display_html($test_array[1]->{name});
      
      
      sub display_html {
      
          my @html = (
              '<!DOCTYPE html>',
              '<html>',
              '<head>',
              '<meta charset="UTF-8">',
              '<title>Мой тест</title>',
              '</head>',
              '<body>',
              shift // 'Статус  ОК', # soft OR: 0 and empty string accepted
              '</body>',
              '</html>'
          );
          
          # to avoid "wide character" warnings:
          binmode STDOUT, ':utf8';
          
          print "Content-Type: text/html; charset=utf-8\n\n";
          
          print join("\n", @html);
      }
      

      Is there anything that might go badly wrong concerning Cyrillic in Unicode/UTF-8?

        Nice progress! You don't even need the Encode module :)

        This is a pretty straightforward way to deal with Unicode and UTF-8.

        The remaining mentions of UTF-8 in your code are all justified:

        • use utf8; tells Perl that your source code comes with UTF-8 encoded literals.
        • binmode STDOUT, ':utf8'; makes Perl spit out the strings in @html properly UTF-8 encoded. You can encode any Unicode character in UTF-8, so no problems here.
        • Content-Type: text/html; charset=utf-8 tells the browser that it has to handle the byte stream as UTF-8 and decode the characters accordingly.
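
        To make the distinction between characters and UTF-8 bytes in these three points concrete, here is a minimal sketch. It assumes only core modules (Encode and JSON::PP, which offers the same decode_json interface as JSON); the sample JSON string and the variable names are made up for illustration and are not taken from the MediaWiki API.

```perl
use strict;
use warnings;
use utf8;                       # literals below are UTF-8 in the source file
use Encode qw(encode);
use JSON::PP qw(decode_json);   # core module, used here instead of JSON

# With "use utf8", a Cyrillic literal is one character, not two bytes.
my $char_string = 'Б';
my $char_count  = length $char_string;

# On the wire (and in a file), UTF-8 needs two bytes for that character.
my $byte_count = length encode('UTF-8', $char_string);

# decode_json expects UTF-8 bytes and returns character strings.
my $json_bytes = encode('UTF-8', '{"name":"Борис"}');   # hypothetical reply
my $name       = decode_json($json_bytes)->{name};
my $name_chars = length $name;

# An output layer turns characters back into UTF-8 bytes when printing.
# An in-memory filehandle stands in for STDOUT so the bytes can be counted.
my $buffer = '';
open my $out, '>:utf8', \$buffer or die "Cannot open in-memory handle: $!";
print {$out} $name;
close $out;
my $bytes_written = length $buffer;

printf "%d %d %d %d\n", $char_count, $byte_count, $name_chars, $bytes_written;
# prints "1 2 5 10"
```

        Inside the program every string is counted in characters; the byte counts only appear at the boundaries (the encoded JSON coming in, the output layer going out).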

        There are two caveats:

        • Obviously, you need to save your source code UTF-8 encoded.
        • You must check whether the JSON data might, in some circumstances, contain characters which have a special meaning in HTML, in particular < and &. This has nothing to do with Unicode, though. I'm adding the relevant stuff to your sub display_html:
          sub display_html {
              use HTML::Entities;
              my $html_encoded = encode_entities(shift, '<>&"');
              my @html = (
                  '<!DOCTYPE html>',
                  '<html>',
                  '<head>',
                  '<meta charset="UTF-8">',
                  '<title>Мой тест</title>',
                  '</head>',
                  '<body>',
                  $html_encoded // 'Статус  ОК', # soft OR: 0 and empty string accepted
                  '</body>',
                  '</html>'
              );
              
              # to avoid "wide character" warnings:
              binmode STDOUT, ':utf8';
              
              print "Content-Type: text/html; charset=utf-8\n\n";
              
              print join("\n", @html);
          }
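
          As a quick check of what that call does, here is a small sketch. It assumes HTML::Entities is installed (it ships with the HTML-Parser distribution); the sample string is invented for illustration.

```perl
use strict;
use warnings;
use utf8;
use HTML::Entities qw(encode_entities);

# Hypothetical user name mixing Cyrillic with HTML-special characters.
my $raw = 'Вася <b> & "Co"';

# Only the characters listed in the second argument are replaced;
# the Cyrillic letters pass through unchanged.
my $safe = encode_entities($raw, '<>&"');

binmode STDOUT, ':utf8';
print "$safe\n";   # prints: Вася &lt;b&gt; &amp; &quot;Co&quot;
```

          Passing the explicit character set matters here: without the second argument, encode_entities also replaces non-ASCII characters with entities, which is unnecessary when the page is served as UTF-8.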
          
