PerlMonks  

Re: String match in Chinese character

by choroba (Bishop)
on Mar 11, 2018 at 22:26 UTC (#1210689)


in reply to String match in Chinese character

It works for me the same way as with "English" characters. Just don't forget to tell Perl that the source code, the input files, and the output are UTF-8.
#! /usr/bin/perl
use warnings;
use strict;
use utf8;
 
my $string = '看【厂家直销 儿童加绒加厚打底裤 中小童冬季】Ib';
 
binmode STDOUT, ':encoding(UTF-8)';
while ($string =~ /【(.*?)】/g) {
    print "Match: $1\n";    # $1 holds the text captured between the brackets
}
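The example above covers the source (use utf8) and the output (binmode). For input files, a minimal sketch could look like the following. The helper names bracketed and bracketed_from_file are made up for illustration, and the sketch assumes the file really is UTF-8.

```perl
use strict;
use warnings;
use utf8;    # this file itself is saved as UTF-8

# Return the text of every 【...】 group in an already-decoded string.
sub bracketed {
    my ($text) = @_;
    return $text =~ /【(.*?)】/g;
}

# Hypothetical helper: read a UTF-8 file through a decoding layer
# and collect the bracketed groups found on each line.
sub bracketed_from_file {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my @found = map { bracketed($_) } <$in>;
    close $in;
    return @found;
}
```

The :encoding(UTF-8) layer turns the bytes on disk into characters as they are read, so the regex sees one character per bracket rather than three bytes.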

Re^2: String match in Chinese character
by hankcoder (Scribe) on Mar 11, 2018 at 22:43 UTC

    Thanks, choroba, for the reply. The suggested method is how I used to code a long time ago. However, it causes an issue with my editor: the .pl file must be saved as UTF-8 in order to hard-code the characters, as in while ($string =~ /【(.*?)】/g)

    Furthermore, there were some other issues that I can't remember, so I stopped using Chinese characters directly in my code.

      Then escape the Unicode characters in the regex:

      use warnings;
      use strict;
      use utf8;
      my $string = '看【厂家直销 儿童加绒加厚打底裤 中小童冬季】Ib';
      
      binmode STDOUT, ':encoding(UTF-8)';
      while ($string =~ /\x{3010}(.*?)\x{3011}/g) {
          print "Match: $1\n";    # $1 holds the text captured between the brackets
      }
      

      (I left the use utf8 in so that I could easily include the same string that choroba did. However you get the string is fine.)
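If the code points are not known in advance, they can be looked up once and then pasted into an ASCII-only regex. This is only a sketch; regex_escapes is a made-up helper name, and this lookup script itself still has to be saved as UTF-8 because it contains the literal brackets.

```perl
use strict;
use warnings;
use utf8;

# Format each character of a decoded string as a \x{...} regex escape.
sub regex_escapes {
    my ($chars) = @_;
    return map { sprintf '\\x{%04X}', ord($_) } split //, $chars;
}

print join(' ', regex_escapes('【】')), "\n";    # \x{3010} \x{3011}
```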

        Or use the name of the character if you don't have the codes on the tip of your tongue:

        use warnings;
        use strict;
        use utf8;
        use charnames ':full';
        
        my $string = '看【厂家直销 儿童加绒加厚打底裤 中小童冬季】Ib';
        
        binmode STDOUT, ':encoding(UTF-8)';
        print "Match: $1\n" while $string =~ /
            \N{LEFT BLACK LENTICULAR BRACKET} (.*?) \N{RIGHT BLACK LENTICULAR BRACKET}
        /gx;
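To go between a code point and its official name, the charnames module offers viacode and vianame. A small sketch, where name_of and code_of are illustrative wrapper names and not part of the module:

```perl
use strict;
use warnings;
use charnames ();    # load the lookup functions only

# Official Unicode name for a code point, e.g. 0x3010.
sub name_of {
    my ($codepoint) = @_;
    return charnames::viacode($codepoint);
}

# Code point for an official Unicode name.
sub code_of {
    my ($name) = @_;
    return charnames::vianame($name);
}

printf "U+%04X is %s\n", 0x3010, name_of(0x3010);
# prints: U+3010 is LEFT BLACK LENTICULAR BRACKET
```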
        


Re^2: String match in Chinese character
by hankcoder (Scribe) on Mar 12, 2018 at 11:44 UTC

    Thank you for all the help, guys. I have just noticed that the Chinese characters get mangled during FORM submit encoding: "【" should be 12304 when encoded, but it is split into 3 parts: 227,128,144. So far I have not found out how to join "227,128,144" back into "12304". I have narrowed the cause down to the FORM URI-safe encoding. My current test code has become too messy to post here. If anyone has an idea, I would really appreciate it if you could point out the most likely cause.

    For the moment, I use a JavaScript function that calls .charCodeAt before the form is submitted, so that each non-ASCII character is sent in a form like "&#12304;" for "【". Only then can I use a string match in Perl to extract the strings between "【" and "】".

    In case you guys are interested in the JS, here is the code:

    function encodeCN(id) {
        var tstr = document.getElementById(id).value;
        var bstr = '';
        // Replace every non-ASCII character with a decimal reference such as &#12304;
        for (var i = 0; i < tstr.length; i++) {
            if (tstr.charCodeAt(i) > 127) {
                bstr += '&#' + tstr.charCodeAt(i) + ';';
            } else {
                bstr += tstr.charAt(i);
            }
        }
        document.getElementById(id).value = bstr;
    }
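On the Perl side, the decimal references produced by that function have to be turned back into characters before matching. The following is a rough sketch of one way to do it; decode_numeric_refs and bracketed are names invented here, and only the decimal "&#NNN;" form emitted by encodeCN is handled.

```perl
use strict;
use warnings;

# Turn decimal references such as "&#12304;" back into characters.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

# Extract the groups between U+3010 and U+3011 from the submitted value.
sub bracketed {
    my ($submitted) = @_;
    my $chars = decode_numeric_refs($submitted);
    return $chars =~ /\x{3010}(.*?)\x{3011}/g;
}
```

Note that charCodeAt returns UTF-16 code units, so characters outside the Basic Multilingual Plane would arrive as two surrogate values; this sketch does not recombine them.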

      You will most likely need to either look at the Content-Type header of your form submission request or, if that fails, guess. I think browsers used to submit form data in the same encoding as the page containing the HTML form, but I hope that fairly recent browsers always send the character set/encoding with the request:

      Content-Type: text/html; charset=utf-8

      Ideally, your framework already looks at that and uses the appropriate Encode::decode call, but I'm not sure what your users' browsers actually send and whether that can be decoded without problems. Seeing more of the code that receives the input, and of the HTML that displays the FORM to the client, might help us narrow the problem down.
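If no framework does the decoding, it can be done by hand. A minimal sketch, assuming the browser submitted the form as UTF-8 and the value arrives percent-encoded (for example "%E3%80%90"); percent_decode and form_value_to_chars are hypothetical helper names:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Undo URI percent-encoding, giving the raw bytes the browser sent.
sub percent_decode {
    my ($value) = @_;
    $value =~ tr/+/ /;    # form encoding uses "+" for a space
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $value;
}

# Hypothetical helper: raw form value -> Perl character string,
# assuming the submitted bytes are UTF-8. Dies on malformed input.
sub form_value_to_chars {
    my ($raw) = @_;
    my $bytes = percent_decode($raw);
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}
```

After this step, ord(form_value_to_chars('%E3%80%90')) is 12304, and the regexes shown earlier in the thread match as expected.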

      where "【" should be 12304 when encoded, but it is split into 3 parts: 227,128,144

      Yes, that is properly encoded UTF-8. Codepoints from U+0800 to U+FFFF are encoded with 3 bytes. The codepoint 12304, which is 0x3010 in hex and is usually written U+3010 in Unicode notation, should be encoded as the three bytes 0xE3, 0x80, 0x90. Working it out:

      Codepoint 12304 (codepoint 0x3010 in hex, U+3010)
      
      hex 0x3010
      hex 3    0    1    0
      bin 0011 0000 0001 0000
          xxxx yyyy yyzz zzzz     (use x, y, and z to indicate the groups of bits in the codepoints)
      
      encoding:
          ....xxxx ..yyyyyy ..zzzzzz  (use xyz as above; use dots . to indicate bits specified in UTF-8 encoding)
      bin 11100011 10000000 10010000
      
      hex E3       80       90
      dec 227      128      144
      
      ... which is what you listed

      (This is, btw, why Corion told you to look for the charset=utf-8 in the Content-type, because he recognized those three bytes were the appropriate UTF-8 encoding of the LEFT BLACK LENTICULAR BRACKET (U+3010) )
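The bit shuffling worked out above can also be written as a few lines of Perl. This is only an illustration of the 3-byte case (encode_3byte and decode_3byte are made-up names); real code would normally use Encode, and the sketch does not reject the surrogate range U+D800 to U+DFFF.

```perl
use strict;
use warnings;

# Encode one code point in the range U+0800..U+FFFF as three UTF-8 bytes.
sub encode_3byte {
    my ($cp) = @_;
    die "outside the 3-byte range" if $cp < 0x800 || $cp > 0xFFFF;
    return (
        0xE0 | ($cp >> 12),            # 1110xxxx
        0x80 | (($cp >> 6) & 0x3F),    # 10yyyyyy
        0x80 | ($cp & 0x3F),           # 10zzzzzz
    );
}

# Reverse step: join three UTF-8 bytes back into one code point.
sub decode_3byte {
    my ($b1, $b2, $b3) = @_;
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

print join(',', encode_3byte(12304)), "\n";    # 227,128,144
print decode_3byte(227, 128, 144), "\n";       # 12304
```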
