
Hi, I have run out of ideas for how to match the text between two delimiter characters when the delimiters are Chinese characters. With English (ASCII) characters I can easily find the match with $content =~ m/\[(.*?)\]/sig, but how can I do the same match when working with Chinese characters?

I'm not sure which approach is best. Should I use Base64 encoding, hex encoding, or some other encoding?

Example:

....看【厂家直销儿童加绒加厚打底裤中小童冬季】Ib.....

The goal is to match any string between 【 and 】.
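One possible way to do this match directly, without Base64 or hex, is sketched below. It assumes the input arrives as UTF-8 bytes and is decoded into Perl characters before matching; the helper name between_brackets is made up for illustration.

```perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use Encode qw(decode encode);

# Hypothetical helper: takes raw UTF-8 bytes and returns every piece of
# text found between a 【 and the next 】, as Perl character strings.
sub between_brackets {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # bytes -> characters
    return $text =~ /【(.*?)】/sg;          # list context: all captures
}

# Simulate input as it would arrive from a file or form: UTF-8 bytes.
my $raw   = encode('UTF-8', '....看【厂家直销儿童加绒加厚打底裤中小童冬季】Ib.....');
my @found = between_brackets($raw);

binmode STDOUT, ':encoding(UTF-8)';        # encode characters again on output
print "$_\n" for @found;                   # prints 厂家直销儿童加绒加厚打底裤中小童冬季
```

The point of the sketch is that the regex itself does not change. What changes is that both the pattern (via use utf8) and the data (via decode) are characters rather than bytes when the match runs.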

My current practice is to Base64-encode or hex-encode input strings, especially non-English ones, and then store them in files.

While writing this question, I noticed that perlmonks.org encodes my Chinese characters as &#12304; for 【 and &#12305; for 】. I forget what this encoding is called, as I haven't used it for a very long time. Would this be the best method to use?

** UPDATE TO THE INITIAL QUESTION **

I am updating this with my latest finding: my actual problem is caused by encoding during form submission, not by the string match. I have copied my follow-up post here, or you can go directly to it (http://www.perlmonks.org/?node_id=1210729). If it would be more appropriate to ask this in a new, separate post, let me know and I will do that. Thanks.

OK, here is a cleaner, self-contained Perl script with an inline form submit. Make sure the form action value is "utf8_encode.pl", or change it to whatever you prefer. For a direct test, use the Chinese character "【" as an example. For result #3, I use unpack. I previously found several ways of doing this, and they all give the same result: the single Chinese character "【", when split, becomes 3 characters: 227, 128, 144.

I still don't quite understand the explanation given, but I'm almost getting the hang of it.

If I can get the encoding solved, I think I should be able to get the decoding working as well.

The string match will be done in a separate processing step, where the code looks like this: my ($result) = $str =~ m/\&\#12304\;(.*?)\&\#12305\;/sig;
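As a standalone sketch of that separate step, the snippet below runs the same kind of match on an already-encoded string and then turns the decimal references back into characters. The helper names and the sample string are made up for illustration, and the /i and /g flags are dropped because only the first match is wanted here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between &#12304; and &#12305;,
# or undef when the delimiters are not present.
sub match_entity_brackets {
    my ($str) = @_;
    my ($result) = $str =~ m/&#12304;(.*?)&#12305;/s;
    return $result;
}

# Hypothetical helper: turns decimal references such as &#12304; back
# into the characters they stand for.
sub decode_numeric_refs {
    my ($str) = @_;
    $str =~ s/&#(\d+);/chr($1)/ge;
    return $str;
}

my $encoded = 'x&#12304;&#21378;&#23478;&#12305;Ib';
my $inner   = match_entity_brackets($encoded);
print "$inner\n";                          # prints &#21378;&#23478;

binmode STDOUT, ':encoding(UTF-8)';
print decode_numeric_refs($inner), "\n";   # prints 厂家
```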

#!/usr/bin/perl
################################################################################
#
#
################################################################################
use CGI ':standard';
use HTML::Entities; #-- for encode and decode string


(%FORM) = ();
if ($ENV{'REQUEST_METHOD'} eq "POST")
{
    my ($id);
    #-- extract the value inside param into %FORM hash
    foreach $id (param)
        {
            $FORM{$id} = param($id);
        }
} # // if post

print "Content-Type: text/html; charset=utf-8\n\n";

print "<h2>Encode UTF-8 Chinese Character Input</h2><br>";

print &input_form;


#---------------------------------------------------#
#---------------------------------------------------#
sub input_form
{
    my ($content) = "";
    
    my ($value) = "";
    if ($FORM{'data'} ne "")
        {
            $value = $FORM{'data'};
        }
        
    my ($encoded_value) = "";
    my ($process_content) = "";
    
    if ($FORM{'action'} eq "encode")
        {
            $encoded_value = $FORM{'encoded_value'};
            
            # !! attempt to do encoding inside perl but the $FORM{'data'} when split,
            # it become 3 char for Chinese char !!
            my (@arr) = split(//,$FORM{'data'});
            foreach my $c (@arr)
                {
                    $c = unpack('C*', $c);
                    $process_content .= "$c\n";
                }
        }
    elsif ($FORM{'action'} eq "decode")
        {
        }
        
    
    #-- content ---------------------------------
    $content = qq~
    <script type="text/javascript">
    function encodeCN(id) {
        var tstr = document.getElementById(id).value;
        var bstr = '';
        for(i=0; i<tstr.length; i++)
        {
            if(tstr.charCodeAt(i)>127)
            {
                bstr += '&#' + tstr.charCodeAt(i) + ';';
            }
            else
            {
                bstr += tstr.charAt(i);
            }
        }
        document.getElementById('encoded_value').value = bstr;
    }
    </script>
    
    <form id="fr_in" name="fr_in" action="utf8_encode.pl" style="" method="POST" enctype="application/x-www-form-urlencoded">
    <input type="hidden" onFocus="this.blur()" name="convert" id="convert" value="">
    <input type="hidden" onFocus="this.blur()" name="action" id="action" value="">
    <input type="hidden" name="encoded_value" id="encoded_value" value="">
    
    <textarea id="data" name="data" style="width:600px; height:200px;">$value</textarea>
    <br>
<xmp>
1. FORM submitted value:
$value

2. Encoded value thru JS before form submit:
$encoded_value

3. *Try to do encoding inside Perl*
$process_content
</xmp>
    

    <input type="button" value="Encode" onClick="encodeCN('data'); document.getElementById('action').value='encode'; this.form.submit();">
    <input type="button" value="Decode" onClick="document.getElementById('action').value='decode'; this.form.submit();">
    </form>
    ~;
    #--// content -------------------------------
    
    return ($content);
}
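For result #3, here is a minimal sketch of doing the same encoding as the JavaScript encodeCN() inside Perl. It assumes the submitted form value arrives as raw UTF-8 bytes, which would explain why "【" shows up as the three values 227, 128, 144. The helper name encode_cn is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper mirroring the JavaScript encodeCN(): decode the
# UTF-8 bytes into characters, then replace every non-ASCII character
# with a decimal reference of the form &#NNNN;.
sub encode_cn {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # E3 80 90 -> one character U+3010
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $raw = "\xE3\x80\x90";                  # the bytes delivered for 【

print join(',', unpack('C*', $raw)), "\n";     # prints 227,128,144
print length(decode('UTF-8', $raw)), "\n";     # prints 1
print encode_cn($raw), "\n";                   # prints &#12304;
```

One difference from the JavaScript version: charCodeAt() works on UTF-16 code units, so characters outside the Basic Multilingual Plane come out as two references there, while ord() in this sketch yields a single code point.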

In reply to String match in Chinese character by hankcoder
