Re^4: Unicode substitution regex conundrum

by Polyglot (Monk)
on Oct 17, 2007 at 03:35 UTC (#645353)


in reply to Re^3: Unicode substitution regex conundrum
in thread Unicode substitution regex conundrum

Well, I'm stumped. The program will match the spaces properly when doing a split// but not when doing a s///. I have now set the attribute on the HTML form to UTF8. I have inserted the code posted earlier, and still nothing. So, to demonstrate the exact conundrum I am up against, I have reduced my code to just the barest essentials for testing this UTF8 regex.

Please feel free to try this script on your own server to see if you can get it to work properly on Chinese text. I have included a sample Chinese phrase in the page the script generates, which you should be able to copy and paste into the search box for testing purposes. Compare it with an English search, and you'll see why I'm frustrated!

#!/usr/bin/perl -wT -CE

use Encode;
use Encode qw(_utf8_on);
use Encode qw(encode decode);

##### PARSE THE FORM INPUT
if ($ENV{CONTENT_LENGTH}) {
    read(STDIN, $buffer, $ENV{CONTENT_LENGTH});
    @pairs = split(/&/,$buffer);
} else {
    $buffer = $ENV{QUERY_STRING};
    @pairs = split(/\+/,$buffer);
}

foreach $pair (@pairs) {
    ($name, $value) = split(/=/,$pair);
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;
    $input{$name} = $value;
}
$terms=$input{terms};

##### START TESTING PHASE
print "Content-type: text/html\n\n";
print "TERMS: $terms";

##### TRY A SPLIT
($a, $b, $c, $d, $e, $f) = split/\p{IsSpace}/, $terms;
print "<p>A:$a:<p>B:$b:<p>C:$c:<p>D:$d:<p>E:$e:<p>F:$f:\n";

##### NOW TRY A SUBSTITUTION
$word = qr/\b(?!(?:AND|OR|XOR|NOT)\b)\w+/i;
$terms =~ s/($word)\p{IsSpace}*($word)/$1 AND $2/g for 1..2;
print "<p>Terms:$terms\n";

##### PRINT THE WEBPAGE
print <<HTML;
<html lang=utf8>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf8">
<title>SEARCH</title>
</head><body>
<h1 align="center">Search</h1>
<form name="ff" method="POST" accept-encoding="UTF-8" accept-charset="utf-8" action="$0">
Search terms:
<input type="text" size="40" name="terms" value="$terms"></input>
<p>An example Chinese phrase: &#32102;&#32842; &#22825;&#23460;&#25152; &#26377;&#25104;&#21729;
<input type="submit" name="submit" value="Submit"></input>
</form></body></html>
HTML


Re^5: Unicode substitution regex conundrum
by Lu. (Hermit) on Dec 16, 2007 at 22:39 UTC
    Hi, I may be too late, since your message has been here for nearly two months, but this could still be of use to you or someone else.

    I may be wrong, but it seems to me that the problem lies not with the whitespace but with what Perl treats as a word character: \w+ does not match the Chinese characters in your input.

    On my system (with a Unicode locale and Chinese readable in the console):
    $ perl -le 'print "ok" if ("&#25105;&#36208;" =~ m/\w+/)'
    $ perl -le 'print "ok" if ("hi" =~ m/\w+/)'
    ok
    (The Chinese characters got jumbled when posting; I did not type the entity codes in the one-liner.)
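
    One possible explanation, offered as a sketch and not a confirmed diagnosis: a string literal without "use utf8", like form data that has only been percent-unescaped, holds UTF-8 bytes, not characters, and \w does not match those bytes. After Encode::decode the same text does match. The helper name is_single_word is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoded bytes for U+6211 U+8D70, as they would arrive undecoded.
my $bytes = "\xE6\x88\x91\xE8\xB5\xB0";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: true if the whole string is one run of \w characters.
sub is_single_word {
    my ($text) = @_;
    return $text =~ /\A\w+\z/ ? 1 : 0;
}

# bytes: length 6, single word: no
printf "bytes: length %d, single word: %s\n",
    length($bytes), is_single_word($bytes) ? 'yes' : 'no';

# chars: length 2, single word: yes
printf "chars: length %d, single word: %s\n",
    length($chars), is_single_word($chars) ? 'yes' : 'no';
```

    If this is what is happening in the script, the whitespace still matches in both cases because an ASCII space is the same single byte before and after decoding.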

    Furthermore, I played a bit with your code, and when I replaced
    $terms =~ s/($word)\p{IsSpace}*($word)/$1 AND $2/g for 1..2;
    with
    $terms =~ s/\p{IsSpace}/ AND /g;
    it did the job I expected of it.
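
    For comparison, a small sketch of what a whitespace-only substitution does. The sub name and_every_space is hypothetical. It never looks at the words, which is why it is unaffected by how \w behaves, but for the same reason it also fires on the spaces around operators the user typed.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every whitespace character with " AND ".
sub and_every_space {
    my ($terms) = @_;
    (my $out = $terms) =~ s/\p{IsSpace}/ AND /g;
    return $out;
}

print and_every_space("foo bar"), "\n";       # foo AND bar
print and_every_space("foo OR bar"), "\n";    # foo AND OR AND bar
```

    So this form is fine for plain term lists, while keeping existing AND/OR/XOR/NOT operators intact still needs a word pattern like the original one.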

    The quickest workaround I see at the moment would be to declare $word using CJK character ranges instead of \w.
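
    A minimal sketch of that workaround, under two assumptions that are not confirmed in this thread: that the browser submits the field as percent-encoded UTF-8, and that the value is decoded to characters before matching. The helper names decode_form_value and insert_and are invented for illustration, \p{Han} stands in for explicit CJK ranges, and a lookahead replaces the two-pass "for 1..2" loop.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn one raw form value into a Perl character string.
# Assumes the browser sent percent-encoded UTF-8.
sub decode_form_value {
    my ($raw) = @_;
    $raw =~ tr/+/ /;                               # "+" encodes a space
    $raw =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # undo %XX escapes (still bytes)
    return decode('UTF-8', $raw);                  # bytes -> characters
}

# A word: a run of Han or other word characters that is not a boolean operator.
my $word = qr/\b(?!(?:AND|OR|XOR|NOT)\b)[\p{Han}\w]+/i;

# Hypothetical helper: put AND between adjacent words. The lookahead leaves the
# second word unconsumed, so it can pair with the following word in one pass.
sub insert_and {
    my ($terms) = @_;
    (my $out = $terms) =~ s/($word)\p{IsSpace}+(?=$word)/$1 AND /g;
    return $out;
}

# First two words of the sample phrase, as a browser might submit them.
my $terms = decode_form_value('%E7%B5%A6%E8%81%8A+%E5%A4%A9%E5%AE%A4');

binmode STDOUT, ':encoding(UTF-8)';   # encode characters again on output
print insert_and($terms), "\n";
```

    On a decoded string \w already covers Han characters, so \p{Han} in the character class is redundant there. It is kept only to make the intent explicit.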

    Hope I could be of help.

    Lu.
