Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

How can I match strings with multibyte characters?

by faq_monk (Initiate)
on Oct 08, 1999 at 00:25 UTC ( #677=perlfaq nodetype: print w/ replies, xml ) Need Help??

Current Perl documentation can be found at perldoc.perl.org.

Here is our local, out-dated (pre-5.6) version:

This is hard, and there's no good way. Perl does not directly support wide characters. It pretends that a byte and a character are synonymous. The following set of approaches was offered by Jeffrey Friedl, whose article in issue #5 of The Perl Journal talks about this very matter.

Let's suppose you have some weird Martian encoding where pairs of ASCII uppercase letters encode single Martian letters (i.e. the two bytes ``CV'' make a single Martian letter, as do the two bytes ``SG'', ``VS'', ``XX'', etc.). Other bytes represent single characters, just like ASCII.

So, the string of Martian ``I am CVSGXX!'' uses 12 bytes to encode the nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character /GX/. Perl doesn't know about Martian, so it'll find the two bytes ``GX'' in the ``I am CVSGXX!'' string, even though that character isn't there: it just looks like it is because ``SG'' is next to ``XX'', but there's no real ``GX''. This is a big problem.

Here are a few ways, all painful, to deal with it:

   $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
                                      # are no longer adjacent.
   print "found GX!\n" if $martian =~ /GX/;

Or like this:

   @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
   # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
   #
   foreach $char (@chars) {
       print "found GX!\n", last if $char eq 'GX';
   }

Or like this:

   while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
       print "found GX!\n", last if $1 eq 'GX';
   }

Or like this:

   die "sorry, Perl doesn't (yet) have Martian support )-:\n";

In addition, a sample program which converts half-width to full-width katakana (in Shift-JIS or EUC encoding) is available from CPAN as

There are many double- (and multi-) byte encodings commonly used these days. Some versions of these have 1-, 2-, 3-, and 4-byte characters, all mixed.

Log In?
Username:
Password:

What's my password?
Create A New User
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (8)
As of 2015-07-06 23:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (85 votes), past polls