PerlMonks  

Re: Regular expressions and accents

by graff (Chancellor)
on Dec 22, 2004 at 04:31 UTC (#416690)


in reply to Regular expressions and accents

Larry Wall recently posted this nifty little script on the perl-unicode mail list -- here it is, pretty much verbatim (I added the "S" on the shebang line, to make STDIN/STDOUT/STDERR be utf8):

#!/usr/bin/perl -CS
$pat = shift;
if (ord $pat > 256) {
    $pat = sprintf("%04x", ord $pat);
}
elsif (ord $pat > 128) {    # arg in sneaky UTF-8
    $pat = sprintf("%04x", unpack("U0U",$pat));
}
@names = split /^/, do 'unicore/Name.pl';
for (@names) {
    if (/$pat/io) {
        $hex = hex($_);
        print chr($hex),"\t",$_;
    }
}

The idea is to output a list of Unicode code points (if any) whose entries match whatever expression you put into $ARGV[0]. Here's a relevant command-line usage example (Larry had this script in a file named "uni"):
uni "latin (?:small|capital) letter A with"
(update: if you try this, you'll want to be running in a terminal window that handles utf8 characters!)
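For comparison, here is a minimal sketch of the same kind of name search done through the core charnames module instead of reading unicore/Name.pl directly, since that file is internal to Perl and its layout may vary between versions. The helper name find_chars and the code-point range passed to it are illustrative assumptions, not part of Larry's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();    # core module; viacode() maps a code point to its name

# Hypothetical helper: return [code point, name] pairs in $lo..$hi
# whose character name matches $pattern (case-insensitively).
sub find_chars {
    my ($pattern, $lo, $hi) = @_;
    my $re = qr/$pattern/i;
    my @hits;
    for my $cp ($lo .. $hi) {
        my $name = charnames::viacode($cp);
        next unless defined $name && $name =~ $re;
        push @hits, [ $cp, $name ];
    }
    return @hits;
}

# Print character, hex code point and name, one match per line.
binmode STDOUT, ':encoding(UTF-8)';
for my $hit (find_chars('latin (?:small|capital) letter a with', 0x00C0, 0x024F)) {
    printf "%s\t%04X\t%s\n", chr($hit->[0]), $hit->[0], $hit->[1];
}
```

Unlike "uni", this only scans the range you hand it, so widen the range if you need characters outside the Latin blocks.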

So, all you need for what you want is the part that assigns the output of "unicore/Name.pl" to an array (this gives you the list of Unicode character names), plus a grep through the array to get the set of vowels you want. Then convert the first token of each matching element (the hex code point at the start of the element) into the character itself, and put those characters into a character-class expression. Something like:

my @names = split /^/, do 'unicore/Name.pl';
#...
my @vowelsets;
for my $v ( qw/A E I O U/ ) {
    # \b keeps "A" from also matching names such as "... LETTER AE"
    push( @vowelsets,
          join( '', map { chr hex( substr $_, 0, 4 ) }
                    grep /LATIN (?:SMALL|CAPITAL) LETTER $v\b/, @names ));
}
# now you can use each element of @vowelsets as a character class
# (similarly for consonants...)
(updated this snippet: changed the map block from a regex to substr; updated a second time to use "chr hex()" in the map block -- each element of @names begins with a four-digit hex code-point value, which needs to be converted to a character.)
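To show what "use as a character class" might look like in practice, here is a minimal sketch that builds one bracketed class per vowel and applies it in a substitution. It swaps in charnames::viacode for the Name.pl lookup; the helper name vowel_class, the range U+0041..U+017F and the sample word are all assumptions made for illustration.

```perl
use strict;
use warnings;
use charnames ();   # core module; viacode() maps a code point to its name

# Hypothetical helper: return a bracketed character class holding every
# character in $lo..$hi named "LATIN SMALL/CAPITAL LETTER <vowel> ...".
sub vowel_class {
    my ($vowel, $lo, $hi) = @_;
    my $name_re = qr/^LATIN (?:SMALL|CAPITAL) LETTER \Q$vowel\E\b/;
    my $chars = join '',
        map  { chr($_) }
        grep { (charnames::viacode($_) || '') =~ $name_re }
        $lo .. $hi;
    return "[$chars]";    # only letters end up in $chars, so no escaping needed
}

# The range (ASCII through Latin Extended-A) is an illustrative choice.
my %class = map { $_ => vowel_class($_, 0x41, 0x17F) } qw/A E I O U/;

binmode STDOUT, ':encoding(UTF-8)';
my $word = "cr\x{e8}me br\x{fb}l\x{e9}e";       # "crème brûlée"
(my $marked = $word) =~ s/($class{E})/<$1>/g;   # wrap every e-like letter
print "$marked\n";                              # prints: cr<è>m<e> brûl<é><e>
```

Each class is an ordinary string, so it can be interpolated into larger patterns, for example /^c$class{A}f$class{E}$/ to match "café" with or without the accent.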

Still a bit cumbersome, I suppose, but quite manageable and not that bulky.

