Re: Regular expressions and accents

by graff (Chancellor)
on Dec 22, 2004 at 04:31 UTC


in reply to Regular expressions and accents

Larry Wall recently posted this nifty little script on the perl-unicode mailing list. Here it is, pretty much verbatim (I added the "S" to the -C switch on the shebang line, so that STDIN, STDOUT, and STDERR are treated as utf8):
    #!/usr/bin/perl -CS
    $pat = shift;
    if (ord $pat > 256) {
        $pat = sprintf("%04x", ord $pat);
    }
    elsif (ord $pat > 128) {      # arg in sneaky UTF-8
        $pat = sprintf("%04x", unpack("U0U",$pat));
    }
    @names = split /^/, do 'unicore/Name.pl';
    for (@names) {
        if (/$pat/io) {
            $hex = hex($_);
            print chr($hex),"\t",$_;
        }
    }
The idea is to output a list of Unicode characters (if any) whose entries match whatever expression you put into $ARGV[0]. Here's a relevant command-line usage example (Larry had this script in a file named "uni"):
uni "latin (?:small|capital) letter A with"
(update: if you try this, you'll want to be running in a terminal window that handles utf8 characters!)

So, all you need for what you want is the part that assigns the output of "unicore/Name.pl" to an array (this gives you the character names from the Unicode character database), plus a grep through that array to get the set of vowels you want. Then convert the first token of each matching element (a four-digit hex code-point value) into the character itself, and put the resulting characters into a character-class expression. Something like:

    my @names = split /^/, do 'unicore/Name.pl';
    #...
    my @vowelsets;
    for my $v ( qw/A E I O U/ ) {
        # \b keeps "LETTER A" from also matching names such as "LETTER AE"
        push( @vowelsets,
              join( '', map  { chr hex( substr $_, 0, 4 ) }
                        grep /LATIN (?:SMALL|CAPITAL) LETTER $v\b/, @names ));
    }
    # now you can use each element of @vowelsets as a character class
    # (similarly for consonants...)
(updated this snippet: changed the map block from a regex to substr; updated a second time to use "chr hex()" in the map block -- each element of @names begins with a four-digit hex code-point value, which needs to be converted to a character.)

Still a bit cumbersome, I suppose, but quite manageable and not that bulky.
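
For anyone who wants to try the character-class idea without depending on the layout of "unicore/Name.pl" (an internal file whose format may differ between Perl versions), here is a minimal sketch of the same approach using charnames::viacode instead. This is a swapped-in technique, not what the snippet above does. The helper name latin_variants, the 0x024F upper bound (end of Latin Extended-B), and the sample word are assumptions made purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();    # gives access to charnames::viacode

# Hypothetical helper: return a string of every character in the scanned
# range whose Unicode name is "LATIN SMALL/CAPITAL LETTER <base>",
# optionally followed by " WITH <something>".
sub latin_variants {
    my ($base, $last) = @_;
    $last = 0x024F unless defined $last;    # assumed bound: Latin Extended-B
    my $chars = '';
    for my $cp (0x0041 .. $last) {
        my $name = charnames::viacode($cp);
        next unless defined $name;          # skip unnamed code points
        $chars .= chr $cp
            if $name =~ /^LATIN (?:SMALL|CAPITAL) LETTER $base(?: WITH .+)?$/;
    }
    return $chars;
}

# One set of characters per vowel, keyed by the base letter.
my %class = map { $_ => latin_variants($_) } qw/A E I O U/;

# Wrap a set in [...] to use it as a character class.
my $any_a = qr/[$class{A}]/;

my $word = "d\x{E9}j\x{E0}";                # "deja" with e-acute and a-grave
(my $masked = $word) =~ s/$any_a/A/g;       # replaces only the a-grave

printf "%s: %d variants\n", $_, length $class{$_} for qw/A E I O U/;
```

Anchoring the name match at both ends is a deliberate choice here: it keeps ligatures and look-alike names (for example "LATIN SMALL LETTER AE") out of the A set. Widen the range or loosen the pattern if you need more than the basic accented forms.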