Re: Regular expressions and accents

by graff (Chancellor)
on Dec 22, 2004 at 04:31 UTC (#416690)


in reply to Regular expressions and accents

Larry Wall recently posted this nifty little script on the perl-unicode mail list -- here it is, pretty much verbatim (I added the "S" on the shebang line, to make STDIN/STDOUT/STDERR be utf8):
    #!/usr/bin/perl -CS

    $pat = shift;
    if (ord $pat > 256) {
        $pat = sprintf("%04x", ord $pat);
    }
    elsif (ord $pat > 128) {   # arg in sneaky UTF-8
        $pat = sprintf("%04x", unpack("U0U",$pat));
    }

    @names = split /^/, do 'unicore/Name.pl';
    for (@names) {
        if (/$pat/io) {
            $hex = hex($_);
            print chr($hex),"\t",$_;
        }
    }
The idea is to output a list of Unicode characters (if any) whose entries in the names table match whatever expression you put into $ARGV[0] -- here's a relevant command-line usage example (Larry had this script in a file named "uni"):
    uni "latin (?:small|capital) letter A with"
(update: if you try this, you'll want to be running in a terminal window that handles utf8 characters!)

So, all you need for what you want is the part that assigns the output of "unicore/Name.pl" to an array -- this gives you the Unicode character names table -- and then a grep through the array to get the set of vowels you want. Each array element begins with the hex code point, so convert that leading field to its character and join the results into a string you can use as a character-class expression. Something like:

    my @names = split /^/, do 'unicore/Name.pl';
    #...
    my @vowelsets;
    for my $v ( qw/A E I O U/ ) {
        # the \b keeps "LETTER A" from also matching names like "LETTER AE"
        push( @vowelsets,
              join( '', map  { chr hex( substr $_, 0, 4 ) }
                        grep /LATIN (?:SMALL|CAPITAL) LETTER $v\b/, @names ));
    }
    # now you can use each element of @vowelsets as a character class
    # (similarly for consonants...)
(updated this snippet: changed the map block from a regex to substr; updated a second time to use "chr hex()" in the map block -- each element of @names begins with a four-digit hex code-point value, which needs to be converted to a character.)
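To make that conversion step concrete, here is a tiny sketch of what the map block does to a single element. The sample line is only my illustration of the layout described above (four hex digits, then the name); the variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample element, assuming the layout described above:
# a four-digit hex code point at the start, followed by the character name.
my $line = "00E9\t\tLATIN SMALL LETTER E WITH ACUTE\n";

my $code = substr $line, 0, 4;   # the leading hex field, "00E9"
my $char = chr hex $code;        # hex() gives the number, chr() the character

binmode STDOUT, ':encoding(UTF-8)';
printf "U+%s is %s (ordinal %d)\n", $code, $char, ord $char;
```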

Still a bit cumbersome, I suppose, but quite manageable and not that bulky.
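For what it's worth, here is a minimal sketch of how one of those sets might be dropped into a pattern. Note that this is a swapped-in variant, not Larry's method: it asks charnames::viacode() for each name instead of reading unicore/Name.pl, on the assumption that the layout of that internal file may differ between perl versions. The U+0000..U+024F scan range and the names %vowel_class and $any_a are my own illustrative choices.

```perl
use strict;
use warnings;
use utf8;            # the sample words below contain literal non-ASCII characters
use charnames ();    # load charnames::viacode() without importing anything

# Assumption: only scan U+0000..U+024F (Basic Latin through Latin Extended-B).
my %vowel_class;
for my $cp ( 0x00 .. 0x24F ) {
    my $name = charnames::viacode($cp) or next;   # skip unnamed code points
    # \b keeps "LETTER A" from also matching names like "LETTER AE"
    next unless $name =~ /^LATIN (?:SMALL|CAPITAL) LETTER ([AEIOU])\b/;
    $vowel_class{$1} .= chr $cp;
}

# Each value is a plain string of letters, so it can be interpolated
# directly inside [...] to form a character class.
my $any_a = qr/[$vowel_class{A}]/;

binmode STDOUT, ':encoding(UTF-8)';
for my $word ( 'château', 'Ångström', 'xyz' ) {
    printf "%s\t%s\n", $word,
        ( $word =~ $any_a ? 'has an A-like vowel' : 'no match' );
}
```

The same hash could hold consonant sets by widening the capture group; the range would need extending if you care about letters beyond Latin Extended-B.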
