PerlMonks  

split text into words -- Unicode problem (I guess)

by bogdan77 (Initiate)
on Mar 29, 2007 at 13:13 UTC ( [id://607244]=perlquestion )

bogdan77 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl monks, I have a problem with the split function applied to text containing many different Unicode variations of A, B, C...Z. I want to get the list of words in any text I feed it, including the ones that have "®" or "™" attached to a word. Unfortunately, the regex I have so far also attaches "…" or "†" (dagger) to the words, so that I get "Canal†", "old…" and "kraal¹" instead of simply "Canal", "old" and "kraal". Here's the script:
#!/usr/bin/perl
use warnings;

$textBlock="Belgium is a monarchy. Ionuț, close the door. The buildings in Japan. This is a sky-blue* \"material\". Here's the list: The Eire Canal† is old… Is Apple™ only a computer company? www.fjydhjfjxerhuir.com is the site — go there. old-man-of-the-woods (forget-me-not) [digital] bogus_address99@geemail.com She's there. replace «date» They live in a kraal¹. façade Mac OS X® is slated to ship in May. Überzone șăâțî http://language.perl.com/faq";

@wordList=split/[^-A-Z®À-ÖØ-öù-žǍ-ȚḀ-ẛẠ-ỹ™]+/i, $textBlock;

foreach (@wordList) {
    print "$_\n";
}
As you can see, I define the split delimiters as anything other than the characters in this set --> "-A-Z®À-ÖØ-öù-žǍ-ȚḀ-ẛẠ-ỹ™", but I still get unwanted punctuation characters attached to words... There's no overlapping between these ranges (i.e., they are in the proper order, with "-" first at 0045 and "™" (the trademark symbol) last at 8482, if I remember correctly :^) So... what am I doing wrong? Please enlighten me... (I have Perl 5.8.8 on Mac OS X, if that matters)

Replies are listed 'Best First'.
Re: split text into words -- Unicode problem (I guess)
by andye (Curate) on Mar 29, 2007 at 14:13 UTC
    So to look at it another way, the only thing you do want to split on is whitespace; is that right?

    If so, take a look at '\s' in perlre.

    HTH! andye

Re: split text into words -- Unicode problem (I guess)
by dk (Chaplain) on Mar 29, 2007 at 14:26 UTC
    &#1111; is not a valid character in Perl; you need \x{1111}. Also, look at the Unicode character properties in perlunicode; you should find the \p{L} class useful in regexes:
    $_ = "a\x{2625}\x{10000}";
    print map { sprintf "%x\n", ord } m/(\p{L})/g;

    Output:
    61
    10000
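As a rough illustration of the \p{L} suggestion above, here is a minimal sketch, not the original poster's script. It assumes the file is saved as UTF-8 and adds `use utf8` and a `binmode` call, neither of which appears in the post; the shortened sample text and the variable names are chosen only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # assumption: this source file is saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';     # encode characters on output

# Shortened sample taken from the text in the question.
my $text = "The Eire Canal† is old… Is Apple™ only a computer company? They live in a kraal¹.";

# Split on runs of anything that is not a hyphen, a letter (\p{L}), "®" or "™".
# grep drops an empty leading field, which appears if the text starts with a delimiter.
my @words = grep { length } split /[^-\p{L}®™]+/, $text;

print "$_\n" for @words;
```

Because "†", "…" and "¹" are not letters, they fall into the delimiter set, while "™" stays attached because it is listed explicitly in the class.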
      @andye Um... no. Simply using space as the delimiter would give me "monarchy.", "Japan.", "company?", and so on, not just the words themselves.

      @dk The text in the variable textBlock doesn't contain "ї" constructs -- it contains the real characters (for example, ț is "T with comma below"). The html encoding changed those characters into "ї" constructs when I submitted them.
        Argh... let me try again... @dk The text in the variable textBlock doesn't contain &#(numbers); constructs -- it contains the real characters (for example, &#five3hree9ine; is "T with comma below"). The html encoding changed those characters into &#(number); constructs when I submitted them.
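One possible explanation for the stray punctuation, offered as a guess and not as a confirmed diagnosis: the script as posted has no `use utf8`, so if the file was saved as UTF-8, Perl would read the character class byte by byte, and a range such as "Ḁ-ẛ" would turn into byte ranges that cover most high bytes. The sketch below imitates that with a reduced class ("Ḁ-ẛ" and "™" only) written out as explicit bytes; the reduced class and the variable names are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# "Canal† is old" as raw UTF-8 bytes ("†" is the three bytes E2 80 A0).
my $bytes = "Canal\xE2\x80\xA0 is old";
my $chars = decode('UTF-8', $bytes);    # same text as real characters

# Class "A-Za-zḀ-ẛ™" seen as bytes: E1 B8 80-E1 BA 9B E2 84 A2.
# The accidental range \x80-\xE1 plus \xE2 swallows every byte of "†".
my @from_bytes = split /[^A-Za-z\xE1\xB8\x80-\xE1\xBA\x9B\xE2\x84\xA2]+/, $bytes;

# The same class seen as characters: U+1E00-U+1E9B and U+2122.
my @from_chars = split /[^A-Za-z\x{1E00}-\x{1E9B}\x{2122}]+/, $chars;

printf "byte semantics:      first field is %d bytes long\n", length $from_bytes[0];
printf "character semantics: first field is '%s'\n", $from_chars[0];
```

Under byte semantics the first field keeps the three bytes of the dagger attached to "Canal"; under character semantics the dagger is treated as a delimiter.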

Node Type: perlquestion [id://607244]
Approved by marto