PerlMonks  

Unicode words match and catch

by kepler (Scribe)
on Apr 14, 2016 at 14:20 UTC ( #1160403=perlquestion )

kepler has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to write a routine that catches all the words in a string that are written in Unicode Hebrew, Greek or Arabic characters, and places those matches in an array. Then, item by item, I want to create a new array with the HTML hexadecimal entities of each word. I must admit I'm a bit lost here. Also, the string might contain regular words in Latin/English. Can someone please advise? Kind regards, Kepler

Replies are listed 'Best First'.
Re: Unicode words match and catch
by Your Mother (Archbishop) on Apr 14, 2016 at 15:17 UTC

    While this is a confusing topic, at the root it's not too hard as long as you know the input encoding and adjust for the output. UTF-8 probably covers all the characters you want. You will have to do quite a bit of reading to understand what you're doing here, though. This is the nuclear option for explanations: tchrist on UTF-8 in Perl. Normally the only parts you really have to understand are: decode your input from UTF-8 into Perl characters, do your business in Perl, encode your output to UTF-8 (or in your case, ASCII HTML entities). And another basic caveat: if you expect to be able to see UTF-8 in a display layer like a terminal, the layer has to be aware of the encoding you want to use. Unicode is the character set; an encoding is how those characters are stored as bytes. You will be dealing with *an* encoding at input and *an* encoding at output that may or may not be the same: Latin-1, CP-1252, UTF-8, UTF-16, Big5, etc.

    This little snippet might get you started. I had to use <pre/> tags because PM's <code/> tags don't like wide characters. :P

    use utf8;
    use strictures;
    use HTML::Entities "encode_entities_numeric";
    
    binmode STDOUT, ":encoding(UTF-8)";
    # OR use Encode, print encode_utf8(...)
    
    while (<DATA>)
    {
        chomp;
        next unless /\w/;
        print $_, $/;
        print "  -> ",  length, " characters long", $/;
        print "  -> ", encode_entities_numeric($_), $/;
    }
    
    __DATA__
    antennæ
    עברית
    Ελληνικά
    العَرَبِية‎
    
    Output:

    antennæ
      -> 7 characters long
      -> antenn&#xE6;
    עברית
      -> 5 characters long
      -> &#x5E2;&#x5D1;&#x5E8;&#x5D9;&#x5EA;
    Ελληνικά
      -> 8 characters long
      -> &#x395;&#x3BB;&#x3BB;&#x3B7;&#x3BD;&#x3B9;&#x3BA;&#x3AC;
    العَرَبِية‎
      -> 11 characters long
      -> &#x627;&#x644;&#x639;&#x64E;&#x631;&#x64E;&#x628;&#x650;&#x64A;&#x629;&#x200E;
    

    Further reading: Encode, utf8, perlunitut. Branch out from those as desired.
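
    The decode, process, encode cycle described in this reply can be sketched with the core Encode module. This is only an illustration under assumptions: the input bytes are a made-up sample (the UTF-8 encoding of the Hebrew word from the snippet above, written as hex escapes), and the variable names are not from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample input: the raw UTF-8 bytes of the Hebrew word used in the
# thread, written as hex escapes so the source file itself stays ASCII.
my $bytes = "\xD7\xA2\xD7\x91\xD7\xA8\xD7\x99\xD7\xAA";

# 1. Decode: bytes in a known encoding become a string of Perl characters.
my $text = decode('UTF-8', $bytes);

# 2. Do your business in Perl: length() now counts characters, not bytes.
my $byte_count = length $bytes;
my $char_count = length $text;

# 3. Encode: characters become bytes again on the way out.
my $out = encode('UTF-8', $text);

printf "%d bytes in, %d characters, %d bytes out\n",
    $byte_count, $char_count, length $out;
```

    Skipping step 1 is the usual reason a five-letter word reports a length of ten.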

Re: Unicode words match and catch
by Corion (Pope) on Apr 14, 2016 at 14:24 UTC

    Wouldn't HTML::Entities fit the bill already, without the recognition of the particular alphabets?

Re: Unicode words match and catch
by graff (Chancellor) on Apr 15, 2016 at 02:52 UTC
    Adding to Your Mother's excellent advice above, you'll love the predefined Unicode character classes for the various scripts. Here's a minor enhancement to the script provided above (again, using "pre" tags to avoid the mangling of non-ASCII characters):
    #!/usr/bin/perl
    
    use utf8;
    use strictures;
    use HTML::Entities "encode_entities_numeric";
    
    binmode STDOUT, ":encoding(UTF-8)";
    # OR use Encode, print encode_utf8(...)
    
    while (<DATA>)
    {
        chomp;
        next unless /\w/;
        my $script_label = "";
        for my $script ( qw/Arabic Greek Hebrew/ ) {
            $script_label .= " has $script" if ( /\p{$script}/ );
        }
        print $_, $/;
        print "  -> ",  length, " characters long; $script_label", $/;
        print "  -> ", encode_entities_numeric($_), $/;
    }
    
    __DATA__
    antennæ
    עברית
    Ελληνικά
    العَرَبِية
    
    The output I got from that was:
    antennæ
      -> 7 characters long; 
      -> antenn&#xE6;
    עברית
      -> 5 characters long;  has Hebrew
      -> &#x5E2;&#x5D1;&#x5E8;&#x5D9;&#x5EA;
    Ελληνικά
      -> 8 characters long;  has Greek
      -> &#x395;&#x3BB;&#x3BB;&#x3B7;&#x3BD;&#x3B9;&#x3BA;&#x3AC;
    العَرَبِية
      -> 10 characters long;  has Arabic
      -> &#x627;&#x644;&#x639;&#x64E;&#x631;&#x64E;&#x628;&#x650;&#x64A;&#x629;
    
    To put that another way, you can match and store strings of characters in particular, language-specific scripts with something like this:
    # Assuming $_ contains the input:
    my @hebrew_parts = /\p{Hebrew}+/g;
    my @arabic_parts = /\p{Arabic}+/g;
    my @greek_parts  = /\p{Greek}+/g;
    Similarly for Han, Cyrillic, Ethiopic, Thai, Devanagari, etc. (As shown above, you have the option of parameterizing the script label as a loop variable.)
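
    Putting the pieces of this thread together for the original question (one array of matched words, a second array of their hex entities) might look like the sketch below. The helper names foreign_words and to_hex_entities and the sample string are hypothetical. to_hex_entities is a core-only stand-in so the example runs without CPAN modules; unlike encode_entities_numeric from HTML::Entities, it leaves <, & and > alone, and the real function can be swapped in.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is written in UTF-8

# Hypothetical helper: turn every non-ASCII character into a numeric hex
# entity. A stand-in for HTML::Entities::encode_entities_numeric().
sub to_hex_entities {
    my ($word) = @_;
    $word =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord $1)/ge;
    return $word;
}

# Hypothetical helper: every run of Hebrew, Greek or Arabic characters.
# Call it in list context to get all matches.
sub foreign_words {
    my ($text) = @_;
    return $text =~ /([\p{Hebrew}\p{Greek}\p{Arabic}]+)/g;
}

# Assumed sample input mixing English with two of the scripts.
my $text     = 'some english, עברית and Ελληνικά mixed in';
my @words    = foreign_words($text);
my @entities = map { to_hex_entities($_) } @words;

# Only the ASCII entity strings are printed, so no output layer is needed.
print "$_\n" for @entities;
```

    The English words never match the script classes, so they are simply left out of both arrays. One caveat to check against your own Perl: before 5.26 the single form \p{Arabic} means Script=Arabic, which as far as I know does not cover the combining vowel marks, so a vocalized Arabic word may come back in pieces on older versions.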
