PerlMonks  

To find Cyrillic characters - unicode

by Anonymous Monk
on Aug 03, 2007 at 04:35 UTC (#630446=perlquestion)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I want to detect Cyrillic characters in a file. The Cyrillic block spans U+0400 to U+04FF. My XML file contains <cd></cd> tags, and the script should validate that each cd element contains only Cyrillic characters. If it contains characters from any other character set, it should report an error. I've tried the following code:
use Encode; use Encode::Unicode;
binmode STDOUT, ":utf8";
while ($val =~ /<cd>(.*?)<\/cd>/gsi) {
    my $no = decode_utf8($1);
}
But I do not know how to find the Unicode value of each character. Can anyone shed some light on this? Thanks in advance. --c

Replies are listed 'Best First'.
Re: To find Cyrillic characters - unicode
by Zaxo (Archbishop) on Aug 03, 2007 at 04:53 UTC

    The regex Unicode block property '\P{InCyrillic}' will get you what you want: with a capital P, it matches any character outside the Cyrillic block. You may need to open the file in ':utf8' mode.

    Isolating your match to particular xml elements will require one of the XML modules. That ought to make the text utf8 by default, but old perls may be idiosyncratic about that.
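A minimal sketch of this check, assuming the XML text has already been read into a decoded character string (for example from a filehandle opened with '<:encoding(UTF-8)'). The helper name bad_cd_elements and the sample data are hypothetical, and the <cd> elements are pulled out with the regex from the question rather than with an XML module, so this illustrates the property test only, not robust XML handling:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the content of every
# <cd>...</cd> element holding at least one character outside the Unicode
# Cyrillic block (U+0400..U+04FF).
sub bad_cd_elements {
    my ($xml) = @_;    # assumed to be a decoded character string, not raw bytes
    my @bad;
    while ( $xml =~ m{<cd>(.*?)</cd>}gsi ) {
        my $content = $1;
        # \P{InCyrillic} (capital P) matches any character NOT in the block.
        push @bad, $content if $content =~ /\P{InCyrillic}/;
    }
    return @bad;
}

# Sample data written with \x{...} escapes so the encoding of this source
# file does not matter. The second element contains a space and Latin letters.
my $xml = "<doc><cd>\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}</cd>"
        . "<cd>\x{43C}\x{438}\x{440} abc</cd></doc>";

binmode STDOUT, ':encoding(UTF-8)';
print "Bad content: $_\n" for bad_cd_elements($xml);
```

Note that an empty <cd></cd> element is not flagged by this sketch, because an empty string contains no non-Cyrillic character.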

    After Compline,
    Zaxo

      Many Thanks!!!!!!!!!!
Re: To find Cyrillic characters - unicode
by graff (Chancellor) on Aug 03, 2007 at 14:12 UTC
    "Script should validate that cd element contains only cyrillic characters. If it contains other character set, it should prompt an error."

    Um... is it okay for text within <cd>...</cd> to include spaces, digits, punctuation, etc? These lie outside the Unicode Cyrillic range, but might not be "errors".

    If the tag really is supposed to contain only Cyrillic letters (no whitespace, digits, etc), then something like warn "Bad content: $cdstr\n" if ($cdstr =~ /\P{InCyrillic}/); really is all you need, as Zaxo suggested.

    Adding more to the "acceptable characters" list is not too complicated (although I did have some trouble with methods that I expected to work based on the perlunicode man page). This seems to work okay for the case where whitespace, digits and punctuation are acceptable along with Cyrillic:

    warn "Bad content: $cdstr\n" unless ( $cdstr =~ /^(?:[\s\d\p{Punctuation}]|\p{Cyrillic})+$/ );
    (I had expected that I could put a bunch of "\p{...}" things inside a single [...] character class, but that didn't work as expected in 5.8.6 or 5.8.8; I even had trouble defining my own subroutine, along the lines explained in perlunicode, and demonstrated here by japhy -- my subroutine ran, but the results were not as expected. I'll be posting a question/bug report to the perl-unicode mailing list.)
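To make the behaviour of the allow-list regex above concrete, here is a small sketch that wraps it in a hypothetical helper, cd_content_ok, and runs it on a few sample strings. The helper name and the samples are assumptions for illustration, not part of the thread:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the allow-list regex from the post above:
# true when the string is non-empty and consists only of whitespace,
# digits, punctuation, and Cyrillic-script characters.
sub cd_content_ok {
    my ($cdstr) = @_;
    return $cdstr =~ /^(?:[\s\d\p{Punctuation}]|\p{Cyrillic})+$/ ? 1 : 0;
}

# Samples use \x{...} escapes so the source file encoding does not matter.
my @samples = (
    [ 'Cyrillic with comma, space, digits' =>
      "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}, \x{43C}\x{438}\x{440} 2007!" ],
    [ 'Latin letters only'                 => "Hello" ],
    [ 'Cyrillic plus a math symbol'        => "\x{43C}\x{438}\x{440} + 1" ],
);

for my $sample (@samples) {
    my ( $label, $text ) = @$sample;
    printf "%-36s %s\n", $label, cd_content_ok($text) ? 'ok' : 'Bad content';
}
```

Characters such as "+" and "$" belong to Unicode Symbol categories rather than Punctuation, so this pattern rejects them. If they should be accepted, the character class would need to be widened accordingly.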

Node Type: perlquestion [id://630446]
Approved by prasadbabu