Polish Characters

by ReiVo (Initiate)
on May 20, 2010 at 14:38 UTC (#840947, perlquestion)

ReiVo has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all,
I am having a lot of trouble analyzing a piece of text in
Polish. I want to read all the words of a text and count them.
What you would normally do is set up pattern matching
using something like \w. That does not work: it leaves out the
special Polish letters, such as the barred l (ł) and friends.
My next approach was:
use POSIX qw(locale_h) ;
setlocale(LC_ALL,"Polish_Poland") or die "Could not set locale";
That runs and does not complain, but the effect is the same as before.
I am running the ActiveState distro on WinXP.
Thanks a lot for every hint.

Replies are listed 'Best First'.
Re: Polish Characters
by Corion (Pope) on May 20, 2010 at 14:42 UTC

    You will have to think about what encoding your source code is in, what encoding your target data is in, and what encoding your output file should be in. Then you will need to use Encode::decode to transform your input and output between the wanted encodings, and the utf8 (or encoding) pragma to tell Perl about the encoding of your source code.
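To make the three encodings concrete, here is a minimal sketch. It assumes, purely for illustration, that the script itself is saved as UTF-8, that the input arrives as ISO-8859-2 bytes, and that the output should be UTF-8; none of these are known facts about the original poster's data.

```perl
use strict;
use warnings;
use utf8;                       # 1. source code: this file is assumed to be saved as UTF-8
use Encode qw(decode encode);

# 2. input: raw bytes as they might arrive from a file.
#    Assumption for illustration: the Polish word "żółw" in ISO-8859-2.
my $input_bytes = "\xBF\xF3\xB3w";
my $word        = decode('ISO-8859-2', $input_bytes);

# Once decoded, the input compares equal to a literal typed into the source.
print "decoded input matches the literal\n" if $word eq 'żółw';

# 3. output: encode explicitly on the way out, here to UTF-8.
my $output_bytes = encode('UTF-8', $word);

printf "%d input bytes, %d characters, %d output bytes\n",
    length($input_bytes), length($word), length($output_bytes);
```

The point of the sketch is only that bytes and characters are different things: everything between decode and encode works on characters, whatever the encodings at the edges are.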

Re: Polish Characters
by moritz (Cardinal) on May 20, 2010 at 15:15 UTC
    I tend to avoid locales, and rely on Unicode semantics for regex matching, because it involves less opaque magic, and also recognizes word characters from other languages (which I consider a feature).

    As Corion mentioned, you have to find out what encodings your input data and script are in, and decode it before using string operations on it.

    See encodings and Unicode in Perl and the Perl Unicode and UTF-8 wikibook for detailed information.
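A small sketch of why decoding has to come first. The byte string below is an assumption standing in for UTF-8 data read from a file without any decoding; the same \w+ pattern is applied before and after Encode::decode. No locale and no unicode_strings feature are in effect here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for the Polish word "późno", as if read from a file undecoded.
my $bytes = "p\xC3\xB3\xC5\xBAno";

# The same data decoded into a character string.
my $chars = decode('UTF-8', $bytes);

# Identical pattern, different semantics: byte string vs. character string.
my @raw_tokens     = $bytes =~ /(\w+)/g;
my @decoded_tokens = $chars =~ /(\w+)/g;

printf "undecoded: %d token(s)\n", scalar @raw_tokens;
printf "decoded:   %d token(s)\n", scalar @decoded_tokens;
```

On the undecoded bytes the word falls apart at every non-ASCII byte, which matches the symptom described in the question; on the decoded string it is matched as one word.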

Re: Polish Characters
by almut (Canon) on May 20, 2010 at 15:05 UTC
    "pattern matching using something like \w"

    In addition to what Corion said, note that there are also various Unicode category properties available via escapes, e.g. \p{L} (or, in long form, \p{Letter}) for letters, which you can make use of once you have properly decoded your input. See perlunicode for details.
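As a rough illustration of the difference between \w and \p{L}, the following sketch tokenizes a made-up sample string both ways. It assumes the script is saved as UTF-8 (hence use utf8); the sample text is invented for the example.

```perl
use strict;
use warnings;
use utf8;                               # assumption: this file is saved as UTF-8
binmode(STDOUT, ':encoding(UTF-8)');    # encode characters on the way out

# Invented sample text mixing letters, digits and an underscore.
my $text = "łoś_1 i 2 żółwie";

# \w also accepts digits and the underscore ...
my @w_tokens = $text =~ /(\w+)/g;

# ... while \p{L} accepts letters only.
my @letter_tokens = $text =~ /(\p{L}+)/g;

print scalar(@w_tokens),      " tokens with \\w+: @w_tokens\n";
print scalar(@letter_tokens), " tokens with \\p{L}+: @letter_tokens\n";
```

Which of the two is the better definition of a "word" depends on the text being counted; \p{L}+ is the stricter choice if numbers should not count as words.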

Re: Polish Characters
by zby (Vicar) on May 21, 2010 at 10:13 UTC
    Hi there, you posted that to the mailing list and I answered you there. Did you not receive my email? Maybe the list server is not working correctly. Anyway, here is the example that I typed for you:
    use utf8;
    binmode(STDOUT, ":utf8");
    my $string = "azsc";
    while( $string =~ /(\w)/g ){
        print $1;
    }
    print "\n";
    __OUTPUT__
    azsc
    Replace the 'azsc' with UTF-8 encoded Polish characters and it should work; unfortunately PerlMonks mangles the characters when I try to input them here. This does not depend on the locale. Instead it uses character semantics, which I think is the more modern approach. The important point is that your data needs to be correctly decoded into characters. Here I used the utf8 pragma so I could put the characters into the program text, but if you read the data from outside sources you need to decode it. This is covered in multiple online sources, for example:
    open(my $fh, "<:encoding(UTF-8)", "filename") || die "can't open UTF-8 encoded filename: $!";
    This snippet is part of the documentation for Perl's 'open' function.
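Putting the pieces together for the original task, a minimal sketch of counting words read through an :encoding(UTF-8) layer might look like this. The temporary file and its contents are made up so the example is self-contained; a real script would open the actual text file instead, with whatever encoding that file really uses.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-8 sample file (hypothetical data: two short Polish lines,
# written with \x{} escapes so the script itself stays pure ASCII).
my ($tmp, $filename) = tempfile();
binmode($tmp, ':encoding(UTF-8)');
print {$tmp} "\x{17B}\x{F3}\x{142}w i \x{142}o\x{15B}\n";   # "Żółw i łoś"
print {$tmp} "\x{17C}\x{F3}\x{142}w \x{15B}pi\n";           # "żółw śpi"
close($tmp) or die "can't close $filename: $!";

# Read it back; the layer decodes the bytes into characters for us.
open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "can't open UTF-8 encoded $filename: $!";

my %count;
my $total = 0;
while (my $line = <$fh>) {
    for my $word ($line =~ /\w+/g) {
        $count{ lc $word }++;      # lc folds case with character semantics
        $total++;
    }
}
close($fh);
unlink $filename;

printf "%d words, %d distinct\n", $total, scalar keys %count;
```

Lower-casing before counting is a design choice, not a requirement; drop the lc if "Żółw" and "żółw" should be counted separately.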
