PerlMonks  

Re^4: Problem getting Russian stopwords

by Anonymous Monk
on Sep 19, 2018 at 07:26 UTC


in reply to Re^3: Problem getting Russian stopwords
in thread Problem getting Russian stopwords

map decode("KOI8-R", $_), keys %$stopwords;
The problem is that map is called in void context here, so the decoded strings are thrown away and the stop words in the hash stay undecoded. You should build a new hash with the decoded keys instead:
my %stopwords;
undef @stopwords{ map decode("KOI8-R", $_), keys %{getStopWords('ru')} };
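To make the difference concrete, here is a minimal, self-contained sketch. It does not call getStopWords; instead it assumes a hypothetical stand-in hash (`%raw`) whose keys are KOI8-R byte strings, and it spells the slice as `@stopwords{...} = ()`, the more common way to create keys without values, rather than the `undef` form above.

```perl
use strict;
use warnings;
use utf8;                         # the Cyrillic literals below are UTF-8 source text
use Encode qw(decode encode);

# Hypothetical stand-in for the hash returned by getStopWords('ru'):
# the keys are KOI8-R encoded byte strings, not character strings.
my %raw;
$raw{ encode('KOI8-R', $_) } = 1 for qw(и в не на);

# Decode every key and use the results as the keys of a new hash.
# Assigning the empty list to a hash slice creates the keys with undef values.
my %stopwords;
@stopwords{ map { decode('KOI8-R', $_) } keys %raw } = ();

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', sort keys %stopwords), "\n";
```

Only the key set matters afterwards, so lookups should use `exists $stopwords{$word}` rather than testing the (undefined) value.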
Also, the stop words are in lower case, which means that you should lowercase your text too before checking whether it's a stopword or not.
say join ' ', grep { ! exists $stopwords{lc $_} } @words;
You may want to split your text on /\W+/ to get the words in one operation.
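A small sketch of the two suggestions combined, assuming the text has already been decoded into a character string. The stop-word set and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical, already decoded stop-word set (illustration only).
my %stopwords = map { $_ => 1 } qw(и в не на);

my $text = 'Он НЕ был в Москве, и она не была на даче.';

# /\W+/ splits on runs of non-word characters. On a decoded string
# \w matches Cyrillic letters, so spaces and punctuation are dropped.
my @words = split /\W+/, $text;

# Lowercase only for the lookup; the kept words retain their original case.
my @kept = grep { !exists $stopwords{ lc $_ } } @words;

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', @kept), "\n";
```

One caveat: if the text begins with punctuation or whitespace, split /\W+/ returns an empty leading field, which you may want to filter out with grep { length }.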
Good luck,

Re^5: Problem getting Russian stopwords
by Your Mother (Archbishop) on Sep 19, 2018 at 08:08 UTC
    undef @stopwords{ map decode("KOI8-R", $_), keys %{getStopWords("ru")} };

    That’s a really interesting hash slice trick. I like it. Haven’t learned a new Perl idiom in a long time. Thanks.

    I don’t know anything about Cyrillic alphabets. Would fc be preferable to lc here or is it irrelevant given the character set?

      Thank you, in turn, for reminding me of fc! It seems to me that for the Cyrillic alphabet as it is used in Russian, fc and lc are equivalent:
      use v5.16;
      use charnames ':full';
      use List::Util 'all';
      say all { fc eq lc } map chr, ord("\N{CYRILLIC CAPITAL LETTER A}")..ord("\N{CYRILLIC SMALL LETTER YA}")
      __END__
      1
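For contrast, a short sketch of characters where fc and lc do not agree, which is where the choice matters for caseless comparison. The helper name `same_fold` is made up for this example. Note also that Ё (U+0401) and ё (U+0451) lie outside the U+0410..U+044F range tested above, so they are included here explicitly.

```perl
use v5.16;      # enables strict, say, fc and unicode_strings
use warnings;
use utf8;

# Hypothetical helper: 1 if case folding and lowercasing give the same string.
sub same_fold {
    my ($s) = @_;
    return fc($s) eq lc($s) ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';

# Two Russian letters, then German sharp s and Greek final sigma.
for my $s ('Я', 'Ё', 'ß', 'ς') {
    say join ' ', $s, lc($s), fc($s), same_fold($s) ? 'same' : 'differs';
}
```

For the Russian letters the two functions agree, while ß folds to ss and final sigma folds to σ. So lc is enough for Russian-only text, and fc is the safer choice when the input may contain other scripts.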

        That appears to be absolutely classic, terse, ideal test code for this case, and I think you deserve more ++s than the single one you apparently got from me. I wish you would sign in and participate with a username to bank the credibility and goodwill and perhaps develop friendships here. I have a fair amount of animosity for and mistrust of anonymous monks at this point. I would love to see you leave that stable.

        For the interested, I added this to visualize what’s going on:

        $ perl -Mfeature=fc -Mcharnames=:full -Mv5.16
        binmode STDOUT, ":encoding(UTF-8)";
        say join " ", $_, lc, fc, uc for map chr, ord("\N{CYRILLIC CAPITAL LETTER A}")..ord("\N{CYRILLIC SMALL LETTER YA}");

        Came across this again and added the command line invocation to support the character names and the use of fc and say.
