Re^2: Problem getting Russian stopwords

Re^3: Problem getting Russian stopwords by Aldebaran (Curate) on Sep 18, 2018 at 19:26 UTC
Can I bother you to post your solution? I'm not quite there yet:

    $ ./2.stopwords.pl
    Possible attempt to separate words with commas at ./2.stopwords.pl line 15.
    два|тебя|даже|всегда|из|он|под|этот|человек|опять|там|ж|после|более|от|вы|ней|не|может|хорошо|и|ей|какая|разве|ты|свою|этом|больше|были|было|почти|что|я|со|другой|моя|какой|всю|при|него|сейчас|если|уже|эту|но|нибудь|впрочем|куда|для|зачем|много|конечно|был|в|три|когда|потому|по|у|этого|уж|мой|того|совсем|или|еще|вот|ним|перед|себе|можно|а|сказал|чтобы|всех|наконец|лучше|ведь|ни|за|тот|бы|тоже|к|до|говорил|надо|жизнь|над|вас|сегодня|они|ли|через|она|все|будет|так|чтоб|ничего|с|во|эти|где|этой|хоть|сказала|один|потом|как|чего|такой|ее|про|никогда|тут|здесь|теперь|быть|сам|без|об|же|им|на|них|ну|кажется|сказать|иногда|кто|нас|меня|есть|мне|раз|то|чуть|была|вдруг|вам|себя|только|да|нельзя|ему|чем|между|его|их|нее|нет|о|том|тем|тогда|всего|мы|будто

    Боже, даруй мне душевный покой Принять то, что я не в силах изменить, Мужество изменить то, что могу, И мудрость отличить одно от другого.
    $ cat 2.stopwords.pl
    #!/usr/bin/perl -w
    use 5.011;
    use utf8;
    binmode STDOUT, ":encoding(UTF-8)";
    use Lingua::StopWords qw( getStopWords );
    my $stopwords = getStopWords('ru');
    use Encode;
    say join "|", map decode("KOI8-R", $_), keys %$stopwords;
    say $/;
    my @words = qw( Боже, даруй мне душевный покой
        Принять то, что я не в силах изменить,
        Мужество изменить то, что могу,
        И мудрость отличить одно от другого.
    );
    say join ' ', grep { !$stopwords->{$_} } @words;
    __END__
    $

что and то are on the list but not "stopped."

One has to use pre tags to see the Cyrillic....
Re^4: Problem getting Russian stopwords by choroba (Cardinal) on Sep 18, 2018 at 19:41 UTC
> Possible attempt to separate words with commas at /home/choroba/1.pl line 17.

The warning is right. Remove the commas from the `qw` and "что," will become "что".

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
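For anyone following along, here is a minimal sketch of why the commas matter. `qw` splits on whitespace only, so punctuation stays glued to the word and the hash lookup never sees a bare "что". The variable names are just for illustration, and the `no warnings 'qw'` is there only to keep the demonstration quiet.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# qw() splits on whitespace only, so a comma stays attached to its word.
# The qw warning is switched off inside this block for the demo only.
my @with_commas = do { no warnings 'qw'; qw( то, что могу, ) };

# Without commas each element is the bare word.
my @without = qw( то что могу );

print join('|', @with_commas), "\n";    # то,|что|могу,
print join('|', @without),     "\n";    # то|что|могу
```

So "то," is not the same hash key as "то", which is one reason the word slipped through the filter.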
Re^5: Problem getting Russian stopwords by Aldebaran (Curate) on Sep 18, 2018 at 21:12 UTC
Even so, this fails to remove the word. It occurs to me that the Russian list would have to be a lot longer, as Russian has cases and genders that transform the words on the list. For example, "одно от другого" means "one from the other", and all of its constituent words are in the list in the nominative masculine form, while the neuter одно and the genitive другого are not.

    $ ./3.stopwords.pl
    Боже даруй мне душевный покой Принять то что я не в силах изменить Мужество изменить то что могу И мудрость отличить одно от другого
    $ cat 3.stopwords.pl
    #!/usr/bin/perl -w
    use 5.011;
    use utf8;
    binmode STDOUT, ":encoding(UTF-8)";
    use Lingua::StopWords qw( getStopWords );
    my $stopwords = getStopWords('ru');
    use Encode;
    # say join "|", map decode("KOI8-R", $_), keys %$stopwords;
    # say $/;
    my $sentence = "Боже, даруй мне душевный покой Принять то, что я не в силах изменить, Мужество изменить то, что могу, И мудрость отличить одно от другого.";
    $sentence =~ s/,//g;
    $sentence =~ s/\.//g;
    my @words = split / /, $sentence;
    say join ' ', grep { !$stopwords->{$_} } @words;
    __END__
    $

As I look at the module for German, which allegedly works, and compare it to the Russian one, which seems not to, I notice that there are 2 lists in both. In the German, I can read the special characters, eszett and umlauts, in the first list, while the second is all diamonds with a question mark in the middle. In the Russian, I can read 0 characters in the first list, and the second list is 100% diamonds with question marks in the middle. I have to wonder if having garden-variety Cyrillic in the first list is not what it needs.
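The point about inflected forms can be checked separately from the encoding question. This is a rough sketch that uses three entries copied from the stop word list printed earlier in the thread (один, от, другой) and looks up the forms from the sentence. The hash here is a hand-built stand-in, not the module's own data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Three entries copied from the list printed earlier in the thread.
my %stopwords = map { ($_ => 1) } qw( один от другой );

# An exact-match lookup only catches the listed form of each word.
my %status;
for my $word (qw( один одно от другой другого )) {
    $status{$word} = exists $stopwords{$word} ? 'stopped' : 'kept';
    print "$word: $status{$word}\n";
}
```

With an exact-match lookup, одно and другого are kept even though their dictionary forms are listed. Catching them would presumably need stemming or a longer list.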
Re^4: Problem getting Russian stopwords by Anonymous Monk on Sep 19, 2018 at 07:26 UTC
> `map decode("KOI8-R", $_), keys %$stopwords;`

The problem is that your stopwords are left undecoded in the hash. You should produce a new hash containing transformed keys instead of throwing the results of `decode` out:

    my %stopwords;
    undef @stopwords{ map decode("KOI8-R", $_), keys %{getStopWords('ru')} };

Also, the stop words are in lower case, which means that you should lowercase your text too before checking whether it's a stopword or not.

    say join ' ', grep { ! exists $stopwords{lc $_} } @words;

You may want to `split` your text on `/\W+/` to get the words in one operation.

Успехов! (Good luck!)
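Putting those three suggestions together, here is a self-contained sketch. To keep it runnable without Lingua::StopWords installed, it assumes the module hands back a hash whose keys are KOI8-R bytes and fakes that with `encode` on a few words taken from the list printed earlier. `%raw` is a hypothetical stand-in for `getStopWords('ru')`. It also uses the more common `@hash{...} = ()` slice assignment in place of `undef @hash{...}`, which should leave the same keys.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw( decode encode );
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for getStopWords('ru'): keys are KOI8-R bytes.
my @sample = qw( то что я не в и от мне );
my %raw;
$raw{ encode('KOI8-R', $_) } = 1 for @sample;

# Build a NEW hash whose keys are decoded character strings.
my %stopwords;
@stopwords{ map { decode('KOI8-R', $_) } keys %raw } = ();

my $text = 'Боже, даруй мне душевный покой Принять то, что я не в силах изменить, '
         . 'Мужество изменить то, что могу, И мудрость отличить одно от другого.';

# Split on runs of non-word characters, dropping commas and the period.
my @words = split /\W+/, $text;

# Lowercase each word before the lookup so "И" matches "и".
my @kept = grep { !exists $stopwords{ lc $_ } } @words;

print join(' ', @kept), "\n";
```

With the real module, the `%raw` part would be replaced by the hash reference returned from `getStopWords('ru')`. The rest should stay the same if the keys really are KOI8-R bytes.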
Re^5: Problem getting Russian stopwords by Your Mother (Archbishop) on Sep 19, 2018 at 08:08 UTC
> `undef @stopwords{ map decode("KOI8-R", $_), keys %{getStopWords("ru")} };`

That's a really interesting hash slice trick. I like it. Haven't learned a new Perl idiom in a long time. Thanks.

I don't know anything about Cyrillic alphabets. Would `fc` be preferable to `lc` here or is it irrelevant given the character set?
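A quick way to poke at the `fc` versus `lc` question is to compare the two on a couple of words. This is only a sketch: `fc` needs Perl 5.16 or later with `use feature 'fc'`, and the German word is included purely as a contrast case where the two functions are known to differ. The sample words are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use feature 'fc';    # fc is available from Perl 5.16
binmode STDOUT, ':encoding(UTF-8)';

my %folded;
for my $word (qw( Что И Straße )) {
    # Record both the lowercased and the case-folded form.
    $folded{$word} = { lc => lc($word), fc => fc($word) };
    print "$word: lc=$folded{$word}{lc} fc=$folded{$word}{fc}\n";
}
```

For the Russian words the two results match, so `lc` looks sufficient for this stop word list. The German ß folds to "ss" under `fc` but is left alone by `lc`, which is the kind of case where `fc` would matter.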
Re^6: Problem getting Russian stopwords by Anonymous Monk on Sep 19, 2018 at 20:02 UTC
Re^7: Problem getting Russian stopwords by Your Mother (Archbishop) on Sep 20, 2018 at 07:47 UTC

