PerlMonks  

Re^2: Problem getting Russian stopwords

by cormanaz (Deacon)
on Sep 18, 2018 at 17:28 UTC ( [id://1222601]=note )


in reply to Re: Problem getting Russian stopwords
in thread Problem getting Russian stopwords

That did it. Many thanks!

Re^3: Problem getting Russian stopwords
by Aldebaran (Curate) on Sep 18, 2018 at 19:26 UTC

    Can I bother you to post your solution? I'm not quite there yet:

    $ ./2.stopwords.pl 
    Possible attempt to separate words with commas at ./2.stopwords.pl line 15.
    два|тебя|даже|всегда|из|он|под|этот|человек|опять|там|ж|после|более|от|вы|ней|не|может|хорошо|и|ей|какая|разве|ты|свою|этом|больше|были|было|почти|что|я|со|другой|моя|какой|всю|при|него|сейчас|если|уже|эту|но|нибудь|впрочем|куда|для|зачем|много|конечно|был|в|три|когда|потому|по|у|этого|уж|мой|того|совсем|или|еще|вот|ним|перед|себе|можно|а|сказал|чтобы|всех|наконец|лучше|ведь|ни|за|тот|бы|тоже|к|до|говорил|надо|жизнь|над|вас|сегодня|они|ли|через|она|все|будет|так|чтоб|ничего|с|во|эти|где|этой|хоть|сказала|один|потом|как|чего|такой|ее|про|никогда|тут|здесь|теперь|быть|сам|без|об|же|им|на|них|ну|кажется|сказать|иногда|кто|нас|меня|есть|мне|раз|то|чуть|была|вдруг|вам|себя|только|да|нельзя|ему|чем|между|его|их|нее|нет|о|том|тем|тогда|всего|мы|будто
    
    
    Боже, даруй мне душевный покой Принять то, что я не в силах изменить, Мужество изменить то, что могу, И мудрость отличить одно от другого.
    $ cat 2.stopwords.pl 
    #!/usr/bin/perl -w
    use 5.011;
    use utf8;
    binmode STDOUT, ":encoding(UTF-8)";
    use Lingua::StopWords qw( getStopWords );
    my $stopwords = getStopWords('ru');
    use Encode;
    
    say join "|", map decode("KOI8-R", $_), keys %$stopwords;
    say $/;
     
    my @words = qw( Боже, даруй мне душевный покой
    Принять то, что я не в силах изменить,
    Мужество изменить то, что могу,
    И мудрость отличить одно от другого. );
    
    say join ' ', grep { !$stopwords->{$_} } @words;
    __END__ 
    
    $ 

    что and то are on the list but were not "stopped." One has to use pre tags to see the Cyrillic.

      > Possible attempt to separate words with commas at /home/choroba/1.pl line 17.

      The warning is right. Remove the commas from the qw and "что," will become "что".

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
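
The effect of the commas can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical three-word stopword set rather than the real Lingua::StopWords list; it only shows that qw splits on whitespace, so "то," and "то" are different hash keys.

```perl
use strict;
use warnings;
use utf8;

# qw() splits on whitespace only, so punctuation stays attached to the words.
my @raw;
{
    no warnings 'qw';    # silence "Possible attempt to separate words with commas"
    @raw = qw( то, что я );
}
my @clean = qw( то что я );

# Hypothetical three-word stopword set, used only for this demonstration.
my %stop = map { ( $_ => 1 ) } qw( то что я );

my @kept_raw   = grep { !$stop{$_} } @raw;      # "то," is kept: it is not "то"
my @kept_clean = grep { !$stop{$_} } @clean;    # every word is filtered out

binmode STDOUT, ':encoding(UTF-8)';
print scalar(@kept_raw), " kept with commas, ", scalar(@kept_clean), " kept without\n";
```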

        Even so, this fails to remove the words. It occurs to me that the Russian list would have to be a lot longer, because Russian has cases and genders that change the form of the words on the list. For example, "одно от другого" means "one from the other". Its constituent words are on the list in the nominative masculine form (один, другой), but the neuter одно and the genitive другого are not.

        $ ./3.stopwords.pl 
        Боже даруй мне душевный покой
        Принять то что я не в силах изменить
        Мужество изменить то что могу
        И мудрость отличить одно от другого
        $ cat 3.stopwords.pl 
        #!/usr/bin/perl -w
        use 5.011;
        use utf8;
        binmode STDOUT, ":encoding(UTF-8)";
        use Lingua::StopWords qw( getStopWords );
        my $stopwords = getStopWords('ru');
        use Encode;
        
        # say join "|", map decode("KOI8-R", $_), keys %$stopwords;
        # say $/;
        
        my $sentence = "Боже, даруй мне душевный покой
        Принять то, что я не в силах изменить,
        Мужество изменить то, что могу,
        И мудрость отличить одно от другого.";
        
        $sentence =~ s/,//g;
        $sentence =~ s/\.//g;
        
        my @words = split / /, $sentence;
         
        say join ' ', grep { !$stopwords->{$_} } @words;
        __END__ 
        
        $ 
        

        As I look at the module for German, which allegedly works, and compare it to the Russian one, which seems not to, I notice that there are two lists in both. In the German module, I can read the special characters (eszett and umlauts) in the first list, while the second is all diamonds with a question mark in the middle. In the Russian module, I can read none of the characters in the first list, and the second list is 100% diamonds with question marks in the middle. I have to wonder whether the first list needs to contain garden-variety Cyrillic. Abridged listing of the modules:

      map decode("KOI8-R", $_), keys %$stopwords;
      The problem is that your stopwords are left undecoded in the hash. You should produce a new hash containing transformed keys instead of throwing the results of decode out:
      my %stopwords; undef @stopwords{ map decode("KOI8-R", $_), keys %{getStopWords('ru')} };
      Also, the stop words are in lower case, which means that you should lowercase your text too before checking whether it's a stopword or not.
      say join ' ', grep { ! exists $stopwords{lc $_} } @words;
      You may want to split your text on /\W+/ to get the words in one operation.
      Успехов,
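
Putting the three suggestions together (decode the keys, lowercase each word, split on /\W+/) might look like the sketch below. To keep it runnable without the module installed, `fake_stopwords` is a hypothetical stand-in that returns a short list with KOI8-R encoded keys, which is what this thread reports for getStopWords('ru'); with the real module, that call would be swapped in. The `remove_stopwords` helper is likewise an illustrative name, not part of any library.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( decode encode );

# Hypothetical stand-in for getStopWords('ru'): a hash ref whose keys are
# KOI8-R encoded bytes, as this thread reports for the real module.
sub fake_stopwords {
    my %bytes;
    $bytes{ encode( 'KOI8-R', $_ ) } = 1 for qw( то что я не в от и мне );
    return \%bytes;
}

# Decode every key once so the lookup table holds character strings.
my %stopwords;
undef @stopwords{ map decode( 'KOI8-R', $_ ), keys %{ fake_stopwords() } };

sub remove_stopwords {
    my ($text) = @_;
    # Split on runs of non-word characters: drops commas, periods and newlines.
    my @words = grep { length } split /\W+/, $text;
    # The list is lower case, so compare the lowercased form of each word.
    return grep { !exists $stopwords{ lc $_ } } @words;
}

my $sentence = "Боже, даруй мне душевный покой\n"
             . "Принять то, что я не в силах изменить,\n"
             . "Мужество изменить то, что могу,\n"
             . "И мудрость отличить одно от другого.";

binmode STDOUT, ':encoding(UTF-8)';
print join( ' ', remove_stopwords($sentence) ), "\n";
```

Because the stand-in list is much shorter than the real one, the output here will keep more words than the module's list would. Inflected forms such as одно and другого still pass through, since the lookup is an exact match.
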
        undef @stopwords{ map decode("KOI8-R", $_), keys %{getStopWords("ru")} };

        That’s a really interesting hash slice trick. I like it. Haven’t learned a new Perl idiom in a long time. Thanks.

        I don’t know anything about Cyrillic alphabets. Would fc be preferable to lc here or is it irrelevant given the character set?
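
One way to explore the fc question is to compare the two functions directly. This is a small sketch, assuming Perl 5.16 or later (where fc is available); it checks only the 33 capital letters of the modern Russian alphabet, not every Cyrillic character, and uses German ß as a contrasting case where the two functions are known to diverge.

```perl
use v5.16;    # enables the fc and unicode_strings features
use warnings;
use utf8;

# Compare lc and fc for every capital letter of the modern Russian alphabet
# (U+0410..U+042F, plus Ё at U+0401).
my @capitals = ( ( map { chr($_) } 0x0410 .. 0x042F ), "\x{0401}" );
my @differ   = grep { lc($_) ne fc($_) } @capitals;

# German sharp s: lc leaves ß alone, fc folds it to "ss".
my $lc_match = lc("Straße") eq lc("STRASSE") ? 1 : 0;
my $fc_match = fc("Straße") eq fc("STRASSE") ? 1 : 0;

binmode STDOUT, ':encoding(UTF-8)';
printf "Russian letters where lc and fc differ: %d\n", scalar @differ;
printf "Straße vs STRASSE: lc match=%d, fc match=%d\n", $lc_match, $fc_match;
```

For these Russian letters the two functions agree, so lc is sufficient for the stopword lookup in this thread; fc would matter if the same code were reused for a language such as German.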
