Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Filtering out stop words

by IB2017 (Pilgrim)
on Feb 25, 2020 at 09:43 UTC ( #11113388=perlquestion: print w/replies, xml ) Need Help??

IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I used to check if a work needs to be exluded from processing checking if it is contained in a stop words list. I used this method:

my $CkDiscardCommonwords=1;#check if use stopwords or not my $term="word"; my $commonwordsRX = loadCommonWords (); if ($CkDiscardCommonwords eq 1){ if ($term =~ /^(?:$commonwordsRX)$/){ return (0); } } sub loadCommonWords { my @commonwords; my $filename="commonWords.txt"; if (open $FH, "<:encoding(UTF-8)", $filename) { while (my $line = <$FH>) { chomp $line; push @commonwords, $line; } close $FH; } my $commonwordsRX = join "|", map quotemeta, @commonwords; return $commonwordsRX; }

Now my sooftware has changed and the list of common words saved in commonWords.txt may grow exponencially. It used to be small (~300 words), now it could reach x-thousands.

I would like to hear what expert monks think about this implementation. Would a Regex constructed in this way cause problems when it grows? Should I choose another approach?

Replies are listed 'Best First'.
Re: Filtering out stop words (updated)
by haukex (Bishop) on Feb 25, 2020 at 10:01 UTC

    As I commented over in Building Regex Alternations Dynamically, it's possible to generate a regex from the entirety of /usr/share/dict/words, which on my system currently has over 100,000 entries, resulting in a regex that has a string length of 1MB. Matching against that regex is still relatively performant. So building a regex in the way you showed is possible; whether it's the best solution in your case probably depends on how many matches you'll be doing with that regex, and you'll have to measure the performance in your use case. I would recommend that loadCommonWords should return a regex precompiled with qr// instead of a string, and that you sort @commonwords by length, as I showed in the aforementioned thread.

    Update: Eily is right, I overlooked the anchors: for exact string matches, definitely use a hash instead.

      What worried me is the growth factor of my data (stop words, etc.) since the scripts were first designed. I definitely need to write tests to check performance on the real-life application.

Re: Filtering out stop words
by Eily (Monsignor) on Feb 25, 2020 at 10:09 UTC

    If you're searching for an exact match you might as well use a hash:

    my $commonWords = {}; # Make commonWords a ref to an empty hash ... while (my $line = <>) { chomp($line); $commonWords->{$line} = 1; } ... if ($CkDiscardCommonwords eq 1) { return(0) if exists $commonWords->{$term}; }
    I suppose you extracted the code from your script and that's why you have a return outside of a function?

Re: Filtering out stop words
by bliako (Prior) on Feb 25, 2020 at 12:29 UTC

    As the other monks said, with the regex method, you have to check a $term against each stopword until a match is found. If no match found, you end up checking with ALL stopwords. The benefit is that it offers you non-exact matching, useful in checking against all the variations of a stopword.

    On the other hand you have a hashtable which will offer you only exact matching (therefore if you have variations they will have to be entered each as individual stopwords). But fetching is O(1). Constructing can be expensive but once you construct it you perhaps can serialise it and save to to disk for later use.

    The approach I am suggesting is using a binary tree to store your exact stopwords and all their variations if any. In this way you will do at most as many string comparisons as the height of the tree. And what's that? Little if your stopwords can make a nice balanced tree or not. This data structure can also be serialised and saved to disk.

    Here is some code to get you started:

    # author: bliako # date: 25/02/2020 # for: https://perlmonks.org/?node_id=11113388 use strict; use warnings; use AVLTree; # create a binary tree with custom string comparator # if cmp < 0, word goes to the left of current node # if cmp > 0, word goes to the right # if cmp == 0, word is identical to the current node my $tree = AVLTree->new(sub { $_[0] cmp $_[1] }); open(my $IN, '<', '/usr/share/dict/words'); my $tstart = time; my $numwords = 0; while( my $aword=<$IN> ){ chomp $aword; $tree->insert($aword); $numwords++; } close($IN); print "$0 : tree size: ".$tree->size()." (same as $numwords in file).\ +n"; print "$0 : inserted $numwords words in ".(time-$tstart)." seconds.\n" +; open($IN, '<', '/usr/share/dict/words'); $numwords = 0; $tstart = time; while( my $astopword=<$IN> ){ chomp $astopword; next if rand>0.05; if( ! $tree->find($astopword) ){ print STDERR "$0 : error, did not + find the word '$astopword', that's not right\n"; exit(1); } $numwords++; } close($IN); # edit: forgot that! print "search $numwords words in ".(time-$tstart)." seconds.\n";

    Note that with an XS module like AVLTree it may not be possible to serialise, in which case use a pure Perl module like Tree::Binary.

    Later edit: just to clarify that a binary tree is one in which each node has a maximum of 2 children. Left and Right. An AVL_tree is a binary tree which internally tries to be balanced without user interaction. What is "balanced tree"? It's one that all leaf nodes more or less have the same distance from the root node. So long branches and short branches are avoided. That gives consistent performance. etc.

    AVL trees were invented in the Soviet Union in 1962. One of the two mathematicians who invented it also participated in building Kaissa, the first computer chess program to win a championship in 1974.

    Another Later Edit: The benefit of using a tree over using a hashtable is that the tree's capacity can not be exhausted unless you run out of RAM. With a hashtable you are limited by the hash-key generator and other internal implementations, so practically its size may have a limit - depending on implementation.

    bw, bliako

Re: Filtering out stop words
by talexb (Canon) on Feb 25, 2020 at 13:17 UTC

    You're building a lookup cache .. there are many ways to do it -- I agree that using a hash is the best way. That being said, depending on how often you'll be using this logic, you might want to think about figuring out whether you load the hash every time you do a search, and how often you re-load, when new words are added. You could freeze and thaw your hash to improve start-up performance. You could even see if loading the words into a database (perhaps using sqlite3) would be a better solution. It's all a balancing act between performance and ease of use (when updating).

    Actually, that's also a pretty cool interview question. :)

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      A bloom filter in front of an authoritative sqlite DB might be an interesting / fruitful path of exploration depending on the size of your word sets.

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

Re: Filtering out stop words
by Eily (Monsignor) on Feb 25, 2020 at 14:04 UTC

    Yet another way to do it, which might be way faster if you have very few words to check (in your example you have only one, but this might be done in a loop) is to do it the other way around. Collect the words to be tested first, and then check if the any of the words in your dictionary match:

    my @words = get_words_to_check(); my %hash = map { $_ => 1 } @words; while (my $line = <>) { chomp $line; delete $hash{$line} if exists $hash{$line}; # The if exists isn't re +quired here, but it does make it look cleaner } my @good_words = grep { exists $hash{$_} } @words; # Keep the original + order my @good_words_2 = keys %hash; # Don't care about the original order

    Or, if borth word lists are sorted, something like this might work:

    my $index = 0; LINE: while (my $line = <>) { chomp $line; # While the word in the dictionary is past (or equal) to the word to + check while ($line ge $words[$index]) { # Store the word as OK, unless it is equal to the current dict wor +d push @good_words, $words[$index] unless $line eq $words[$index]; # Use the next word from the list of words to check, if any last LINE if ++$index == @words; } }

Re: Filtering out stop words
by Ea (Chaplain) on Feb 26, 2020 at 09:45 UTC
    Is your stop words file growing because you are adding all possible endings for your words? In which case you might start thinking about Stemming. The standard module that implements Porter's stemming is Lingua::Stem or maybe Text::Context::Porter is more to your liking. If you're text processing you'll probably end up here eventually.

    Ea

    Sometimes I can think of 6 impossible LDAP attributes before breakfast.

    Mojoconf was great!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11113388]
Approved by haukex
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (3)
As of 2020-10-23 06:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (235 votes). Check out past polls.

    Notices?