Re: Filtering out stop words by bliako (Monsignor)
on Feb 25, 2020 at 12:29 UTC
As the other monks said, with the regex method you have to check a $term against each stopword until a match is found. If no match is found, you end up checking it against ALL the stopwords. The benefit is that it gives you non-exact matching, which is useful for catching all the variations of a stopword.
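To make the cost of the regex method concrete, here is a minimal sketch. The pattern list and the name is_stopword_regex are made up for illustration; real stopword patterns would differ.

```perl
use strict;
use warnings;

# Hypothetical stopword patterns; each one can also cover simple variations.
my @stop_patterns = (
    qr/^the$/i,
    qr/^an?$/i,           # "a" and "an"
    qr/^is(?:n't)?$/i,    # "is" and "isn't"
);

# Returns 1 if $term matches any pattern, 0 otherwise.
# A term that is NOT a stopword gets tested against every pattern.
sub is_stopword_regex {
    my ($term) = @_;
    for my $re (@stop_patterns) {
        return 1 if $term =~ $re;
    }
    return 0;
}

my @terms = qw(The cat isn't an animal);
my @kept  = grep { !is_stopword_regex($_) } @terms;
print "@kept\n";
```

Note that "cat" and "animal" each go through all three patterns before they are accepted, which is the worst case described here.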
On the other hand, a hashtable offers only exact matching, so any variations have to be entered as individual stopwords. But lookup is O(1) on average. Constructing the table can be expensive, but once it is built you can serialise it and save it to disk for later use.
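A minimal sketch of the hashtable route, assuming the core Storable module for serialisation. The word list, the temporary file and the name is_stopword_hash are illustrative only; in practice you would store to a fixed path and retrieve it in later runs.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Hypothetical stopword list; every variation has to be listed explicitly.
my @stopwords = qw(the a an is isn't);

# Build the lookup table once, with lower-cased keys.
my %is_stop = map { ( lc($_) => 1 ) } @stopwords;

# Serialise the table to disk and read it back, as a later run would.
my (undef, $file) = tempfile(SUFFIX => '.sto', UNLINK => 1);
store(\%is_stop, $file);
my $loaded = retrieve($file);

# Exact-match lookup: one hash probe per term.
sub is_stopword_hash {
    my ($term) = @_;
    return exists $loaded->{ lc $term } ? 1 : 0;
}

print join(' ', grep { !is_stopword_hash($_) } qw(The cat isn't an animal)), "\n";
```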
The approach I am suggesting is to use a binary tree to store your exact stopwords and all their variations, if any. This way you do at most as many string comparisons as the height of the tree. And what's that? Small if your stopwords form a nicely balanced tree (roughly log2(N) for N stopwords), but up to N if they do not. This data structure can also be serialised and saved to disk.
Here is some code to get you started:
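What follows is a minimal sketch of the idea rather than a definitive implementation. It assumes a plain binary search tree made of hash references, and it gets its balance by building from an already sorted list instead of doing AVL rotations. The names build_tree and tree_lookup and the word list are hypothetical.

```perl
use strict;
use warnings;

# Build a balanced binary search tree from a sorted word list: the middle
# word becomes the root and each half is built the same way.
# "return undef" is deliberate, so an empty child is stored as undef.
sub build_tree {
    my ($words, $lo, $hi) = @_;
    return undef if $lo > $hi;
    my $mid = int( ($lo + $hi) / 2 );
    return {
        word  => $words->[$mid],
        left  => build_tree($words, $lo, $mid - 1),
        right => build_tree($words, $mid + 1, $hi),
    };
}

# Walk down from the root. Returns (found, number of string comparisons).
sub tree_lookup {
    my ($node, $term) = @_;
    my $comparisons = 0;
    while ($node) {
        $comparisons++;
        my $cmp = $term cmp $node->{word};
        return (1, $comparisons) if $cmp == 0;
        $node = $cmp < 0 ? $node->{left} : $node->{right};
    }
    return (0, $comparisons);
}

# Hypothetical stopwords, assumed unique; variations are separate entries.
my @stopwords = sort { $a cmp $b } qw(a an and in is isn't of or the to);
my $root = build_tree(\@stopwords, 0, $#stopwords);

for my $term (qw(the cat)) {
    my ($found, $n) = tree_lookup($root, lc $term);
    printf "%-4s found=%d comparisons=%d\n", $term, $found, $n;
}
```

With these 10 words the tree has height 4, so no lookup needs more than 4 string comparisons. Since the tree is just nested hash references, it can be serialised the same way as a hash, for example with Storable.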
Later edit: just to clarify, a binary tree is one in which each node has at most 2 children, Left and Right. An AVL tree is a binary search tree that keeps itself balanced internally, without user interaction. What is a "balanced tree"? It is one in which all leaf nodes are at more or less the same distance from the root node, so long branches next to short branches are avoided. That gives consistent lookup performance.
AVL trees were invented in the Soviet Union in 1962 by Georgy Adelson-Velsky and Evgenii Landis. Adelson-Velsky also took part in building Kaissa, the chess program that won the first World Computer Chess Championship in 1974.
Another Later Edit: the benefit of using a tree over a hashtable is that the tree's capacity cannot be exhausted unless you run out of RAM. With a hashtable you are limited by the hash-key generator and other implementation internals, so in practice its size may have a limit, depending on the implementation.