Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

As the other monks said, with the regex method, you have to check a $term against each stopword until a match is found. If no match found, you end up checking with ALL stopwords. The benefit is that it offers you non-exact matching, useful in checking against all the variations of a stopword.

On the other hand you have a hashtable which will offer you only exact matching (therefore if you have variations they will have to be entered each as individual stopwords). But fetching is O(1). Constructing can be expensive but once you construct it you perhaps can serialise it and save to to disk for later use.

The approach I am suggesting is using a binary tree to store your exact stopwords and all their variations if any. In this way you will do at most as many string comparisons as the height of the tree. And what's that? Little if your stopwords can make a nice balanced tree or not. This data structure can also be serialised and saved to disk.

Here is some code to get you started:

# author: bliako # date: 25/02/2020 # for: use strict; use warnings; use AVLTree; # create a binary tree with custom string comparator # if cmp < 0, word goes to the left of current node # if cmp > 0, word goes to the right # if cmp == 0, word is identical to the current node my $tree = AVLTree->new(sub { $_[0] cmp $_[1] }); open(my $IN, '<', '/usr/share/dict/words'); my $tstart = time; my $numwords = 0; while( my $aword=<$IN> ){ chomp $aword; $tree->insert($aword); $numwords++; } close($IN); print "$0 : tree size: ".$tree->size()." (same as $numwords in file).\ +n"; print "$0 : inserted $numwords words in ".(time-$tstart)." seconds.\n" +; open($IN, '<', '/usr/share/dict/words'); $numwords = 0; $tstart = time; while( my $astopword=<$IN> ){ chomp $astopword; next if rand>0.05; if( ! $tree->find($astopword) ){ print STDERR "$0 : error, did not + find the word '$astopword', that's not right\n"; exit(1); } $numwords++; } close($IN); # edit: forgot that! print "search $numwords words in ".(time-$tstart)." seconds.\n";

Note that with an XS module like AVLTree it may not be possible to serialise, in which case use a pure Perl module like Tree::Binary.

Later edit: just to clarify that a binary tree is one in which each node has a maximum of 2 children. Left and Right. An AVL_tree is a binary tree which internally tries to be balanced without user interaction. What is "balanced tree"? It's one that all leaf nodes more or less have the same distance from the root node. So long branches and short branches are avoided. That gives consistent performance. etc.

AVL trees were invented in the Soviet Union in 1962. One of the two mathematicians who invented it also participated in building Kaissa, the first computer chess program to win a championship in 1974.

Another Later Edit: The benefit of using a tree over using a hashtable is that the tree's capacity can not be exhausted unless you run out of RAM. With a hashtable you are limited by the hash-key generator and other internal implementations, so practically its size may have a limit - depending on implementation.

bw, bliako

In reply to Re: Filtering out stop words by bliako
in thread Filtering out stop words by IB2017

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (4)
As of 2022-01-24 15:21 GMT
Find Nodes?
    Voting Booth?
    In 2022, my preferred method to securely store passwords is:

    Results (64 votes). Check out past polls.