http://www.perlmonks.org?node_id=11113393


in reply to Filtering out stop words

As I commented over in Building Regex Alternations Dynamically, it's possible to generate a regex from the entirety of /usr/share/dict/words, which on my system currently has over 100,000 entries, resulting in a regex with a string length of 1MB. Matching against that regex still performs reasonably well. So building a regex the way you showed is feasible; whether it's the best solution in your case probably depends on how many matches you'll be doing with it, so you'll have to measure the performance in your use case. I would recommend that loadCommonWords return a regex precompiled with qr// instead of a string, and that you sort @commonwords by length (longest first, so that a longer word is tried before any shorter word that is a prefix of it), as I showed in the aforementioned thread.
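To make that recommendation concrete, here is a minimal sketch of what such a loadCommonWords might look like. The word list, the \b anchoring, the /i flag, and the sample text are all assumptions for illustration; your real function presumably reads the words from a file and may anchor differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real stop word list.
my @commonwords = qw( a an the of and to in for on );

sub loadCommonWords {
    # Longest words first, so a longer alternative is tried before
    # a shorter word that is a prefix of it; quotemeta escapes any
    # regex metacharacters in the words.
    my $alternation = join '|',
        map  { quotemeta }
        sort { length $b <=> length $a or $a cmp $b } @commonwords;
    # Return a precompiled regex object rather than a string.
    return qr/\b(?:$alternation)\b/i;
}

my $stop_re = loadCommonWords();

my $text = "The cat sat on the mat in the hall";
# Remove each stop word along with any whitespace that follows it.
(my $filtered = $text) =~ s/$stop_re\s*//g;
print "$filtered\n";
```

Because the function returns a qr// object, the pattern is compiled once and can be reused across many matches without being rebuilt from a string each time.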

Update: Eily is right, I overlooked the anchors: for exact string matches, definitely use a hash instead.
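For the exact-match case, a hash lookup could look roughly like the following. This is only a sketch: the word list, the %is_common name, and the lowercasing are my assumptions, not something taken from your code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real stop word list.
my @commonwords = qw( a an the of and to in for on );

# Build the lookup table once; each key is a lowercased stop word.
my %is_common;
$is_common{ lc $_ } = 1 for @commonwords;

my @words = qw( The cat sat on the mat );
# Keep only the words that are not in the stop word hash.
my @kept = grep { !$is_common{ lc $_ } } @words;
print "@kept\n";
```

Each lookup is a single hash access, so the cost per word does not depend on how long the stop word list grows.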

Re^2: Filtering out stop words (updated)
by IB2017 (Pilgrim) on Feb 25, 2020 at 10:26 UTC

    What worries me is how much my data (stop words, etc.) has grown since the scripts were first designed. I definitely need to write tests to check performance in the real-life application.