PerlMonks
Re^5: Multi-thread combining the results together
by Marshall (Canon) on Jul 27, 2019 at 02:53 UTC ( [id://11103495]=note )
That's an interesting idea. I was thinking of trying a single string made of space-separated tokens. In that case the ^ and $ anchors would become \b's, and a grep would not be needed because I would be doing a global match against a single string instead of running the built regex 80K times, once against each token. There is no reason that I couldn't join the tokens with \n instead, and I could try that without modifying build_regex().

As a note, the entries in @tokens are all unique. For each token, I want it either fully copied or not at all (a yes/no decision for each of the 80K tokens). A typical regex has 10-14 terms and produces a result set of about 6 results from the 80K possibilities.

If I can get maybe a 3x from algorithm improvements and another 3x from parallelization, I would be in the <10 minute max run time range, which is "good enough". As it turns out in practice, not every possibility needs to be run, and when a token needs to be investigated further for "close matches", I cache the result.

More than a decade ago, run time was 20 minutes max on a Win 95 machine. One of the "problems" with software that "works" is that it often winds up being applied to larger and larger data sets. The 80K terms are extracted from 3 million input lines. 12 years ago, this was only 200K input lines and a much smaller @tokens array!

I appreciate all of the ideas in this thread! I have a lot of experimentation ahead of me. Ultimately, I would like to develop an algorithm that builds some kind of tree structure which can be traversed much faster than any regex approach. I figure that will be non-trivial to accomplish.

Update: I tried the idea of using a multi-line, global match on a string of \n-separated tokens instead of running a regex on each token individually. This didn't help: it produces the same result, but it is significantly slower than the current code. Next up: I will try the \b idea.
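For anyone following along, here is a minimal sketch of the three variants discussed above: a grep over each token, a global match on a \n-joined string with /m, and a global match on a space-joined string with \b. The sample tokens and the $core pattern are made up for illustration; the real pattern comes from build_regex(), which is not shown in this thread. The sketch assumes tokens contain only word characters, since otherwise \b could match inside a token.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: the real @tokens has about 80K unique entries and
# the real pattern is produced by build_regex().
my @tokens = qw(W1AW W1AX K1ABC W2AW N1AW W1A);
my $core   = 'W1A[A-Z]|N1AW';    # kept as a plain string, see note below

# 1) Per-token approach: one anchored match attempt for every token.
my $anchored  = qr/^(?:$core)$/;
my @per_token = grep { $_ =~ $anchored } @tokens;

# 2) Newline-joined string: /m makes ^ and $ match at every line boundary,
#    and /g in list context returns all matches in one pass.
#    Note: a qr// object carries its own flags, so interpolating $anchored
#    here would NOT pick up /m. The pattern is rebuilt from the string.
my $by_line   = join "\n", @tokens;
my @multiline = $by_line =~ /^(?:$core)$/mg;

# 3) Space-joined string: the ^ and $ anchors become \b.
#    Assumes no token contains a non-word character.
my $by_space = join ' ', @tokens;
my @boundary = $by_space =~ /\b(?:$core)\b/g;

print "per token: @per_token\n";
print "multiline: @multiline\n";
print "boundary:  @boundary\n";
```

All three lists come out the same for this sample. Which variant is faster depends on the actual pattern and data, so it would have to be measured (for example with the core Benchmark module) rather than assumed.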
In Section: Seekers of Perl Wisdom