Efficient matching with accompanying data
by Endless (Beadle)
on Jul 11, 2013 at 00:23 UTC
Endless has asked for the
wisdom of the Perl Monks concerning the following question:
I am converting a lexical processor I wrote in Java to Perl; Perl is supposed to be very good at text processing, and I'm using this as an opportunity to learn the language. However, although my initial version produces the right output, it runs around 70 times slower than my Java implementation, where I was using a home-made Trie. According to Devel::NYTProf, the bottleneck is in _walk_tree of Tree::Trie, which brings me to my question: what is a highly time-efficient way to match words and/or phrases against a target sentence, where each match also returns or allows access to supplementary data on the matched item?
Here is the algorithm I need to implement efficiently:
Supplementary data includes topics and sentiment values corresponding to each word/phrase in my dictionary. In the end, I need to know all the topics that match in each tweet.
Important caveat: The dictionary may include multi-word entries, so these need to be matched as well and preferred over shorter matches.
The Question
What might be the best Perl structure to fulfill my needs for:
Is there a more efficient tree implementation? Is Perl's internal hash implementation likely to offer sufficiently efficient alternatives? Can you think of something I'm missing?
Thank you very much for your help!
Update: For my project, the best results were in line with BrowserUK's suggestion: hashes were vastly superior, although multi-word matches were a little trickier to get with hashes than they would have been with a regex. Switching from Trie to Hash improved my speed by a factor of nearly 800.
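For anyone who finds this later, here is a minimal sketch of the kind of hash-based, longest-match-first lookup described above. It is not my actual project code: the dictionary entries, the field names (topics, sentiment), the tokenizing regex, and the sub name match_sentence are all made up for illustration. The idea is to key a plain hash on the lowercased phrase, then at each token position try the longest possible window first and fall back to shorter ones.

```perl
use strict;
use warnings;

# Hypothetical dictionary: lowercased phrase => supplementary data.
my %dict = (
    'new'           => { topics => ['misc'],    sentiment => 0 },
    'new york'      => { topics => ['places'],  sentiment => 0 },
    'new york mets' => { topics => ['sports'],  sentiment => 1 },
    'love'          => { topics => ['emotion'], sentiment => 2 },
);

# Longest entry, counted in words, so the scan never tries wider windows.
my $max_words = 1;
for my $phrase (keys %dict) {
    my @words = split ' ', $phrase;
    $max_words = @words if @words > $max_words;
}

# Return one hashref per match, preferring the longest phrase at each position.
sub match_sentence {
    my ($sentence, $dict, $max) = @_;

    # Naive tokenizer (an assumption): runs of word characters and apostrophes.
    my @tokens = lc($sentence) =~ /[\w']+/g;

    my @matches;
    my $i = 0;
    while ($i < @tokens) {
        my $longest = @tokens - $i;
        $longest = $max if $longest > $max;

        my $advance = 1;    # skip a single token when nothing matches
        for (my $len = $longest; $len >= 1; $len--) {
            my $candidate = join ' ', @tokens[$i .. $i + $len - 1];
            if (exists $dict->{$candidate}) {
                push @matches, { phrase => $candidate, %{ $dict->{$candidate} } };
                $advance = $len;    # consume the whole matched phrase
                last;
            }
        }
        $i += $advance;
    }
    return @matches;
}

for my $m (match_sentence('I love the New York Mets and New York', \%dict, $max_words)) {
    print "$m->{phrase}: @{ $m->{topics} } (sentiment $m->{sentiment})\n";
}
```

On that sample sentence the sketch reports "love", "new york mets" and "new york", never the shorter "new", because each position is tested from the widest window down. Every probe is a single hash lookup, so the cost per token is bounded by the longest dictionary entry rather than by the size of the dictionary.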