Endless has asked for the wisdom of the Perl Monks concerning the following question:
Hello friends,
I am converting a lexical processor I wrote in Java to Perl; text processing is supposed to be very good in Perl, and I'm using it as an opportunity to learn Perl. However, although my initial version produces the right output, it does so around 70 times slower than my Java implementation, where I was using a home-made Trie. According to Devel::NYTProf, the bottleneck is in _walk_tree of Tree::Trie, which brings me to my question: what is a highly time-efficient way to match words and/or phrases against a target sentence, where the match also returns (or allows access to) supplementary data on the matched item?
Here is the algorithm I need to implement efficiently:
my $lexicon = $csv->parse;  # words to match against, and supplementary data to go with matches
foreach <tweet> {
    foreach <word_in_tweet> {
        if ($lexicon includes <word_in_tweet>) {
            save match.supplementary_data TO tweet.result_data;
        }
    }
}
Supplementary data includes topics and sentiment values corresponding to each word/phrase in my dictionary. In the end, I need to know all the topics that match in each tweet.
Important caveat: The dictionary may include multi-word entries, so these need to be matched as well and preferred over shorter matches.
The Question
What might be the best Perl structure to fulfill my needs for:
- Matching each and every word and multi-word phrase from my dictionary?
- Retrieving supplementary dictionary data for each matched word/phrase?
- Fulfilling these points with high efficiency?
Is there a more efficient tree implementation? Is Perl's internal hash implementation likely to offer sufficiently efficient alternatives? Can you think of something I'm missing?
Thank you very much for your help!
Update:
For my project, the best results were in line with BrowserUK's suggestion: hashes were vastly superior, although getting multi-word matches was a little trickier than it would have been with a regex. Switching from the Trie to a hash improved my speed by a factor of nearly 800.
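For anyone landing here later, this is a minimal sketch of how a hash-based longest-match lookup along these lines might look. It is not the code from my project: the lexicon contents, the field names (topic, sentiment), the tokenizing regex, and the match_tweet sub are all made up for illustration. The idea is to key the hash on the lowercased phrase, note the longest entry in words, and at each position in the tweet try the longest candidate phrase first.

```perl
use strict;
use warnings;

# Hypothetical lexicon: lowercased phrase => supplementary data.
# In a real run this would be built from the parsed CSV.
my %lexicon = (
    'new'           => { topic => 'novelty', sentiment => 1 },
    'new york'      => { topic => 'places',  sentiment => 0 },
    'new york city' => { topic => 'places',  sentiment => 0 },
    'love'          => { topic => 'emotion', sentiment => 2 },
);

# Length (in words) of the longest entry, so we never probe longer phrases.
my $max_words = 1;
for my $phrase (keys %lexicon) {
    my @parts = split ' ', $phrase;
    $max_words = @parts if @parts > $max_words;
}

# Return one hashref per match, preferring the longest phrase at each position.
sub match_tweet {
    my ($tweet) = @_;

    # Assumed tokenization: runs of word characters and apostrophes, lowercased.
    my @words = map { lc $_ } $tweet =~ /[\w']+/g;

    my @matches;
    my $i = 0;
    while ($i < @words) {
        my $advance = 1;
        my $longest = @words - $i;
        $longest = $max_words if $longest > $max_words;

        # Try the longest candidate first, then shrink one word at a time.
        for (my $len = $longest; $len >= 1; $len--) {
            my $phrase = join ' ', @words[$i .. $i + $len - 1];
            if (exists $lexicon{$phrase}) {
                push @matches, { phrase => $phrase, %{ $lexicon{$phrase} } };
                $advance = $len;    # skip the words consumed by this match
                last;
            }
        }
        $i += $advance;
    }
    return @matches;
}

my @found = match_tweet('I love New York City in the fall');
print join(', ', map { "$_->{phrase} ($_->{topic})" } @found), "\n";
```

Each position costs at most $max_words hash lookups, so the work per tweet grows with the tweet length and the longest dictionary entry rather than with the dictionary size. Skipping past a matched phrase means shorter entries inside it (here 'new' and 'new york') are not reported separately; if overlapping matches are wanted, advance by one word instead.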