Possibly, yes. I'll try to explain. I'm not sure how to do this without flooding you with details, but I'll take my chances.
I have a text corpus. The corpus consists of sentences, and a sentence consists of tokens. Each token has 3 components: a word, a correct tag, and an initial tag. All token elements are encoded as integers.
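To make that concrete, here is a minimal sketch of what such a corpus might look like in Perl. The nesting (corpus, then sentences, then tokens), the constant names, and the integer values are all made up for illustration; the real program may store things differently.

```perl
use strict;
use warnings;

# Hypothetical layout: corpus -> sentences -> tokens,
# where each token is [ word, correct_tag, initial_tag ], all integers.
use constant { WORD => 0, CORRECT => 1, INITIAL => 2 };

my @corpus = (
    [ [ 12, 3, 3 ], [ 45, 7, 5 ], [ 9, 2, 2 ] ],    # sentence 0
    [ [ 45, 7, 7 ], [ 12, 3, 4 ] ],                 # sentence 1
);

# Collect the locations whose initial tag differs from the correct tag.
# These are the tokens the rules would have to fix.
my @errors;
for my $s ( 0 .. $#corpus ) {
    for my $t ( 0 .. $#{ $corpus[$s] } ) {
        my $tok = $corpus[$s][$t];
        push @errors, [ $s, $t ] if $tok->[INITIAL] != $tok->[CORRECT];
    }
}

printf "%d tokens still need a rule\n", scalar @errors;
```

A location here is simply a [sentence index, token index] pair, which is one plausible way to address a position in the corpus.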
The purpose is to build rules for the whole corpus that change the initial tag into the correct tag. See http://en.wikipedia.org/wiki/Brill_tagger.
The hash I want to build has predicates as keys and, as values, the locations in the corpus where they apply.
A predicate is a sequence of integers that can be applied to a particular location in the corpus. That is to say: if the sequence of integers can be matched with the sequence of integers at a particular location in the corpus, then I create an entry in the hash with as key the predicate and as value the location.
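Roughly, building that hash could look like the sketch below. It assumes, purely for illustration, that a predicate is the pair (initial tag of the previous token, initial tag of the current token); the real predicates may be built from other token elements. The names `predicate_at` and `%locations_for` are hypothetical.

```perl
use strict;
use warnings;

# Toy corpus: each token is [ word, correct_tag, initial_tag ], all integers.
my @corpus = (
    [ [ 12, 3, 3 ], [ 45, 7, 5 ], [ 9, 2, 2 ] ],    # sentence 0
    [ [ 12, 3, 3 ], [ 45, 7, 5 ] ],                 # sentence 1
);

# Assumption: a predicate is the initial tag of the previous token
# followed by the initial tag of the current token (index 2 of each token).
sub predicate_at {
    my ( $corpus, $s, $t ) = @_;
    return [ map { $corpus->[$s][$_][2] } $t - 1, $t ];
}

# Key: the predicate joined into a string. Value: list of [ sentence, token ] locations.
my %locations_for;
for my $s ( 0 .. $#corpus ) {
    for my $t ( 1 .. $#{ $corpus[$s] } ) {    # start at 1: token 0 has no predecessor
        my $key = join ' ', @{ predicate_at( \@corpus, $s, $t ) };
        push @{ $locations_for{$key} }, [ $s, $t ];
    }
}

for my $key ( sort keys %locations_for ) {
    printf "%s => %d location(s)\n", $key, scalar @{ $locations_for{$key} };
}
```

The `push @{ $locations_for{$key} }` line relies on autovivification, so the first match for a predicate creates its entry and later matches append to it.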
Currently the predicate is turned into a string (join ' ', @$predicate), and that string is used as the hash key. The problem is that further down in the program I need to split this key back into its elements to see whether it matches elsewhere in the corpus.
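For reference, the round trip looks roughly like this. The last two lines show one possible way around the split, namely remembering the original array reference under the same key; that is only an idea to illustrate the problem, not what the program currently does, and `%predicate_for` is a made-up name.

```perl
use strict;
use warnings;

my $predicate = [ 3, 5, 12 ];

# What happens now: flatten the predicate into a string for use as a hash key...
my $key = join ' ', @$predicate;    # "3 5 12"

# ...and split it back later to compare it against the corpus.
# The pieces come back as strings, but they numify cleanly.
my @elements = split ' ', $key;

# Possible alternative (an assumption, not the current code):
# keep the original array ref under the same key, so no parsing is needed.
my %predicate_for = ( $key => $predicate );
my $same = $predicate_for{$key};
```

The second hash costs some extra memory per distinct predicate, so whether it pays off depends on how often each key has to be split.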
The process can be very time (and memory) consuming, so I'm trying to speed it up a little.