There's more than one way to do things | |
PerlMonks |
comment on |
( [id://3333]=superdoc: print w/replies, xml ) | Need Help?? |
There are as I see it two scalability issues in the above code. The first of these has been addressed well by the various suggestions to use a sliding window ($i...$i+$n on the array of words). The second is the keys themselves. You are currently using the actual words in each phrase as a key. This means that searching for all sequences of N-M words in a text with K word characters (e.g. alpha-numerics) could concievably take up (N+N+1+...+M)*K characters for its keys alone. The actual amount depends on the frequency of each particular word runs: "a a a ..." will obviously use less space than "a b c d e..." or "a..z b..za c..zab ...". If you intend to make either N, the range from N...M or K large, you might want to consider assigning each word in the text a integer id and composing your key out of the integers rather than the word strings as you are doing now. Keys composed out of numerical indexes would save space and possibly search time. In pseudocode, this would work something like this:
Best, beth In reply to Re: extract phrases of n-words length
by ELISHEVA
|
|