
There are, as I see it, two scalability issues in the above code. The first of these has been addressed well by the various suggestions to use a sliding window ($i...$i+$n on the array of words).

The second is the keys themselves. You are currently using the actual words in each phrase as a key. This means that searching for all sequences of N to M words in a text with K word characters (e.g. alphanumerics) could conceivably take up (N+(N+1)+...+M)*K characters for its keys alone, since each word can turn up in that many distinct phrases. The actual amount depends on how often each particular run of words repeats: "a a a ..." will obviously use less space than "a b c d e..." or "a..z b..za c..zab ...".
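To get a feel for how much the repetition matters, here is a minimal sketch that adds up the characters held by the distinct word-string keys for a small word list. The helper name key_chars and the sample word lists are mine, not from the original code, and the count includes the single space used to join the words, so it runs slightly above a pure count of word characters.

```perl
use strict;
use warnings;

# Hypothetical helper: total characters held by the distinct
# word-string keys for all phrases of $n..$m words.
sub key_chars {
    my ($n, $m, @words) = @_;
    my %seen;
    for my $len ($n .. $m) {
        # every start position that leaves room for $len words
        for my $i (0 .. @words - $len) {
            $seen{ join ' ', @words[$i .. $i + $len - 1] } = 1;
        }
    }
    my $total = 0;
    $total += length for keys %seen;
    return $total;
}

# Same number of words, very different key space.
my $repetitive = key_chars(2, 3, qw(a a a a a a));
my $varied     = key_chars(2, 3, qw(a b c d e f));
print "repetitive: $repetitive, varied: $varied\n";
# prints "repetitive: 8, varied: 35"
```

With real text the result lands somewhere between these two extremes, which is why the formula above is only an upper bound.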

If you intend to make either N, the range from N...M or K large, you might want to consider assigning each word in the text an integer id and composing your key out of the integers rather than the word strings as you are doing now. Keys composed out of numerical indexes would save space and possibly search time.

A sketch of how this would work:

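use strict;
use warnings;

my ($n, $m) = (2, 3);          # phrase lengths N..M (example values)
my @aWords = qw(a b a b c);    # words from the abstract (sample data)
my %hWords;                    # maps words to their id
my $iUniqueWordsSoFar = 0;     # cheap way to assign ids
my @aIds;                      # sliding window with ids of last M words
my %hPhrases;                  # maps id key, e.g. "1,2,1", to its count

for my $sWord (@aWords) {
   # look up/assign word id
   my $iWord = $hWords{$sWord} //= ++$iUniqueWordsSoFar;

   # update sliding window of ids for last M words
   push @aIds, $iWord;
   shift @aIds if @aIds > $m;
   next if @aIds < $m;         # window not full yet

   # add key to hash for N..M length phrases by taking
   # first X elements of sliding window to construct
   # the key.
   $hPhrases{ join ',', @aIds[0 .. $_ - 1] }++ for $n .. $m;
}

# final pass: convert what is left in @aIds to keys and
# update appropriate phrase hashes. If the window is full,
# the phrases starting at its first id are already counted.
shift @aIds if @aIds == $m;
while (@aIds >= $n) {
   $hPhrases{ join ',', @aIds[0 .. $_ - 1] }++ for $n .. scalar @aIds;
   shift @aIds;
}

print "$_ => $hPhrases{$_}\n" for sort keys %hPhrases;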
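One consequence of id-based keys is that they have to be turned back into words before the results can be shown to anyone. A minimal sketch of that step, assuming the keys are comma-joined ids (the separator is my choice; nothing above fixes one) and using made-up sample data and a hypothetical helper named phrase_for_key:

```perl
use strict;
use warnings;

# Assumed sample data: a word => id map and phrase counts keyed by
# comma-joined ids, as they might look after the counting pass.
my %hWords   = (extract => 1, phrases => 2, of => 3);
my %hPhrases = ('1,2' => 2, '1,2,3' => 1);

# Invert the map once: index is the id, value is the word.
my @aWordById;
$aWordById[ $hWords{$_} ] = $_ for keys %hWords;

# Hypothetical helper: turn "1,2,3" back into "extract phrases of".
sub phrase_for_key {
    my ($sKey) = @_;
    return join ' ', map { $aWordById[$_] } split /,/, $sKey;
}

for my $sKey (sort keys %hPhrases) {
    printf "%-20s %d\n", phrase_for_key($sKey), $hPhrases{$sKey};
}
```

Since ids are handed out sequentially from 1, a plain array is enough for the reverse lookup; element 0 simply stays unused.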

Best, beth

In reply to Re: extract phrases of n-words length by ELISHEVA
in thread extract phrases of n-words length by arun_kom


