Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

There are as I see it two scalability issues in the above code. The first of these has been addressed well by the various suggestions to use a sliding window ($i...$i+$n on the array of words).

The second is the keys themselves. You are currently using the actual words in each phrase as a key. This means that searching for all sequences of N-M words in a text with K word characters (e.g. alpha-numerics) could concievably take up (N+N+1+...+M)*K characters for its keys alone. The actual amount depends on the frequency of each particular word runs: "a a a ..." will obviously use less space than "a b c d e..." or "a..z b..za c..zab ...".

If you intend to make either N, the range from N...M or K large, you might want to consider assigning each word in the text a integer id and composing your key out of the integers rather than the word strings as you are doing now. Keys composed out of numerical indexes would save space and possibly search time.

In pseudocode, this would work something like this:

my @aWords; # words from the abstract my %hWords; # maps words to their id my $iUniqueWordsSoFar; # cheap way to assign ids my @aIds; # sliding window with ids from last N-M words for (0..($#aWords-$n)) { #look up/assign word id my $sWord = $aWords[$_]; my $iWord; if (exists($hWords{$sWord})) { $iWord = $hWords{$sWord}; } else { $iWord = $hWords{$sWord} = ++$iUniqueWordsSoFar; } # update sliding window of ids for last M words shift(@aIds) if scalar(@aIds); push @aIds, $iWord; # add key to hash for N..M length phrases by taking # first X elements of sliding window to construct # the key. } # final pass: convert what is left in @aIds to keys and # update appropriate phrase hashes.

Best, beth


In reply to Re: extract phrases of n-words length by ELISHEVA
in thread extract phrases of n-words length by arun_kom

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (6)
As of 2024-04-23 20:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found