PerlMonks  


I see that better minds have not responded, and they usually do sooner than this, so I'll take a stab at it.

So, taking term weight to mean simply the ratio of the term frequency to the number of tokens in the text, basically you need to:

  1. Stem all the words in the files and save them in a hash (%words), with the tokens as keys and their stems as values.
  2. Count the total number of stemmed tokens, then calculate, for each stem, the ratio of its instances to that total.
  3. Make a hash of arrays (%stems) with the stems as keys and, as array values, the tokens that share each stem plus the stem's relative frequency.
  4. Combine %words and %stems on the stems into a new hash of arrays, so that each token in the text maps to an array holding its stem and the term weight.
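
The four steps above can be sketched roughly as follows. This is only a minimal illustration under my reading of "term weight", not a definitive implementation. The names `naive_stem` and `term_weights` are made up for this example, the two hashes from steps 1 and 3 are written as `%words` and `%stems`, and the suffix-stripping "stemmer" is a crude stand-in that you would replace with a real stemming module from CPAN (Lingua::Stem, for instance).

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real stemmer: lowercases the token and strips
# a few common English suffixes. Assumption for illustration only.
sub naive_stem {
    my ($token) = @_;
    my $stem = lc $token;
    $stem =~ s/(?:ing|ed|es|s)$// if length($stem) > 4;
    return $stem;
}

# Returns a hash ref mapping each token to [ stem, term weight ], where the
# weight is (instances of the stem) / (total number of tokens in the text).
sub term_weights {
    my ($text) = @_;
    my @tokens = map { lc($_) } ( $text =~ /(\w+)/g );

    # Step 1: token => stem
    my %words;
    $words{$_} = naive_stem($_) for @tokens;

    # Step 2: count the instances of each stem and the overall total
    my %count;
    $count{ $words{$_} }++ for @tokens;
    my $total = scalar @tokens;

    # Step 3: stem => [ relative frequency, token, token, ... ]
    my %stems;
    for my $token ( sort keys %words ) {
        my $stem = $words{$token};
        $stems{$stem} ||= [ $count{$stem} / $total ];
        push @{ $stems{$stem} }, $token;
    }

    # Step 4: join the two hashes on the stem: token => [ stem, weight ]
    my %weighted;
    for my $token ( keys %words ) {
        my $stem = $words{$token};
        $weighted{$token} = [ $stem, $stems{$stem}[0] ];
    }
    return \%weighted;
}

# Usage: print every token with its stem and weight.
my $weights = term_weights("Counts counting counted the words");
for my $token ( sort keys %$weights ) {
    printf "%-10s stem=%-6s weight=%.2f\n", $token, @{ $weights->{$token} };
}
```

Steps 3 and 4 are kept separate here only to mirror the list. Since the per-stem counts are already available after step 2, the `%stems` hash could be dropped and the weights computed directly in the last loop.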

It would help us a lot if you posted your own ideas about what you need to do, and perhaps defined "term weight". I'm hoping you mean my interpretation above, which is pretty well accepted as a general definition in linguistic circles.

Maybe someone a little smarter and a little more awake than I am can come up with a way to combine a few of these steps, but it looks like you may have to add stem tags to the text in order to accomplish your goal.

--
Allolex


In reply to Re: term weight by allolex
in thread term weight by Anonymous Monk
