PerlMonks  

term weight

by Anonymous Monk
on Mar 04, 2003 at 23:05 UTC (#240476, perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone know how to find the term weight of each word in each document across five files? The program must filter words through a stoplist and calculate the term weight on stems (as determined by Lingua::Stem::En), not on words. I am totally stuck and hope someone can help me or give me some hints. Thanks.

Re: term weight
by allolex (Curate) on Mar 05, 2003 at 01:18 UTC

    I see that better minds have not responded, and they usually do sooner than this, so I'll take a stab at it.

    So, taking term weight to mean simply the ratio of the term frequency to the number of tokens in the text, basically you need to:

    1. Stem all the words in the files and save them in a hash (%words), using the tokens as the keys and the stems as the values.
    2. Count the total number of stems, then calculate the ratio of each term's instances (stemmed tokens) to that total.
    3. Make a hash of arrays (%stems) with the stems as the keys and the tokens and their relative frequency as the array values.
    4. Combine %words and %stems on the stems into a new hash of arrays, so that each token in the text has as its array values the stem itself and the term weight.
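
The four steps above can be sketched roughly as follows. This is only a minimal sketch under some assumptions: the names term_weights and lingua_stems are made up for illustration, the tokenizer is deliberately crude, the stoplist is assumed to be a hash of lowercase words, and steps 3 and 4 are folded into a single token => [stem, weight] structure. The stemmer is passed in as a code ref so that Lingua::Stem::En (or anything else) can be plugged in.

```perl
use strict;
use warnings;

# Hypothetical helper: stem a list of words with Lingua::Stem::En from CPAN.
# The module is loaded lazily, so the rest of the sketch runs without it.
sub lingua_stems {
    my @words = @_;
    require Lingua::Stem::En;
    my $stems = Lingua::Stem::En::stem({ -words => \@words, -locale => 'en' });
    return @$stems;
}

# Hypothetical function: term weights for ONE document.
#   $text     - document contents as a string
#   $stoplist - hash ref whose keys are lowercase stop words
#   $stemmer  - code ref mapping a list of words to a list of stems
# Returns a hash ref: token => [ stem, term weight of that stem ].
sub term_weights {
    my ($text, $stoplist, $stemmer) = @_;
    $stemmer ||= \&lingua_stems;

    # Crude tokenizer (an assumption): lowercase alphabetic words,
    # with stop words dropped before stemming.
    my @tokens = grep { !$stoplist->{$_} }
                 map  { lc($_) } ($text =~ /\b[a-z]+(?:'[a-z]+)?\b/gi);
    return {} unless @tokens;

    # Step 1: stem every token; %words maps token => stem.
    my @stems = $stemmer->(@tokens);
    my %words;
    @words{@tokens} = @stems;

    # Step 2: count each stem and divide by the total number of stems.
    my $total = @stems;
    my %count;
    $count{$_}++ for @stems;
    my %weight = map { $_ => $count{$_} / $total } keys %count;

    # Steps 3 and 4, folded together: token => [ stem, weight ].
    my %combined = map { $_ => [ $words{$_}, $weight{ $words{$_} } ] } keys %words;
    return \%combined;
}
```

Calling term_weights once per file (after slurping each file into a string) gives a separate set of weights for each of the five documents. Leaving out the third argument falls back to lingua_stems, which needs Lingua::Stem::En installed.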

    It would help us a lot if you posted your own ideas about what you need to do, and perhaps defined "term weight". I'm hoping you mean my interpretation above, which is pretty well accepted as a general definition in linguistic circles.

    Maybe someone a little smarter and a little more awake than I am can come up with a way to combine a few of these steps, but it looks like you may have to add stem tags to the text in order to accomplish your goal.

    --
    Allolex

Re: term weight
by rob_au (Abbot) on Mar 05, 2003 at 02:12 UTC
    The previous reply from allolex offers some very good advice on how to approach this problem. In addition to the direction offered in that reply, you may want to have a look at the Perlfect search engine, which is written in Perl and implements a very basic stem indexing method.

    There has previously been a discussion on stemming from the perspective of stemming errors at Natural Language Index Stemming.

     

    perl -le 'print+unpack("N",pack("B32","00000000000000000000001000111001"))'
