<?xml version="1.0" encoding="windows-1252"?>
<node id="743338" title="arabic alphabet ... how to deal with?" created="2009-02-12 10:13:11" updated="2009-02-12 10:13:11">
<type id="115">
perlquestion</type>
<author id="961">
Anonymous Monk</author>
<data>
<field name="doctext">
Dear monks,

I have a Persian text and a list of stop words. I would like to remove all the stop words from the text, but the result is not satisfactory. Here is my code:

&lt;code&gt;
open (STOPWORDS, $ARGV[1]) || die "Error opening the stopwords file\n";
$count = 0;
while ($word = &lt;STOPWORDS&gt;)
{
	chomp($word);	# chomp removes only the line ending; chop removes the last character whatever it is
	$stopword[$count] = lc($word);
	$count++;
}
close(STOPWORDS);

open (INFILE, $ARGV[0]) || die "Error opening the input file\n";
while ($line = &lt;INFILE&gt;)
{
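    chomp($line);	# strip only the line ending
    @entry = split(' ', $line);	# split on runs of whitespace, so no empty tokens
    $i = 0;
    while ($i &lt; @entry)	# visit every token, even an empty one or "0"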
    {
		$found = 0;
		$j = 0;
		while (($j&lt;$count) &amp;&amp; ($found==0))	# valid indexes are 0 .. $count-1
		{
			if (lc($entry[$i]) eq $stopword[$j])
			{
				$found = 1;
			}
			$j++;
		}
		if ($found == 0)
		{
			print "$entry[$i]\n";
		}
		$i++;
    }
}
close(INFILE);
&lt;/code&gt;

I can't post a sample of my stop word list because the characters do not display properly here. The list has one word per line, and my input text is not tokenized; it is just raw Unicode text. Any idea how I can make this work?
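
A minimal sketch of one possible direction, assuming both files are UTF-8 encoded (an assumption; they could be in another encoding such as Windows-1256): decode every line from bytes into characters before comparing, strip a byte order mark and any line ending, and look tokens up in a hash instead of scanning an array. The helper names load_stopwords and remove_stopwords are hypothetical, and both take raw, undecoded lines as they would come from the two files.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: build a lookup hash from raw UTF-8 stop word lines.
sub load_stopwords {
    my @raw_lines = @_;
    my %stop;
    for my $bytes (@raw_lines) {
        my $word = decode('UTF-8', $bytes);   # bytes to characters (assumes UTF-8)
        $word =~ s/^\x{FEFF}//;               # drop a byte order mark, if present
        $word =~ s/\s+\z//;                   # drop "\n", "\r\n" and trailing blanks
        $stop{lc $word} = 1 if length $word;  # lc is safe once the text is decoded
    }
    return \%stop;
}

# Hypothetical helper: return the tokens of one raw UTF-8 line
# that are not in the stop word hash.
sub remove_stopwords {
    my ($stop, $bytes) = @_;
    my $line = decode('UTF-8', $bytes);
    return grep { !$stop->{lc $_} } split ' ', $line;
}
```

The returned tokens are character strings, so they would need an output layer such as binmode(STDOUT, ':encoding(UTF-8)') before being printed.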

Thanks in advance.</field>
<field name="reputation">
2</field>
</data>
</node>
