Re: arabic alphabet ... how to deal with?
by graff (Chancellor) on Feb 13, 2009 at 04:14 UTC
I agree strongly with everything moritz said. Also, I'm surprised that no one has recommended that you use a hash for your stop-word list, because that will make things a lot simpler (and speed things up as well). And I'm baffled by the fact that you are using the "lc()" function, considering that case distinctions do not exist at all in Arabic letters.
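To illustrate the hash suggestion, here is a minimal sketch with a hypothetical stop-word list (real code would read the list from a file, as the script below does):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stop-word list; real code would read it from a file.
my @stoplist = qw(the a of);

# Hash membership is a single O(1) lookup instead of a scan over an array.
my %stopword = map { $_ => 1 } @stoplist;

my @words = qw(the quick fox of doom);
my @kept  = grep { !exists $stopword{$_} } @words;

print "@kept\n";   # quick fox doom
```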
You should look around for some tools that will help you get better acquainted with your data files and the characters they're made of. If the stop word list and data files are really utf8, you might want to run them through a couple of scripts that have been posted here at the monastery: tlu -- TransLiterate Unicode and unichist -- count/summarize characters in data. They can help you check whether the files are really utf8, whether they are really comparable, and whether they have any strange, unexpected properties.
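In the spirit of unichist (this is not that script itself), even a few lines that count codepoints can reveal surprises such as invisible marks. A literal string of Arabic letters stands in here for real file data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A character histogram in the spirit of unichist (not that script itself).
# Real code would read the text through an :encoding(utf8) layer; here a
# literal string of Arabic letters and diacritics stands in for file data.
my $text = "\x{0641}\x{064F}\x{0648} \x{0628}\x{064E}\x{0631}";

my %freq;
$freq{$_}++ for split //, $text;

# Print each codepoint and its count, e.g. "U+0641 1".
for my $ch ( sort keys %freq ) {
    printf "U+%04X %d\n", ord($ch), $freq{$ch};
}
```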
Anyway, try this out, and if "it doesn't work", you will need to explain and/or show evidence to clarify exactly how it fails. I'm including some test data, and you should be able to run it yourself in "test mode" to confirm that it works on the test data, so if it doesn't work on your stop list and data file, the problem is with your data (or your misunderstanding of the data), not with the code. (The stop list and text data I provided for testing are pure nonsense, of course, because I don't know any Persian, but it's all in utf8 Arabic letters and diacritics, and even some punctuation. And since the PM code tags force wide characters into numeric entities, I added some code to convert the test data back into utf8 characters.)
#!/usr/bin/perl
=head1 NAME
stopword-filter
=head1 SYNOPSIS
stopword-filter [-e encoding] stop.list text.file
stopword-filter -t # (runs a simple test on internal utf8 data)
=head1 DESCRIPTION
The stop.list file should contain a set of white-space-separated words
that should be removed from the text file. The remaining words in the
text file (after splitting on non-letter/non-mark characters and removing
stop words) will be printed to STDOUT, one word per line.
The two files need to have the same character encoding, and STDOUT
will be in that same encoding. The default encoding is utf8.
=cut
use strict;
use warnings;
use Getopt::Std;
my %opt;
my $Usage = "Usage: $0 -t # (to test)\n or: $0 [-e enc] stop.list text.file\n";
getopts( 'e:t', \%opt ) and ( $opt{t} || @ARGV == 2 ) or die $Usage;
my ( $stoptext, $textdata );
my $enc = $opt{e} || 'utf8';
binmode STDOUT, ":encoding($enc)";
if ( $opt{t} ) {
    local $/ = "";    # empty string = "paragraph mode" for reading
    binmode DATA, ":encoding($enc)";
    $stoptext = <DATA>;
    $textdata = <DATA>;
    if ( $stoptext =~ /\&#\d+;/ ) {    # posting code on PM does this to data
        s/\&#(\d+);/chr($1)/eg for ( $stoptext, $textdata );
    }    # so turn numeric character entities back into utf8 characters
}
else {
    local $/;    # undef = "slurp mode" for reading
    open( STOP, "<:encoding($enc)", $ARGV[0] )
        or die "open failed for stoplist $ARGV[0]: $!\n";
    $stoptext = <STOP>;
    close STOP;
    open( TEXT, "<:encoding($enc)", $ARGV[1] )
        or die "open failed for textdata $ARGV[1]: $!\n";
    $textdata = <TEXT>;
    close TEXT;
}
my %stopword = map { $_ => undef } ( split ' ', $stoptext );
for my $word ( split /[^\pL\pM]+/, $textdata ) {
    next if exists $stopword{$word};
    print "$word\n";
}
__DATA__
فُو
بَر
بَز
فَلُزِن برل;كو، فُو تِد616;ّلِي بَر. سُ;كُون
بَز مَلرِيœ7; فُو! بَر، نَد بَز مِس.
Update: This ought to work in any language that uses white space and punctuation to separate words, and it should work for any input encoding, provided that (a) the stop word list and text data are in the same language and the same encoding, (b) you know how to identify the encoding, and (c) Encode supports it (a great many encodings are supported, including all the Arabic ones).
Thanks a lot, this was indeed very helpful ... I was able to solve my problem ... Thanks again ...
Can I ask: I have a Unicode string; how do I make it appear in Arabic alphabet form?
Re: arabic alphabet ... how to deal with?
by kennethk (Abbot) on Feb 12, 2009 at 16:41 UTC
Read through perlunicode. All your I/O operations need to be performed in UTF-8. That means not only open (STOPWORDS, '<:encoding(UTF-8)', $ARGV[1]) as ForgotPasswordAgain suggests and open (INFILE, '<:encoding(UTF-8)', $ARGV[0]) as derby suggests, but also binmode STDOUT, ":encoding(utf8)" before you try to print. The fact that it works with "standard" text means it is almost certainly a Unicode problem.
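A compact sketch of the input and output layers working together; an in-memory file stands in for a real data file here, so the filename handling is omitted:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

binmode STDOUT, ':encoding(UTF-8)';   # output layer, as suggested above

# Simulate reading a UTF-8 file; real code would open a file on disk with
# the same '<:encoding(UTF-8)' layer.
my $bytes = encode( 'UTF-8', "\x{0633}\x{0644}\x{0627}\x{0645}\n" );
open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# With the layer in place, $line holds 4 wide characters, not 8 raw bytes.
print length($line), "\n";   # 4
```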
I tried it this way before as well; this way there is no output ;)
#!/usr/bin/perl
open (STOPWORDS, '<:encoding(UTF-8)', $ARGV[1]) || die "Error opening the stopwords file\n";
$count = 0;
while ($word = <STOPWORDS>)
{
    chop($word);
    $stopword[$count] = lc($word);
    $count++;
}
close(STOPWORDS);
open (INFILE ,'<:encoding(UTF-8)', $ARGV[0]) || die "Error opening the input file\n";
while ($line = <INFILE>)
{
    chop($line);
    @entry = split(/ /, $line);
    $i = 0;
    while ($entry[$i])
    {
        $found = 0;
        $j = 0;
        while (($j<=$count) && ($found==0))
        {
            if (lc($entry[$i]) eq $stopword[$j])
            {
                $found = 1;
            }
            $j++;
        }
        if ($found == 0)
        {
            print FH "$entry[$i]\n";
        }
        $i++;
    }
}
close(INFILE);
|
In this case, you have an orphaned file handle FH which is never associated with a file or channel.
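One possible repair: printing to a bareword handle that was never opened just warns and drops the output, so FH needs an explicit open (with the same encoding layer as the inputs) or the print should go to STDOUT. A minimal sketch, using an in-memory file in place of a real output file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An in-memory file stands in here for a real output file; with a file on
# disk you would also add an '>:encoding(UTF-8)' layer.
my $buf = '';
open my $out, '>', \$buf or die "open: $!";
print {$out} "entry\n";    # what the loop's  print FH ...  meant to do
close $out;

print $buf;   # entry
```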
use Devel::Peek;
...
Dump lc($entry[$i]);
Dump $stopword[$j];
if (lc($entry[$i]) eq $stopword[$j])
{
...
Re: arabic alphabet ... how to deal with?
by moritz (Cardinal) on Feb 12, 2009 at 18:28 UTC
There are several possibilities for what can go wrong:
- Character encoding: Others have commented on this, so I'll keep it brief: do you know exactly which character encodings your two text files use? If so, is that source of information reliable? You can't work with text if you don't know how it's encoded, so you need to know for sure.
- Data format: Are you sure that all files have the same line endings? And that they don't contain any non-printable characters that make your comparisons go awry? Arabic text may well contain bidi markers that your text editor doesn't show but that confuse your comparisons in Perl.
- Normalization: If I'm informed correctly, the Arabic script makes heavy use of diacritic marks. That means many characters have multiple representations (precomposed, or a base character plus separate diacritics). In that case Unicode::Normalize can help you.
- Cultural misunderstanding: If you don't know the script and the language, things that look identical to you may actually differ in small details, and the words you want to remove may not actually appear in your input data.
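The normalization point can be seen in a few lines with Unicode::Normalize (a core module). The composed/decomposed pair below is Latin for visibility; Arabic base-plus-diacritic sequences behave the same way:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Two spellings of the same text: one precomposed codepoint vs. a base
# character plus a combining mark.
my $composed   = "\x{00E9}";       # e-acute as a single codepoint
my $decomposed = "e\x{0301}";      # 'e' + COMBINING ACUTE ACCENT

print $composed eq $decomposed           ? "eq\n" : "ne\n";   # ne
print NFC($composed) eq NFC($decomposed) ? "eq\n" : "ne\n";   # eq
```

Normalizing both the stop list and the text data to the same form (NFC here) before comparing removes this class of mismatch.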
Re: arabic alphabet ... how to deal with?
by ForgotPasswordAgain (Priest) on Feb 12, 2009 at 15:49 UTC
open (STOPWORDS, '<:encoding(UTF-8)', $ARGV[1])
I did, but it still does not work ...
What about the infile? That needs to be opened UTF-8 as well.
Re: arabic alphabet ... how to deal with?
by JavaFan (Canon) on Feb 12, 2009 at 15:17 UTC
Perhaps you can explain what it does, and what you want it to do. Right now it's just a bunch of code and "it doesn't work". I'm not going to spend time inventing sample data to see what it does and, from that, deduce what you want.
My script is supposed to remove all the stop words listed in a file (second argument) from the text (first argument) and output the filtered text, one word per line ...
Goodie. One step down. Several more to go. You get one more chance. What does your blob of code currently do?
Re: arabic alphabet ... how to deal with?
by Anonymous Monk on Feb 12, 2009 at 15:25 UTC
You probably want chomp, not chop, to remove a newline from your stop words.
It works well for English, with chop as well; the problem is with Arabic script (Farsi).
OK. As you say, that's not the problem you're having, but chomp will only remove newlines ($/), whereas chop removes the last character regardless of what it is.
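The difference in a few lines (a minimal sketch; note how chop eats real data when there is no trailing newline):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $with_nl = "word\n";
my $bare    = "word";

chomp( my $a = $with_nl );          # strips the newline -> "word"
chomp( my $b = $bare );             # no newline, so nothing removed

my $c = $with_nl; chop $c;          # also "word" here, by luck
my $d = $bare;    chop $d;          # "wor" -- chop ate real data

print "$a|$b|$c|$d\n";   # word|word|word|wor
```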