arabic alphabet ... how to deal with?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: arabic alphabet ... how to deal with? by graff (Chancellor) on Feb 13, 2009 at 04:14 UTC
I agree strongly with everything moritz said. Also, I'm surprised that no one has recommended that you use a hash for your stop-word list, because that will make things a lot simpler (and speed things up as well). And I'm baffled by the fact that you are using the "lc()" function, considering that case distinctions do not exist at all in Arabic letters.

You should look around for some tools that will help you get better acquainted with your data files, and the characters that they're made of. If the stop word list and data files are really utf8, you might want to run them through a couple scripts that have been posted here at the monastery: tlu -- TransLiterate Unicode and unichist -- count/summarize characters in data. They can help you to check whether the files are really utf8, whether they are really comparable, and whether they might have any strange, unexpected properties.

Anyway, try this out, and if "it doesn't work", you will need to explain and/or show evidence to clarify exactly how it fails. I'm including some test data, and you should be able to run it yourself in "test mode" to confirm that it works on the test data, so if it doesn't work on your stop list and data file, the problem is with your data (or your misunderstanding of the data), not with the code.

(The stop list and text data I provided for testing are pure nonsense, of course, because I don't know any Persian, but it's all in utf8 Arabic letters and diacritics, and even some punctuation. And since the PM code tags force wide characters into numeric entities, I added some code to convert the test data back into utf8 characters.)

```perl
#!/usr/bin/perl

=head1 NAME

stopword-filter

=head1 SYNOPSIS

 stopword-filter [-e encoding] stop.list text.file

 stopword-filter -t   # (runs a simple test on internal utf8 data)

=head1 DESCRIPTION

The stop.list file should contain a set of white-space-separated words
that should be removed from the text file.

The remaining words in the text file (after splitting on
non-letter/non-mark characters and removing stop words) will be
printed to STDOUT, one word per line.

The two files need to have the same character encoding, and STDOUT
will be in that same encoding.  The default encoding is utf8.

=cut

use strict;
use warnings;
use Getopt::Std;

my %opt;
my $Usage = "Usage: $0 -t  # (to test)\n  or: $0 [-e enc] stop.list text.file\n";

getopts( 'e:t', \%opt ) and ( $opt{t} || @ARGV == 2 ) or die $Usage;

my ( $stoptext, $textdata );
my $enc = $opt{e} || 'utf8';
binmode STDOUT, ":encoding($enc)";

if ( $opt{t} ) {
    local $/ = "";  # empty string = "paragraph mode" for reading
    binmode DATA, ":encoding($enc)";
    $stoptext = <DATA>;
    $textdata = <DATA>;
    if ( $stoptext =~ /\&#\d+;/ ) {  # posting code on PM does this to data
        s/\&#(\d+);/chr($1)/eg for ( $stoptext, $textdata );
    }  # so turn numeric character entities back into utf8 characters
}
else {
    local $/;  # undef = "slurp mode" for reading
    open( STOP, "<:encoding($enc)", $ARGV[0] )
        or die "open failed for stoplist $ARGV[0]: $!\n";
    $stoptext = <STOP>;
    close STOP;
    open( TEXT, "<:encoding($enc)", $ARGV[1] )
        or die "open failed for textdata $ARGV[1]: $!\n";
    $textdata = <TEXT>;
    close TEXT;
}

my %stopword = map { $_ => undef } ( split ' ', $stoptext );

for my $word ( split /[^\pL\pM]+/, $textdata ) {
    next if ( exists( $stopword{$word} ));
    print "$word\n";
}

__DATA__
فُو بَر بَز

فَلُزِن برلكو، فُو تِدِّلِي بَر. سُكُون بَز مَلرِي؟ فُو! بَر، نَد بَز مِس.
```

Update: This ought to work in any language that uses white-space and punctuation to separate words, and it should work for any input encoding, provided that (a) the stop word list and text data are in the same language and same encoding, (b) you know how to identify the encoding, and (c) Encode supports it (a lot of encodings are supported, including all the Arabic ones).
Re^2: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 13, 2009 at 13:49 UTC
Thanks a lot, this was indeed very, very helpful ... I could solve my problem ... Thanks again ...
Re^3: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2013 at 19:29 UTC
Can I ask: I have a Unicode string; how do I make it appear as Arabic letters?
Re: arabic alphabet ... how to deal with? by kennethk (Abbot) on Feb 12, 2009 at 16:41 UTC
Read through perlunicode. All your I/O operations need to be performed in UTF-8. That means not only `open (STOPWORDS, '<:encoding(UTF-8)', $ARGV[1])` as ForgotPasswordAgain suggests and `open (INFILE, '<:encoding(UTF-8)', $ARGV[0])` as derby suggests, but also `binmode STDOUT, ":encoding(utf8)"` before you try to print. The fact that it works with "standard" text says it is almost guaranteed to be a Unicode problem.
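To make the advice above concrete, here is a minimal sketch of decoding on input and encoding on output. It assumes both files really are UTF-8 and follows the argument order of the original script (text file first, stop list second); the helper name `read_words` is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read whitespace-separated words from a file,
# decoding its bytes as UTF-8 while reading.
sub read_words {
    my ($path) = @_;
    open( my $fh, '<:encoding(UTF-8)', $path )
        or die "open failed for $path: $!\n";
    local $/;    # undef = slurp the whole file
    my $text = <$fh>;
    close $fh;
    return split ' ', $text;
}

# Encode output as UTF-8 as well; without this, printing decoded
# Arabic text triggers "Wide character" warnings.
binmode STDOUT, ':encoding(UTF-8)';

if ( @ARGV == 2 ) {
    my %stop = map { $_ => 1 } read_words( $ARGV[1] );    # stop list
    print "$_\n" for grep { !$stop{$_} } read_words( $ARGV[0] );
}
```

The point of the sketch is only that every handle carrying text gets an encoding layer, so all comparisons happen on decoded characters rather than raw bytes.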
Re^2: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2009 at 16:53 UTC
I tried it this way as well before; this way there is no output ;)

```perl
#!/usr/bin/perl

open (STOPWORDS, '<:encoding(UTF-8)', $ARGV[1]) || die "Error opening the stopwords file\n";
$count = 0;
while ($word = <STOPWORDS>) {
    chop($word);
    $stopword[$count] = lc($word);
    $count++;
}
close(STOPWORDS);

open (INFILE ,'<:encoding(UTF-8)', $ARGV[0]) || die "Error opening the input file\n";
while ($line = <INFILE>) {
    chop($line);
    @entry = split(/ /, $line);
    $i = 0;
    while ($entry[$i]) {
        $found = 0;
        $j = 0;
        while (($j<=$count) && ($found==0)) {
            if (lc($entry[$i]) eq $stopword[$j]) {
                $found = 1;
            }
            $j++;
        }
        if ($found == 0) {
            print FH "$entry[$i]\n";
        }
        $i++;
    }
}
close(INFILE);
```
Re^3: arabic alphabet ... how to deal with? by kennethk (Abbot) on Feb 12, 2009 at 17:22 UTC
In this case, you have an orphaned file handle `FH` which is never associated with a file or channel.
Re^4: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2009 at 17:41 UTC
Re^3: arabic alphabet ... how to deal with? by almut (Canon) on Feb 12, 2009 at 21:19 UTC
Use Devel::Peek to get an ASCII-printable representation of the strings you're comparing, and then verify that what you think should match is in fact identical:

```perl
use Devel::Peek;
...
Dump lc($entry[$i]);
Dump $stopword[$j];
if (lc($entry[$i]) eq $stopword[$j]) {
    ...
```
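If the Devel::Peek output is more detail than you need, a lighter-weight variant of the same check is to print each string as a list of code points. This is only a sketch: the helper `codepoints` and the two sample strings are invented for the example, and it assumes the strings have already been decoded from UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: render a decoded string as a list of code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf( 'U+%04X', ord $_ ) } split //, $str;
}

my $entry    = "\x{0628}\x{064E}\x{0631}";    # beh, fatha, reh
my $stopword = "\x{0628}\x{0631}";            # beh, reh (no diacritic)

print codepoints($entry),    "\n";    # U+0628 U+064E U+0631
print codepoints($stopword), "\n";    # U+0628 U+0631
print $entry eq $stopword ? "match\n" : "no match\n";    # no match
```

Two words that look alike on screen can differ by a single diacritic, and the code point list makes that visible in plain ASCII.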
Re: arabic alphabet ... how to deal with? by moritz (Cardinal) on Feb 12, 2009 at 18:28 UTC
There are some possibilities of what can go wrong:

- Character encoding. Others have commented on that, so I'll keep it brief: do you know exactly in which character encodings your two text files are? If so, is that source of information reliable? You can't work with text if you don't know how it's encoded, so you need to know for sure.
- Data format: Are you sure that all files have the same line endings, and that they don't contain any non-printable characters that make your comparisons go awry? For Arabic text it might well be that it contains bidi-markers that your text editor doesn't show, but Perl might become confused by them.
- Normalization: If I'm informed correctly, the Arabic script makes heavy use of diacritic marks. That means that many characters have multiple representations (pre-composed and separate base character/diacritics). In that case Unicode::Normalize can help you.
- Cultural misunderstanding: If you don't know the script and the language, it might be that things that look identical to you actually vary in small details, and the words you want to remove aren't actually part of your input data.
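The data-format and normalization points can be sketched together as one clean-up step applied to every word before comparison. This is an illustration under assumptions, not a complete solution: the helper name `clean_word` is made up, and the set of invisible characters it strips (directional marks, embedding/override/isolate controls, and a stray byte-order mark) is one plausible choice. It deliberately leaves the zero-width non-joiner alone, since that character is meaningful in Persian spelling.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: make decoded strings comparable by
# (1) removing invisible directional controls and a stray BOM, and
# (2) normalizing to one composed form (NFC).
sub clean_word {
    my ($word) = @_;
    $word =~ s/[\x{061C}\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}\x{FEFF}]//g;
    return NFC($word);
}

my $precomposed = "\x{0622}\x{0628}";                    # alef-with-madda, beh
my $decomposed  = "\x{0627}\x{0653}\x{0628}\x{200F}";    # alef, combining madda, beh, RLM

print $precomposed eq $decomposed
    ? "raw: equal\n" : "raw: different\n";               # raw: different
print clean_word($precomposed) eq clean_word($decomposed)
    ? "cleaned: equal\n" : "cleaned: different\n";       # cleaned: equal
```

Apply the same clean-up to both the stop list and the text, otherwise the two sides still won't compare equal.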
Re: arabic alphabet ... how to deal with? by ForgotPasswordAgain (Priest) on Feb 12, 2009 at 15:49 UTC
Did you try printing out the stop words? You probably have to read it in utf8 mode: `open (STOPWORDS, '<:encoding(UTF-8)', $ARGV[1])`
Re^2: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2009 at 15:54 UTC
I did, but it still does not work ...
Re^3: arabic alphabet ... how to deal with? by derby (Abbot) on Feb 12, 2009 at 16:19 UTC
What about the infile? That needs to be opened UTF-8 as well. -derby
Re^4: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2009 at 16:37 UTC
Re: arabic alphabet ... how to deal with? by JavaFan (Canon) on Feb 12, 2009 at 15:17 UTC
Perhaps you can explain what it does, and what you want it to do. Now it's just a bunch of code and "it doesn't work". I'm not going to spend time to invent some sample data to see what it does, and from that deduce what you want.
Re^2: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2009 at 15:20 UTC
My script is supposed to remove all stop words that I have in a file (second argument) from the text (first argument) and output the filtered text, one word per line ...
Re^3: arabic alphabet ... how to deal with? by JavaFan (Canon) on Feb 12, 2009 at 15:24 UTC
Goodie. One step down. Several more to go. You get one more chance. What does your blob of code currently do?
Re^4: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2009 at 15:31 UTC
Re: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2009 at 15:25 UTC
You probably want chomp, not chop, to remove a newline from your stop words.
Re^2: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2009 at 15:32 UTC
It works well for English, with chop as well; the problem is working with Arabic script (Farsi).
Re^3: arabic alphabet ... how to deal with? by Anonymous Monk on Feb 12, 2009 at 17:07 UTC
OK, as you say, that's not the problem you have, but chomp will only remove a trailing newline (strictly, the value of $/), whereas chop will remove the last character, whatever it is.
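A small sketch of the difference, using made-up sample lines: the second one imitates the last line of a file that has no trailing newline, which is where chop silently eats a real character. The helper name `trim_both` exists only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: apply chop and chomp to copies of the same line.
sub trim_both {
    my ($line) = @_;
    my ( $chopped, $chomped ) = ( $line, $line );
    chop $chopped;     # always removes the last character
    chomp $chomped;    # removes a trailing $/ (newline by default), if any
    return ( $chopped, $chomped );
}

for my $line ( "stop\n", "stop" ) {
    my ( $chopped, $chomped ) = trim_both($line);
    print "chop: '$chopped'  chomp: '$chomped'\n";
}
# chop: 'stop'  chomp: 'stop'
# chop: 'sto'  chomp: 'stop'
```

With a stop list whose last word has no newline after it, chop would store a truncated word that can never match.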

