
Re: arabic alphabet ... how to deal with?

by graff (Chancellor)
on Feb 13, 2009 at 04:14 UTC ( #743521=note )

in reply to arabic alphabet ... how to deal with?

I agree strongly with everything moritz said. Also, I'm surprised that no one has recommended that you use a hash for your stop-word list, because that will make things a lot simpler (and speed things up as well). And I'm baffled by the fact that you are using the "lc()" function, considering that case distinctions do not exist at all in the Arabic script.
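To illustrate the hash suggestion above: build the stop-word hash once, then each lookup is a constant-time `exists` test instead of a scan over the whole list. This is a minimal sketch; the sample words are invented, not from the thread.

```perl
#!/usr/bin/perl
# Sketch: a hash makes stop-word filtering a constant-time lookup.
# The word lists here are made-up examples.
use strict;
use warnings;

my @stoplist = qw( the of and );
my %stopword = map { $_ => 1 } @stoplist;   # build once

for my $word (qw( the quick fox and hound )) {
    next if exists $stopword{$word};        # O(1) lookup per word
    print "$word\n";                        # prints: quick, fox, hound
}
```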

You should look around for some tools that will help you get better acquainted with your data files and the characters they're made of. If the stop-word list and data files are really utf8, you might want to run them through a couple of scripts that have been posted here at the monastery: tlu -- TransLiterate Unicode and unichist -- count/summarize characters in data. They can help you check whether the files are really utf8, whether they are really comparable, and whether they have any strange, unexpected properties.
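If you just want a quick validity check rather than those full tools, Encode can do it: decode the raw bytes with the FB_CROAK check flag, which dies on the first malformed sequence instead of silently substituting replacement characters. A minimal sketch (the filename is whatever you pass on the command line; strict "UTF-8" is used here because it rejects more malformed input than lax "utf8"):

```perl
#!/usr/bin/perl
# Sketch: verify that a file really is valid UTF-8 before trusting it.
# FB_CROAK makes decode() die on malformed bytes rather than
# substituting U+FFFD silently.
use strict;
use warnings;
use Encode qw( decode FB_CROAK );

my $file = shift @ARGV or die "usage: $0 file\n";
open my $fh, '<:raw', $file or die "open $file: $!\n";
my $bytes = do { local $/; <$fh> };   # slurp raw octets
close $fh;

if ( eval { decode( 'UTF-8', $bytes, FB_CROAK ); 1 } ) {
    print "$file looks like valid UTF-8\n";
}
else {
    print "$file is NOT valid UTF-8: $@";
}
```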

Anyway, try this out, and if "it doesn't work", you will need to explain and/or show evidence to clarify exactly how it fails. I'm including some test data, and you should be able to run it yourself in "test mode" to confirm that it works on the test data, so if it doesn't work on your stop list and data file, the problem is with your data (or your misunderstanding of the data), not with the code. (The stop list and text data I provided for testing are pure nonsense, of course, because I don't know any Persian, but it's all in utf8 Arabic letters and diacritics, and even some punctuation. And since the PM code tags force wide characters into numeric entities, I added some code to convert the test data back into utf8 characters.)
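The entity round-trip mentioned above is a single substitution: PM code tags turn each wide character into a `&#NNNN;` numeric entity, and `chr()` turns it back. A tiny sketch with a made-up sample string:

```perl
#!/usr/bin/perl
# Sketch: undo the &#NNNN; entities that PM code tags produce.
# The sample string is invented (three Arabic code points).
use strict;
use warnings;
binmode STDOUT, ':encoding(utf8)';

my $posted = '&#1601;&#1615;&#1608;';          # what the code tags show
( my $restored = $posted ) =~ s/\&#(\d+);/chr($1)/eg;
printf "restored %d characters\n", length $restored;   # 3 characters
```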

#!/usr/bin/perl

=head1 NAME

stopword-filter

=head1 SYNOPSIS

 stopword-filter [-e encoding] stop.list text.file
 stopword-filter -t   # (runs a simple test on internal utf8 data)

=head1 DESCRIPTION

The stop.list file should contain a set of white-space-separated
words that should be removed from the text file. The remaining
words in the text file (after splitting on non-letter/non-mark
characters and removing stop words) will be printed to STDOUT,
one word per line. The two files need to have the same character
encoding, and STDOUT will be in that same encoding. The default
encoding is utf8.

=cut

use strict;
use warnings;
use Getopt::Std;

my %opt;
my $Usage = "Usage: $0 -t # (to test)\n or: $0 [-e enc] stop.list text.file\n";
getopts( 'e:t', \%opt ) and ( $opt{t} || @ARGV == 2 ) or die $Usage;

my ( $stoptext, $textdata );
my $enc = $opt{e} || 'utf8';
binmode STDOUT, ":encoding($enc)";

if ( $opt{t} ) {
    local $/ = "";    # empty string = "paragraph mode" for reading
    binmode DATA, ":encoding($enc)";
    $stoptext = <DATA>;
    $textdata = <DATA>;
    if ( $stoptext =~ /\&#\d+;/ ) {    # posting code on PM does this to data
        s/\&#(\d+);/chr($1)/eg for ( $stoptext, $textdata );
    }    # so turn numeric character entities back into utf8 characters
}
else {
    local $/;    # undef = "slurp mode" for reading
    open( STOP, "<:encoding($enc)", $ARGV[0] )
      or die "open failed for stoplist $ARGV[0]: $!\n";
    $stoptext = <STOP>;
    close STOP;
    open( TEXT, "<:encoding($enc)", $ARGV[1] )
      or die "open failed for textdata $ARGV[1]: $!\n";
    $textdata = <TEXT>;
    close TEXT;
}

my %stopword = map { $_ => undef } ( split ' ', $stoptext );

for my $word ( split /[^\pL\pM]+/, $textdata ) {
    next if ( exists( $stopword{$word} ) );
    print "$word\n";
}

__DATA__
&#1601;&#1615;&#1608; &#1576;&#1614;&#1585; &#1576;&#1614;&#1586;

&#1601;&#1614;&#1604;&#1615;&#1586;&#1616;&#1606; &#1576;&#1585;&#1604;&#1603;&#1608;&#1548; &#1601;&#1615;&#1608; &#1578;&#1616;&#1583;&#1616;&#1617;&#1604;&#1616;&#1610; &#1576;&#1614;&#1585;. &#1587;&#1615;&#1603;&#1615;&#1608;&#1606; &#1576;&#1614;&#1586; &#1605;&#1614;&#1604;&#1585;&#1616;&#1610;&#1567; &#1601;&#1615;&#1608;! &#1576;&#1614;&#1585;&#1548; &#1606;&#1614;&#1583; &#1576;&#1614;&#1586; &#1605;&#1616;&#1587;.

Update: This ought to work in any language that uses white-space and punctuation to separate words, and it should work for any input encoding, provided that (a) the stop word list and text data are in the same language and same encoding, (b) you know how to identify the encoding, and (c) Encode supports it (a lot of encodings are supported, including all the Arabic ones).
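If you're not sure what to pass as the encoding name, you can ask Encode itself which encodings your perl knows about. A small sketch that filters the full list down to the usual Arabic-capable names (iso-8859-6 and cp1256 are the common ones; the exact list depends on your installed Encode version):

```perl
#!/usr/bin/perl
# Sketch: list the Arabic-capable encodings known to this perl.
# Encode->encodings(':all') loads every encoding module it can find.
use strict;
use warnings;
use Encode;

for my $enc ( Encode->encodings(':all') ) {
    print "$enc\n" if $enc =~ /8859-6|1256|arabic/i;
}
```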

Re^2: arabic alphabet ... how to deal with?
by Anonymous Monk on Feb 13, 2009 at 13:49 UTC
    Thanks a lot, this was indeed very, very helpful ... I could solve my problem ... Thanks again ...
      Can I ask: I have a Unicode string; how do I make it appear in Arabic-alphabet form?
