
arabic alphabet ... how to deal with?

by Anonymous Monk
on Feb 12, 2009 at 15:13 UTC ( #743338=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I have a Persian text and a list of stop words. I would like to remove all stop words, but the result is not satisfactory. Here is my code:

    open (STOPWORDS, $ARGV[1]) || die "Error opening the stopwords file\n";
    $count = 0;
    while ($word = <STOPWORDS>) {
        chop($word);
        $stopword[$count] = lc($word);
        $count++;
    }
    close(STOPWORDS);

    open (INFILE, $ARGV[0]) || die "Error opening the input file\n";
    while ($line = <INFILE>) {
        chop($line);
        @entry = split(/ /, $line);
        $i = 0;
        while ($entry[$i]) {
            $found = 0;
            $j = 0;
            while (($j<=$count) && ($found==0)) {
                if (lc($entry[$i]) eq $stopword[$j]) {
                    $found = 1;
                }
                $j++;
            }
            if ($found == 0) {
                print "$entry[$i]\n";
            }
            $i++;
        }
    }
    close(INFILE);

I can't post a sample of my stop word list since it doesn't display here. It's one word per line, and my input text is not tokenized; it's just raw Unicode text. Any idea how I can make it work? Thanks in advance.

Replies are listed 'Best First'.
Re: arabic alphabet ... how to deal with?
by graff (Chancellor) on Feb 13, 2009 at 04:14 UTC
    I agree strongly with everything moritz said. Also, I'm surprised that no one has recommended that you use a hash for your stop-word list, because that will make things a lot simpler (and speed things up as well). And I'm baffled by the fact that you are using the "lc()" function, considering that case distinctions do not exist at all in Arabic letters.

    You should look around for some tools that will help you get better acquainted with your data files, and the characters that they're made of. If the stop word list and data files are really utf8, you might want to run them through a couple scripts that have been posted here at the monastery: tlu -- TransLiterate Unicode and unichist -- count/summarize characters in data. They can help you to check whether the files are really utf8, whether they are really comparable, and whether they might have any strange, unexpected properties.

    Anyway, try this out, and if "it doesn't work", you will need to explain and/or show evidence to clarify exactly how it fails. I'm including some test data, and you should be able to run it yourself in "test mode" to confirm that it works on the test data, so if it doesn't work on your stop list and data file, the problem is with your data (or your misunderstanding of the data), not with the code. (The stop list and text data I provided for testing are pure nonsense, of course, because I don't know any Persian, but it's all in utf8 Arabic letters and diacritics, and even some punctuation. And since the PM code tags force wide characters into numeric entities, I added some code to convert the test data back into utf8 characters.)

    #!/usr/bin/perl

    =head1 NAME

    stopword-filter

    =head1 SYNOPSIS

     stopword-filter [-e encoding] stop.list text.file

     stopword-filter -t   # (runs a simple test on internal utf8 data)

    =head1 DESCRIPTION

    The stop.list file should contain a set of white-space-separated words
    that should be removed from the text file.

    The remaining words in the text file (after splitting on
    non-letter/non-mark characters and removing stop words) will be
    printed to STDOUT, one word per line.

    The two files need to have the same character encoding, and STDOUT
    will be in that same encoding. The default encoding is utf8.

    =cut

    use strict;
    use warnings;
    use Getopt::Std;

    my %opt;
    my $Usage = "Usage: $0 -t # (to test)\n or: $0 [-e enc] stop.list text.file\n";

    getopts( 'e:t', \%opt ) and ( $opt{t} || @ARGV == 2 ) or die $Usage;

    my ( $stoptext, $textdata );
    my $enc = $opt{e} || 'utf8';
    binmode STDOUT, ":encoding($enc)";

    if ( $opt{t} ) {
        local $/ = "";  # empty string = "paragraph mode" for reading
        binmode DATA, ":encoding($enc)";
        $stoptext = <DATA>;
        $textdata = <DATA>;
        if ( $stoptext =~ /\&#\d+;/ ) {  # posting code on PM does this to data
            s/\&#(\d+);/chr($1)/eg for ( $stoptext, $textdata );
        }   # so turn numeric character entities back into utf8 characters
    }
    else {
        local $/;  # undef = "slurp mode" for reading
        open( STOP, "<:encoding($enc)", $ARGV[0] )
            or die "open failed for stoplist $ARGV[0]: $!\n";
        $stoptext = <STOP>;
        close STOP;
        open( TEXT, "<:encoding($enc)", $ARGV[1] )
            or die "open failed for textdata $ARGV[1]: $!\n";
        $textdata = <TEXT>;
        close TEXT;
    }

    my %stopword = map { $_ => undef } ( split ' ', $stoptext );

    for my $word ( split /[^\pL\pM]+/, $textdata ) {
        next if ( exists( $stopword{$word} ));
        print "$word\n";
    }

    __DATA__
    &#1601;&#1615;&#1608; &#1576;&#1614;&#1585; &#1576;&#1614;&#1586;

    &#1601;&#1614;&#1604;&#1615;&#1586;&#1616;&#1606; &#1576;&#1585;&#1604;&#1603;&#1608;&#1548; &#1601;&#1615;&#1608; &#1578;&#1616;&#1583;&#1616;&#1617;&#1604;&#1616;&#1610; &#1576;&#1614;&#1585;.
    &#1587;&#1615;&#1603;&#1615;&#1608;&#1606; &#1576;&#1614;&#1586; &#1605;&#1614;&#1604;&#1585;&#1616;&#1610;&#1567; &#1601;&#1615;&#1608;! &#1576;&#1614;&#1585;&#1548; &#1606;&#1614;&#1583; &#1576;&#1614;&#1586; &#1605;&#1616;&#1587;.

    Update: This ought to work in any language that uses white-space and punctuation to separate words, and it should work for any input encoding, provided that (a) the stop word list and text data are in the same language and same encoding, (b) you know how to identify the encoding, and (c) Encode supports it (a lot of encodings are supported, including all the Arabic ones).

      Thanks a lot, this was indeed very, very helpful ... I could solve my problem ... Thanks again ...
        Can I ask: I have a Unicode string; how do I make it appear in Arabic letters?
Re: arabic alphabet ... how to deal with?
by kennethk (Abbot) on Feb 12, 2009 at 16:41 UTC
    Read through perlunicode. All your I/O operations need to be performed in UTF-8. That means not only open (STOPWORDS, '<:encoding(UTF-8)', $ARGV[1]) as ForgotPasswordAgain suggests and open (INFILE, '<:encoding(UTF-8)', $ARGV[0]) as derby suggests, but also binmode STDOUT, ":encoding(utf8)" before you try to print. The fact that it works with "standard" text says it is almost guaranteed to be a Unicode problem.
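A minimal sketch of the advice above: decode on the way in, encode on the way out. The helper name read_words and the in-memory "file" are assumptions made for illustration only; with real data you would pass a file name instead of a scalar reference.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: read whitespace-separated words through a UTF-8
# decoding layer. $source is a file name or a reference to a byte string.
sub read_words {
    my ($source) = @_;
    open( my $fh, '<:encoding(UTF-8)', $source )
        or die "open failed: $!\n";
    my @words;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @words, split ' ', $line;
    }
    close $fh;
    return @words;
}

# Stand-in for a file on disk: the UTF-8 bytes of two short words.
my $bytes = encode( 'UTF-8', "\x{0627}\x{0632} \x{0628}\x{0647}\n" );
my @words = read_words( \$bytes );

# Encode characters again on the way out; without this, Perl warns
# about wide characters when printing.
binmode STDOUT, ':encoding(UTF-8)';
print "$_ (", length($_), " characters)\n" for @words;
```

Because the input layer decodes the bytes, length() counts characters rather than bytes, so each of the two-letter words reports a length of 2.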
      I tried it this way before as well; this way there is no output ;)

      #!/usr/bin/perl
      open (STOPWORDS, '<:encoding(UTF-8)', $ARGV[1]) || die "Error opening the stopwords file\n";
      $count = 0;
      while ($word = <STOPWORDS>) {
          chop($word);
          $stopword[$count] = lc($word);
          $count++;
      }
      close(STOPWORDS);

      open (INFILE ,'<:encoding(UTF-8)', $ARGV[0]) || die "Error opening the input file\n";
      while ($line = <INFILE>) {
          chop($line);
          @entry = split(/ /, $line);
          $i = 0;
          while ($entry[$i]) {
              $found = 0;
              $j = 0;
              while (($j<=$count) && ($found==0)) {
                  if (lc($entry[$i]) eq $stopword[$j]) {
                      $found = 1;
                  }
                  $j++;
              }
              if ($found == 0) {
                  print FH "$entry[$i]\n";
              }
              $i++;
          }
      }
      close(INFILE);
        In this case, you have an orphaned file handle FH which is never associated with a file or channel.

        Use Devel::Peek to get an ASCII-printable representation of the strings you're comparing, and then verify that what you think should match is in fact identical:

        use Devel::Peek;
        ...
        Dump lc($entry[$i]);
        Dump $stopword[$j];
        if (lc($entry[$i]) eq $stopword[$j]) {
        ...
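If the Devel::Peek output is more detail than you need, a lighter-weight check is to print each string as a list of code points. This is only a sketch: the helper name codepoints is made up for illustration, and the two sample strings are invented to show how an invisible character breaks an otherwise identical match.

```perl
use strict;
use warnings;

# Hypothetical helper: spell out a string as a list of Unicode code points.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $string;
}

my $from_text = "\x{FEFF}\x{0627}\x{0632}";  # word preceded by an invisible byte order mark
my $from_list = "\x{0627}\x{0632}";          # the same word without it

print codepoints($from_text), "\n";   # U+FEFF U+0627 U+0632
print codepoints($from_list), "\n";   # U+0627 U+0632
print $from_text eq $from_list ? "match\n" : "no match\n";   # no match
```

Both strings look the same in most editors and terminals, but the code point listing shows why eq fails.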
Re: arabic alphabet ... how to deal with?
by moritz (Cardinal) on Feb 12, 2009 at 18:28 UTC
    There are some possibilities of what can go wrong:
    • Character encoding. Others have commented on that, so I'll keep it brief: do you know exactly which character encodings your two text files are in? If so, is that source of information reliable? You can't work with text if you don't know how it's encoded, so you need to know for sure.
    • Data format: Are you sure that all files have the same line endings, and that they don't contain any non-printable characters that make your comparisons go awry? Arabic text may well contain bidi markers that your text editor doesn't show but that make Perl's string comparisons fail.
    • Normalization: If I'm informed correctly, the Arabic script makes heavy use of diacritic marks. That means that many characters have multiple representations (pre-composed, or a base character followed by separate diacritics). In that case Unicode::Normalize can help you.
    • Cultural misunderstanding: If you don't know the script and the language, it might be that things that look identical to you actually vary in small details, and the words you want to remove aren't actually part of your input data.
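The data-format and normalization points above can be sketched roughly as follows. The helper name canonical_word and the exact set of characters stripped are assumptions for illustration, not a complete list of invisible characters; NFC comes from the core Unicode::Normalize module. Applying the same function to both the stop list and the text is what makes the two comparable.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: bring a word into one canonical form before comparing.
sub canonical_word {
    my ($word) = @_;

    # Remove directional marks, embedding/override/isolate controls and a
    # stray byte order mark. ZERO WIDTH NON-JOINER (U+200C) is left alone
    # here because it is meaningful in Persian orthography.
    $word =~ s/[\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}\x{FEFF}]//g;

    # Compose base character + combining mark sequences where a
    # pre-composed character exists.
    return NFC($word);
}

my $decomposed  = "\x{200F}\x{0627}\x{0653}\x{0628}";  # RLM, ALEF + MADDAH ABOVE, BEH
my $precomposed = "\x{0622}\x{0628}";                  # ALEF WITH MADDA ABOVE, BEH

print canonical_word($decomposed) eq canonical_word($precomposed)
    ? "same word\n"
    : "different\n";
```

Whether NFC or NFD is the better choice depends on the data; what matters is that both inputs go through the same normalization.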
Re: arabic alphabet ... how to deal with?
by ForgotPasswordAgain (Deacon) on Feb 12, 2009 at 15:49 UTC

    Did you try printing out the stop words? You probably have to read it in utf8 mode:

    open (STOPWORDS, '<:encoding(UTF-8)', $ARGV[1])
      I did, but it still does not work ...
        What about the infile? That needs to be opened UTF-8 as well.
Re: arabic alphabet ... how to deal with?
by JavaFan (Canon) on Feb 12, 2009 at 15:17 UTC
    Perhaps you can explain what it does, and what you want it to do. Now it's just a bunch of code and "it doesn't work". I'm not going to spend time to invent some sample data to see what it does, and from that deduce what you want.
      My script is supposed to remove all stop words that I have in a file (second argument) from the text (first argument) and output the filtered text, one word per line ...
        Goodie. One step down. Several more to go. You get one more chance. What does your blob of code currently do?
Re: arabic alphabet ... how to deal with?
by Anonymous Monk on Feb 12, 2009 at 15:25 UTC
    You probably want chomp not chop to remove a newline from your stop words.
      It works well for English, even with chop; the problem is working with Arabic script (Farsi).
        OK, as you say, that's not the problem you have, but chomp will only remove newlines ($/) whereas chop will remove any last character.
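A small sketch of that difference. The two wrapper subs are hypothetical and exist only to make the comparison easy to run; the interesting case is a last line that has no trailing newline, where chop silently eats the final letter of the word.

```perl
use strict;
use warnings;

# Hypothetical wrappers that return the modified copy of a line.
sub strip_with_chop  { my ($s) = @_; chop $s;  return $s; }
sub strip_with_chomp { my ($s) = @_; chomp $s; return $s; }

my $with_newline    = "\x{0627}\x{0632}\n";  # a normal line
my $without_newline = "\x{0627}\x{0632}";    # e.g. the last line of a file

for my $line ( $with_newline, $without_newline ) {
    printf "chop leaves %d character(s), chomp leaves %d\n",
        length( strip_with_chop($line) ),
        length( strip_with_chomp($line) );
}
```

On the line that ends in a newline both leave 2 characters; on the line without one, chop leaves only 1, so that word can no longer match its stop-list entry.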
