I agree strongly with everything moritz said. Also, I'm surprised that no one has recommended that you use a hash for your stop-word list, because that will make things a lot simpler (and speed things up as well). And I'm baffled by the fact that you are using the "lc()" function, considering that case distinctions do not exist at all in Arabic letters.
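To make the hash suggestion concrete, here is a minimal sketch. The word lists and variable names are made up for illustration, and they are ASCII only to keep the example readable. The last two lines show why lc() buys nothing for Arabic text: the letters have no case mapping, so the string comes back unchanged.

```perl
use strict;
use warnings;

# Hypothetical stop words and tokens (not from the original question).
my @stop_list = qw(the a of);
my @tokens    = qw(the cat of a thousand names);

# Build the lookup table once; only the keys matter.
my %is_stop = map { $_ => 1 } @stop_list;

# One hash probe per token, instead of scanning @stop_list for each token.
my @kept = grep { !$is_stop{$_} } @tokens;
print "$_\n" for @kept;

# Arabic FEH + DAMMA + WAW: lc() returns the same string.
my $arabic = "\x{0641}\x{064F}\x{0648}";
print "lc() changed nothing\n" if lc($arabic) eq $arabic;
```

With a list of a few hundred stop words, the hash version does one lookup per token no matter how long the stop list grows.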

You should look around for some tools that will help you get better acquainted with your data files and the characters they're made of. If the stop word list and data files are really utf8, you might want to run them through a couple of scripts that have been posted here at the monastery: tlu -- TransLiterate Unicode and unichist -- count/summarize characters in data. They can help you check whether the files are really utf8, whether they are really comparable, and whether they have any strange, unexpected properties.
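If you just want a quick yes/no answer to "is this really utf8?" before reaching for those scripts, a strict decode with the core Encode module is one way to get it. This is only a sketch: the helper name is_valid_utf8 is my own invention, and in practice you would feed it the raw bytes read from your file rather than the literal strings used here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    return eval { decode( 'UTF-8', $copy, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

my $good = "\xD9\x81";    # the UTF-8 bytes for ARABIC LETTER FEH (U+0641)
my $bad  = "\xDD\xE6";    # a lead byte followed by a non-continuation byte

print "good: ", is_valid_utf8($good), "\n";
print "bad:  ", is_valid_utf8($bad),  "\n";
```

A file that fails this check is probably in some single-byte Arabic encoding, and the fix is to name that encoding when you open it.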

Anyway, try this out, and if "it doesn't work", you will need to explain and/or show evidence to clarify exactly how it fails. I'm including some test data, and you should be able to run it yourself in "test mode" to confirm that it works on the test data, so if it doesn't work on your stop list and data file, the problem is with your data (or your misunderstanding of the data), not with the code. (The stop list and text data I provided for testing are pure nonsense, of course, because I don't know any Persian, but it's all in utf8 Arabic letters and diacritics, and even some punctuation. And since the PM code tags force wide characters into numeric entities, I added some code to convert the test data back into utf8 characters.)

#!/usr/bin/perl

=head1 NAME

stopword-filter

=head1 SYNOPSIS

 stopword-filter [-e encoding] stop.list text.file

 stopword-filter -t   # (runs a simple test on internal utf8 data)

=head1 DESCRIPTION

The stop.list file should contain a set of white-space-separated words
that should be removed from the text file.  The remaining words in the
text file (after splitting on non-letter/non-mark characters and removing
stop words) will be printed to STDOUT, one word per line.

The two files need to have the same character encoding, and STDOUT
will be in that same encoding.  The default encoding is utf8.

=cut

use strict;
use warnings;
use Getopt::Std;

my %opt;
my $Usage = "Usage: $0 -t  # (to test)\n   or: $0 [-e enc] stop.list text.file\n";

getopts( 'e:t', \%opt ) and ( $opt{t} || @ARGV == 2 ) or die $Usage;

my ( $stoptext, $textdata );
my $enc = $opt{e} || 'utf8';
binmode STDOUT, ":encoding($enc)";

if ( $opt{t} ) {
    local $/ = "";  # empty string = "paragraph mode" for reading
    binmode DATA, ":encoding($enc)";
    $stoptext = <DATA>;
    $textdata = <DATA>;
    if ( $stoptext =~ /\&#\d+;/ ) {  # posting code on PM does this to data
        s/\&#(\d+);/chr($1)/eg for ( $stoptext, $textdata );
    }   # so turn numeric character entities back into utf8 characters
}
else {
    local $/;  # undef = "slurp mode" for reading
    open( STOP, "<:encoding($enc)", $ARGV[0] )
        or die "open failed for stoplist $ARGV[0]: $!\n";
    $stoptext = <STOP>;
    close STOP;
    open( TEXT, "<:encoding($enc)", $ARGV[1] )
        or die "open failed for textdata $ARGV[1]: $!\n";
    $textdata = <TEXT>;
    close TEXT;
}
my %stopword = map { $_ => undef } ( split ' ', $stoptext );

for my $word ( split /[^\pL\pM]+/, $textdata ) {
    next if ( exists( $stopword{$word} ));
    print "$word\n";
}

__DATA__
&#1601;&#1615;&#1608; &#1576;&#1614;&#1585; &#1576;&#1614;&#1586;

&#1601;&#1614;&#1604;&#1615;&#1586;&#1616;&#1606; &#1576;&#1585;&#1604;&#1603;&#1608;&#1548; &#1601;&#1615;&#1608; &#1578;&#1616;&#1583;&#1616;&#1617;&#1604;&#1616;&#1610; &#1576;&#1614;&#1585;.
&#1587;&#1615;&#1603;&#1615;&#1608;&#1606; &#1576;&#1614;&#1586; &#1605;&#1614;&#1604;&#1585;&#1616;&#1610;&#1567; &#1601;&#1615;&#1608;! &#1576;&#1614;&#1585;&#1548; &#1606;&#1614;&#1583; &#1576;&#1614;&#1586; &#1605;&#1616;&#1587;.
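In case the split pattern in the script looks odd: it treats both letters (\pL) and combining marks (\pM) as word characters, because Arabic vowel diacritics are marks, not letters. The sketch below uses a made-up two-word string to compare that pattern with a letters-only split; the variable names are mine.

```perl
use strict;
use warnings;

# FEH+DAMMA+WAW, ARABIC COMMA, space, BEH+FATHA+REH, "!"
my $text = "\x{0641}\x{064F}\x{0648}\x{060C} \x{0628}\x{064E}\x{0631}!";

# Letters and marks both count as word characters: diacritics stay attached.
my @words = split /[^\pL\pM]+/, $text;

# Letters only: every diacritic becomes a separator and cuts the word in two.
my @pieces = split /\PL+/, $text;

print scalar(@words),  " words with [^\\pL\\pM]+\n";
print scalar(@pieces), " pieces with \\PL+\n";
```

If your stop list is written with diacritics and your text is not (or the other way around), the words still will not match, so that is one of the things worth checking in your data.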

Update: This ought to work in any language that uses white-space and punctuation to separate words, and it should work for any input encoding, provided that (a) the stop word list and text data are in the same language and same encoding, (b) you know how to identify the encoding, and (c) Encode supports it (a lot of encodings are supported, including all the Arabic ones).
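As a rough illustration of point (c), you can ask Encode whether it knows an encoding name before passing it to -e, and you can see that the same characters come out regardless of which supported encoding the bytes were in. This is a sketch only; cp1256 is just one example of an Arabic encoding, and the variable names are my own.

```perl
use strict;
use warnings;
use Encode qw(encode decode find_encoding);

# find_encoding() returns an object for a known name, undef otherwise.
for my $name ( 'UTF-8', 'cp1256', 'no-such-encoding' ) {
    printf "%-18s %s\n", $name,
        find_encoding($name) ? 'supported' : 'not supported';
}

# The same two letters (FEH, WAW) as bytes in two different encodings.
my $word       = "\x{0641}\x{0648}";
my $utf8_bytes = encode( 'UTF-8',  $word );
my $cp_bytes   = encode( 'cp1256', $word );

# Once decoded, both are the same character string, so hash lookups match.
my $same = decode( 'UTF-8', $utf8_bytes ) eq decode( 'cp1256', $cp_bytes );
print "decoded strings match\n" if $same;
```

That is why the script decodes on input and only compares character strings: the stop-word hash never sees raw bytes.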

In reply to Re: arabic alphabet ... how to deal with? by graff
in thread arabic alphabet ... how to deal with? by Anonymous Monk