PerlMonks  

Unpredictable output from Text::ExtractWords

by wfsp (Abbot)
on Aug 16, 2005 at 12:14 UTC ( [id://484134]=perlquestion )

wfsp has asked for the wisdom of the Perl Monks concerning the following question:

Windows XP Home, ActiveState Perl 5.8.7, Text::ExtractWords v0.07

I've found this module very useful but have come across what appears to be an inconsistency. Can anyone confirm my output or identify what I'm doing wrong?

#!/usr/bin/perl
use strict;
use warnings;
use Text::ExtractWords qw(words_count);

my @strings = (
    q|one two 'three'|,
    q|yes I am 'me'|,
);

my %config = (
    minlen => 3,
    maxlen => 32,
    locale => "en_US.ISO_8859-1",
);

for my $str (@strings){
    print "$str\n";
    my %hash;
    words_count(\%hash, $str, \%config);
    for my $key (sort keys %hash){
        print "$key -> $hash{$key}\n";
    }
    print '-' x 10, "\n";
}

__DATA__
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" text_extractwords.pl
one two 'three'
one -> 1
three -> 1
two -> 1
----------
yes I am 'me'
yesiam'me -> 1
----------
> Terminated with exit code 0.

Thanks in advance
John

Replies are listed 'Best First'.
Re: Unpredictable output from Text::ExtractWords
by Roger (Parson) on Aug 16, 2005 at 14:44 UTC
    I had a quick look at the code out of interest. The problem occurs in the str_normalize function. The code is buggy and it does not check boundary conditions properly.

    Below is the part of the code that causes the problem.

    if(isalpha(*(s-1)) && strchr(chrsep, *s) && isalpha(*(s+1))) {
        if(space_words(s, *s)) {
            char c = *s;
            while(*s) {
                if(*s == c) s++;
                else if(!isalpha(*s)) break;
                *p = *s;
                s++;
                p++;
            }
        }
    }
    The code detects a possible separator character and tries to validate it. But the single-character word 'I' and the spaces surrounding it trip up the logic in the space_words function, so the space is not recognized as a separator between the words.

    My advice is to ditch this module and use a simple Perl implementation. The single line below does pretty much what the ExtractWords.xs code does in a hundred lines of C.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my @strings = (
        q|one two 'three' one|,
        q|yes i am 'sme'|,
        q|x21 y z|,
    );

    for my $str (@strings){
        my %hash;
        for ( map { length >= 3 ? $_ : () } $str =~ /(\w+)/g ) {
            $hash{$_}++;
        }
        print Dumper(\%hash);
    }
    And the output is
    $VAR1 = {
              'three' => 1,
              'one' => 2,
              'two' => 1
            };
    $VAR1 = {
              'yes' => 1,
              'sme' => 1
            };
    $VAR1 = {
              'x21' => 1
            };
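
    The loop above only enforces a minimum length. As a minimal sketch, assuming the goal is to mirror the minlen and maxlen settings from the original question, the same idea can be wrapped in a small function. The name count_words and its option handling are hypothetical and not part of Text::ExtractWords; note also that \w matches digits and underscores as well as letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): count the words in $str
# whose length lies between minlen and maxlen, inclusive.
sub count_words {
    my ($str, %opt) = @_;
    # Defaults are assumptions chosen for this sketch.
    my $min = defined $opt{minlen} ? $opt{minlen} : 1;
    my $max = defined $opt{maxlen} ? $opt{maxlen} : 32;
    my %count;
    for my $word ($str =~ /(\w+)/g) {
        my $len = length $word;
        next if $len < $min or $len > $max;
        $count{$word}++;
    }
    return \%count;
}

for my $str (q|one two 'three'|, q|yes I am 'me'|) {
    my $count = count_words($str, minlen => 3, maxlen => 32);
    print "$str\n";
    print "$_ -> $count->{$_}\n" for sort keys %$count;
    print '-' x 10, "\n";
}
```

    With minlen set to 3, the second string yields only 'yes', since 'I', 'am' and 'me' are all too short.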

Re: Unpredictable output from Text::ExtractWords
by gjb (Vicar) on Aug 16, 2005 at 12:40 UTC

    I've run your code on the cygwin Perl v5.8.7 built for cygwin-thread-multi-64int on Windows XP Pro with similar results. I've also tried fiddling with the config parameters, all to no avail.

    Looking at the documentation, I get suspicious, since there is practically none. The code is implemented in C and frankly, I don't feel like going through it now.

    I'd recommend you submit a bug report to the author of Text::ExtractWords and meanwhile resort to some other module, such as String::Tokenizer on CPAN, if that's fast enough for you.

    Hope this helps, -gjb-

Re: Unpredictable output from Text::ExtractWords
by socketdave (Curate) on Aug 16, 2005 at 13:51 UTC
    I can duplicate this behavior on 5.8.7 on Gentoo. It looks like 'I am' is the cause. Case doesn't matter, but white space does. Putting two spaces before, after, or between these words corrects the problem. I didn't spot anything obvious in the source for the module, but I'm japh, not a C guy.
Re: Unpredictable output from Text::ExtractWords
by epoptai (Curate) on Aug 16, 2005 at 15:37 UTC
    Lingua::EN::Splitter to the rescue:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Lingua::EN::Splitter;

    my @strings = (
        q|one two 'three'|,
        q|yes I am 'me'|,
    );

    my %config = (
        minlen => 3,
        maxlen => 32,
        locale => "en_US.ISO_8859-1",
    );

    my $split = Lingua::EN::Splitter->new;
    my %count;

    for my $str (@strings){
        print "$str\n";
        my $words = $split->words($str);
        $count{$_}++ for @$words;
        for my $word (sort @$words) {
            $_ = length $word;
            next if $_ < $config{minlen} or $_ > $config{maxlen};
            print $word, ' -> ', $count{$word}, "\n"
        }
        print '-' x 10, "\n";
    }

    __END__
    Output:
    one two 'three'
    one -> 1
    three -> 1
    two -> 1
    ----------
    yes I am 'me'
    yes -> 1

    --
    perl -MO=Deparse -e"u j t S n a t o e h r , e p l r a h k c r e"

Re: Unpredictable output from Text::ExtractWords
by extremely (Priest) on Aug 16, 2005 at 17:14 UTC
    Change your string list to this:
    my @strings = (
        q|a ab abc abcd abcde|,
        q|abcde abcd abc ab a|,
        q|abcde a abc ab abcd|,
        q|yyy xzz wwww|,
        q|one two 'three'|,
    );
    The C code strips single-letter words, two-letter words, and "words" of repeating characters as non-text. Probably not very helpful. It is also buggy in how it handles the stripping, as someone else mentioned, so pick a better module! :/
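
    For illustration, the filtering described above can be approximated in pure Perl. This is a sketch based only on that description, not a port of the C code: the helper name keep_word is hypothetical, and "repeating characters" is assumed here to mean a word made up of a single character repeated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $word would survive the filtering described
# above. This approximates the description, not the module's actual logic.
sub keep_word {
    my ($word) = @_;
    return 0 if length($word) < 3;      # drop one- and two-letter words
    return 0 if $word =~ /^(.)\1+$/;    # drop a single character repeated
    return 1;
}

for my $str (q|a ab abc abcd abcde|, q|yyy xzz wwww|) {
    my @kept = grep { keep_word($_) } $str =~ /(\w+)/g;
    print "$str => @kept\n";
}
```

    Under these assumptions 'xzz' is kept while 'yyy' and 'wwww' are dropped; whether the module treats 'xzz' the same way is not confirmed here.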

    --
    $you = new YOU;
    honk() if $you->love(perl)
