in reply to Unpredictable output from Text::ExtractWords
I had a quick look at the code out of interest. The problem occurs in the str_normalize function. The code is buggy and it does not check boundary conditions properly.
Below is the line of the code that caused the problem.
My advise is to ditch this module, and use simple Perl implementation. The single line below does pretty much what the ExtractWords.xs code does in hundred lines of C.
Below is the line of the code that caused the problem.
The code detects a possible space character, and tries to validate it. But the single character 'I' and the spaces surrounding it caused logical malfunction in the space_words function. Thus <space> is not recognized as separator for the words.if(isalpha(*(s-1)) && strchr(chrsep, *s) && isalpha(*(s+1))) { if(space_words(s, *s)) { char c = *s; while(*s) { if(*s == c) s++; else if(!isalpha(*s)) break; *p = *s; s++; p++; } } }
My advise is to ditch this module, and use simple Perl implementation. The single line below does pretty much what the ExtractWords.xs code does in hundred lines of C.
And the output is#!/usr/bin/perl use strict; use warnings; use Data::Dumper; my @strings = ( q|one two 'three' one|, q|yes i am 'sme'|, q|x21 y z|, ); for my $str (@strings){ my %hash; for ( map { length >= 3 ? $_ : () } $str =~ /(\w+)/g ) { $hash{$_}++; } print Dumper(\%hash); }
$VAR1 = { 'three' => 1, 'one' => 2, 'two' => 1 }; $VAR1 = { 'yes' => 1, 'sme' => 1 }; $VAR1 = { 'x21' => 1 };
|
---|
In Section
Seekers of Perl Wisdom