http://www.perlmonks.org?node_id=117147

Kozz has asked for the wisdom of the Perl Monks concerning the following question:

I have a question regarding the Text::ParseWords module, which is standard with the latest versions of Perl. I've studied the perlman page on this module, but I am still confused about why it seems to "break" in my test case. Here's what I wanted to do:
use Text::ParseWords;

my $heredoc = <<END_OF_HEREDOC;
The Text::ParseWords module is the most recent module with which I've struggled.
END_OF_HEREDOC

my @words = quotewords('\s+', 0, $heredoc);
foreach (@words) {
    print $_ . "\n";
}
The problem seems to be that it outputs NOTHING; no words at all. The root of the problem lies in the single apostrophe in the string, which is part of a contraction ("I've"). If I escape the single quote by putting a backslash in front of it, then everything is output fine. I could use a regex to place backslashes in front of apostrophes, but are these the only characters that cause this problem? Am I not quite using this module correctly? It seems silly to have to escape the characters -- why won't it just work?
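
The escaping workaround described above can be sketched roughly as follows. Judging by the parser regex quoted in the first reply below, the characters treated specially are the two quote characters and the backslash, so this sketch escapes all three. The sample string and the variable names are illustrative assumptions, not part of the original post.

```perl
use strict;
use warnings;
use Text::ParseWords;

# Illustrative input with a lone apostrophe (no trailing newline).
my $text = "the most recent module with which I've struggled.";

# Unbalanced apostrophe: quotewords gives back an empty list.
my @raw = quotewords('\s+', 0, $text);

# Backslash-escape quotes and backslashes before parsing.
# With $keep set to 0, quotewords strips the backslashes again.
(my $escaped = $text) =~ s/([\\"'])/\\$1/g;
my @words = quotewords('\s+', 0, $escaped);

print scalar(@raw), " words without escaping\n";
print "$_\n" for @words;
```

Note that this turns off quote handling entirely, so it only makes sense when quoted phrases never need to stay together as one word.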

Replies are listed 'Best First'.
Re: Text::ParseWords
by derby (Abbot) on Oct 06, 2001 at 05:22 UTC
    Brother Kozz,
    You have committed no sins; your soul is clean. Looking at Text::ParseWords, the regex is kind of hairy and doesn't handle the case you've thrown at it. (The line numbers are mine.)
    1.  while (length($line)) {
    2.    ($quote, $quoted, undef, $unquoted, $delim, undef) =
    3.      $line =~ m/^(["'])                # a $quote
    4.            ((?:\\.|(?!\1)[^\\])*)    # and $quoted text
    5.            \1                        # followed by the same quote
    6.            ([\000-\377]*)            # and the rest
    7.           |                          # --OR--
    8.            ^((?:\\.|[^\\"'])*?)      # an $unquoted text
    9.            (\Z(?!\n)|(?-x:$delimiter)|(?!^)(?=["'])) # plus EOL, delimiter, or quote
    10.           ([\000-\377]*)            # the rest
    11.          /x;                        # extended layout
    12.   return() unless( $quote || length($unquoted) || length($delim));
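    In your case, we fail the first part of the regex (lines 3-5) because there is only one quote character, with no matching closing quote. We also fail the second part of the regex (lines 8-10) because, by the time the loop reaches the apostrophe, the remaining $line starts with a quote: the unquoted pattern on line 8 cannot consume a quote, and the (?!^) on line 9 rules out a quote at the very start of the string. Since the match fails, $quote, $unquoted and $delim are all undef, so the test on line 12 returns an empty list.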
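
A small sketch of the two outcomes described above: a balanced pair of quotes parses normally, while a lone apostrophe makes the whole call return an empty list. The input strings and variable names are made up for illustration, and checking for an empty result against non-empty input is just one possible way to notice the failure.

```perl
use strict;
use warnings;
use Text::ParseWords;

# Balanced quotes: the quoted span is kept together as one word.
my @ok = quotewords('\s+', 0, q{say 'hello world' twice});

# Unbalanced quote: neither alternative of the regex matches once
# the parser reaches the apostrophe, so an empty list comes back.
my $input = q{I've struggled};
my @bad   = quotewords('\s+', 0, $input);

print join('|', @ok), "\n";

# An empty result for non-empty input signals a parse failure.
warn "could not parse: $input\n" if !@bad && length $input;
```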
    Maybe some of the more regex-y inclined brothers can think of a replacement regex. It hurts me just to read the ones like this.
    -derby
Re: Text::ParseWords
by lestrrat (Deacon) on Oct 06, 2001 at 05:49 UTC

    Ummm, you seem to have what's called an "Unmatched quote" here >> "I've"

    Yep, that single quote's going to get you. Try using the module without it. It worked for me (albeit slowly... perhaps because I have some ooold perl)
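
If quote handling is not needed at all, plain whitespace splitting sidesteps the problem. The following is a minimal sketch using Perl's built-in split in place of Text::ParseWords. This is a swapped-in technique, not something proposed in the thread, and it assumes quoted phrases never need to be kept together.

```perl
use strict;
use warnings;

my $text = "The Text::ParseWords module is the most recent module "
         . "with which I've struggled.\n";

# split ' ' splits on runs of whitespace and ignores leading
# whitespace; trailing empty fields are dropped as well.
my @words = split ' ', $text;

print "$_\n" for @words;
```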