mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

I would like to have \w match even Unicode letters, but it is not doing so. With the following code,

#!/usr/bin/perl
use strict;
use warnings;

my $a;
$a = "/i/áéíóúz/pl";
print qq(1: ),$a,qq(\n);
($a) = ($a =~ m/^([\/\p{Word}]+)/);
print qq(2: ),$a,qq(\n);
exit(0);

I get the following output,

1: /i/áéíóúz/pl
2: /i/

What I expected and would like to find a way to achieve would be the following instead,

1: /i/áéíóúz/pl
2: /i/áéíóúz/pl

If I add use utf8; then the letters with diacritical marks also disappear from the first line of output.

What have I missed?

Re: UTF8 versus \w in pattern matching
by Corion (Pope) on Jul 06, 2021 at 09:31 UTC

    Is your source file encoded as UTF-8?

    Personally, I prefer to use charnames; and then to use \N{...} escapes in my source code for non-ASCII constants:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use charnames ':full';
    binmode STDOUT, ':utf8';

    my $a;
    $a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
    print qq(1: ),$a,qq(\n);
    ($a) = ($a =~ m/^([\/\p{Word}]+)/);
    print qq(2: ),$a,qq(\n);

    This prints the following for me:

    1: /i/áéíóúz/pl
    2: /i/áéíóúz/pl

      Swapping out the value of $a with "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl" gives me the same output as before on my setups, both with use utf8; and without: the diacriticals are not matched by \w either way. If it matters, it is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-gnu-thread-multi on Ubuntu 21.04 on the one system and perl 5, version 28, subversion 1 (v5.28.1) built for arm-linux-gnueabihf-thread-multi-64int on Raspbian GNU/Linux 10 (buster).

        Time for some tests, then:

        use strict;
        use warnings;
        use Test::More tests => 2;

        my $str = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}\N{LATIN SMALL LETTER Y WITH ACUTE}z/pl";
        my $re = qr/^([\/\p{Word}]+)/;
        like $str, $re, 'Matched';
        $str =~ $re;
        is $1, $str, 'Capture group 1';

        Both pass here on v5.20.3 x86_64-linux-thread-multi.


        🦛

      Is your source file encoded as UTF-8?

      Yes, I am reading many UTF-8 files. As part of an earlier project, I have ensured that the input really is UTF-8. However, on two different systems, I get the problem that \w does not match any non-ASCII letters.

Re: UTF8 versus \w in pattern matching (basic test)
by LanX (Sage) on Jul 06, 2021 at 11:02 UTC
    Works for me.

    I'd say your file's encoding is not what you think it is.

    use strict;
    use warnings;
    use Data::Dumper;
    use utf8;

    my $str = " 1 i á \x{3C3} _ ";   # \x{3C3} = small sigma
    warn Dumper $str;
    $str =~ s/\w+//g;                # delete all alpha-nums
    warn Dumper $str;
    warn "WORKS!" if $str =~ m/^ +$/;

    C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/utf8.pl
    $VAR1 = " 1 i \x{e1} \x{3c3} _ ";
    $VAR1 = ' ';
    WORKS! at d:/tmp/pm/utf8.pl line 12.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    *) PM has problems displaying unicode characters like "σ" inside code tags

      Thanks. That snippet works, as-is, but still the text I am getting does not. The data is fetched over HTTP from WordPress. If I save the file and run the 'file' utility, I get the output "HTML document, UTF-8 Unicode text" for everything. Yet, when I process the file with perl, the \w pattern misses non-ASCII letters.

        How do you fetch and process the file? Your original code example has no use utf8; and does not UTF-8-encode the output. You get your original string back only because two errors cancel each other out:
        • Your file is UTF-8-encoded, but you don't declare this to Perl. Perl reads the individual bytes of the UTF-8 encoding, which are not word characters and thus won't match \w.
        • You just print the bytes. If you are using a UTF-8 terminal, this "works" because the terminal decodes your bytes.

        Perl's default encoding is not UTF-8. If you read the file and decode it from UTF-8 you should be fine. If you fetch with LWP, you can either print $response->content (without encoding it) or encode $response->decoded_content before printing.
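A minimal sketch of the decode-then-match step described above, using the core Encode module. The byte string is only a stand-in for whatever arrives over HTTP, and the variable names ($bytes, $chars and so on) are illustrative, not taken from the original code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the raw, undecoded UTF-8 octets of a fetched document
# (assumption: this is what something like $response->content would hold).
my $bytes     = "/i/\xC3\xA1\xC3\xA9z/pl";    # "/i/áéz/pl" as UTF-8 octets
my $byte_len  = length($bytes);

# Matching against the raw octets: the pattern sees bytes, not letters.
my ($from_bytes) = $bytes =~ m/^([\/\p{Word}]+)/;

# Decode the octets into Perl characters, then match again.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length($chars);
my ($from_chars) = $chars =~ m/^([\/\p{Word}]+)/;

printf "undecoded: matched %d of %d bytes\n", length($from_bytes), $byte_len;
printf "decoded:   matched %d of %d characters\n", length($from_chars), $char_len;

# Encode again on the way out so the terminal receives UTF-8 octets.
print encode('UTF-8', $from_chars), "\n";
```

The same idea applies to files: open them with an :encoding(UTF-8) layer so the strings are already decoded when the regex sees them.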

        Please use Data::Dumper for basic debugging, like demonstrated.
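One way to do that kind of inspection, as a rough sketch: dump the same data before and after decoding and compare. The input here is a made-up two-byte example, and setting $Data::Dumper::Useqq is just one convenient option for making non-ASCII content visible as escapes.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Data::Dumper;

$Data::Dumper::Useqq = 1;    # show non-ASCII content as escapes
$Data::Dumper::Terse = 1;    # omit the "$VAR1 = " prefix

# Hypothetical input: the UTF-8 octets for "á" as they might come off the wire.
my $octets        = "\xC3\xA1";
my $octet_len     = length($octets);
my $dumped_octets = Dumper($octets);

# The same data after decoding into a single character.
my $text        = decode('UTF-8', $octets);
my $text_len    = length($text);
my $dumped_text = Dumper($text);

print "undecoded: $dumped_octets";
print "decoded:   $dumped_text";
print "lengths:   $octet_len vs $text_len\n";
```

If the dump of your real data looks like the undecoded case (two units where you expect one letter), the string has not been decoded yet.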

        Check your input, output and code.

        We can't do this for you ...

Re: UTF8 versus \w in pattern matching
by jo37 (Hermit) on Jul 06, 2021 at 09:30 UTC

    Unable to reproduce: With use utf8; the second line of output has all the diacritical characters - as well as the first line.

    Greetings,
    -jo

Re: UTF8 versus \w in pattern matching
by ikegami (Pope) on Jul 06, 2021 at 20:36 UTC

    Problem #1: You didn't tell Perl that the source file is encoded as UTF-8, which is done with use utf8;.

    Problem #2: You didn't tell Perl how to encode the output for your terminal, which is done with something like use open ':std', ':encoding(UTF-8)';.

    Finally, you mention \w. Because of a bug, \w doesn't always match characters in the U+7F..U+FF range. This bug is fixed with use 5.014;. That said, you actually used \p{Word}, which isn't affected by this bug.
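Putting those three points together, the original script might look roughly like this. It is a sketch that assumes the file itself is saved as UTF-8; the variable names $path and $matched are illustrative replacements for the original $a.

```perl
#!/usr/bin/perl
use 5.014;                              # also enables strict and say
use warnings;
use utf8;                               # the source file is UTF-8
use open ':std', ':encoding(UTF-8)';    # encode STDOUT/STDERR, decode STDIN

my $path = "/i/áéíóúz/pl";
say "1: $path";

# With the declarations above, \w behaves like \p{Word} here.
my ($matched) = $path =~ m/^([\/\w]+)/;
say "2: $matched";
```

Note that use utf8; only affects string literals in the source file. Data read from files or the network still needs to be decoded separately.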

    Seeking work! You can reach me at ikegami@adaelis.com

      Thanks. That part about the U+7F..U+FF range explains things. I see that m/^([\/\-\_\.\p{Word}\x7f-\xff]+)$/ matches, and m/^([\/\-\_\.\p{Word}]+)$/ does not. I presume that is because the data upstream might really be ISO-8859-15 and not UTF-8? Should I try to convert the U+7F..U+FF range into UTF-8 before further processing? If so, how?

        Read again. The part about U+7F..U+FF applies to \w, but not to \p{Word}.

        # Sometimes \w doesn't match U+E9.
        $ perl -Mfeature=say -e'say "\xE9" =~ /^\w/ || 0'
        0

        # Sometimes it does.
        $ perl -Mfeature=say -e'say "\xE9\x{2660}" =~ /^\w/ || 0'
        1

        # \w always matches U+E9 with "use 5.014;".
        $ perl -Mfeature=say -e'use 5.014; say "\xE9" =~ /^\w/ || 0'
        1

        # \p{Word} always matches U+E9, period.
        $ perl -Mfeature=say -e'say "\xE9" =~ /^\p{Word}/ || 0'
        1

        Your questions make absolutely no sense since you're not using \w. And I said the fix for that bug was to add use 5.014;, not to convert the input.
