in reply to Re^3: UTF8 versus \w in pattern matching (basic test)
in thread UTF8 versus \w in pattern matching

Using Data::Dumper in the following,
use utf8;
use Data::Dumper;
use strict;
use warnings;

my $a;
$a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}"
   . "\N{LATIN SMALL LETTER E WITH ACUTE}"
   . "\N{LATIN SMALL LETTER I WITH ACUTE}"
   . "\N{LATIN SMALL LETTER O WITH ACUTE}"
   . "\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
print Dumper($a);
I get this output:
$VAR1 = "/i/\x{e1}\x{e9}\x{ed}\x{f3}\x{fa}z/pl";

Re^5: UTF8 versus \w in pattern matching (basic test)
by LanX (Sage) on Jul 06, 2021 at 15:23 UTC
    So that looks correct ...

    ...but, as I said, check your input and output too.

    If you've not "fetched" the web data correctly, it will show in the dump.

    That's basic debugging!
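
One way to follow this advice is to dump the data before and after decoding. The sketch below is only an illustration, not code from the thread: `dump_string` is a made-up helper, and `$raw` is a hypothetical stand-in for fetched web data (the two UTF-8 bytes of "é"). It assumes the data really is UTF-8.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(decode);

# Dump a string with non-printable and non-ASCII characters escaped.
# (Hypothetical helper, named for this example only.)
sub dump_string {
    my ($str) = @_;
    local $Data::Dumper::Useqq = 1;   # escape instead of printing raw chars
    local $Data::Dumper::Terse = 1;   # omit the "$VAR1 = " prefix
    return Dumper($str);
}

# Hypothetical stand-in for fetched web data: the UTF-8 bytes of "é".
my $raw     = "\xC3\xA9";
my $decoded = decode('UTF-8', $raw);

print "raw:     ", dump_string($raw);
print "decoded: ", dump_string($decoded);
```

If the fetched data was never decoded, the dump shows two escaped bytes where one character was expected; after decoding, it shows the single code point `\x{e9}`, matching the dump of the `\N{...}` string above.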

    Cheers Rolf
    (addicted to the Perl Programming Language :)

Re^5: UTF8 versus \w in pattern matching (basic test)
by jo37 (Hermit) on Jul 06, 2021 at 16:18 UTC

    The Dumper output shows an encoding in ISO 8859-1, not UTF-8. That's strange.

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

      That's not strange. You're seeing Unicode code points, which for the characters in question happen to be identical to their ISO-8859-1 encodings. Add "\N{EURO SIGN}" to the string and you get "\x{20ac}". Again, that's the code point, not a UTF-8 encoding.
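
The difference between a code point and its UTF-8 encoding can be sketched as follows. This is a minimal illustration rather than code from the thread: the helper names `codepoints_hex` and `utf8_bytes_hex` are invented for the example, and it assumes Perl 5.16 or later so that `\N{...}` names work without loading `charnames`.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the code points of a character string as space-separated hex.
sub codepoints_hex {
    my ($str) = @_;
    return join ' ', map { sprintf '%x', ord($_) } split //, $str;
}

# Return the bytes of the string's UTF-8 encoding as space-separated hex.
sub utf8_bytes_hex {
    my ($str) = @_;
    my $bytes = encode('UTF-8', $str);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

for my $char ("\N{LATIN SMALL LETTER E WITH ACUTE}", "\N{EURO SIGN}") {
    printf "code point: %-5s UTF-8 bytes: %s\n",
        codepoints_hex($char), utf8_bytes_hex($char);
}
```

For "é" the code point is e9 while the UTF-8 bytes are c3 a9; for the euro sign the code point is 20ac and the UTF-8 bytes are e2 82 ac. Only the first case coincides with an ISO-8859-1 byte value.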

      "Everything is UTF-8" is one of the most frequent false assumptions I encounter when dealing with non-ASCII characters.

        Thanks for the clarification.

        Greetings,
        -jo

        $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

      You didn't tell Perl to encode the output, so it didn't. The chars are being output unencoded. For example, a character with a value of E9 is output as E9. You are mistaking this lack of encoding for encoding using iso-8859-1.
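
To see what "telling Perl to encode the output" changes, the sketch below writes the same character through two different PerlIO layers and inspects the bytes that come out. It is an illustration under stated assumptions, not code from the thread: `bytes_written` is a made-up helper, and an in-memory filehandle stands in for STDOUT or a real file.

```perl
use strict;
use warnings;

# Write $str to an in-memory file through the given PerlIO layer and
# return the bytes that actually came out, as space-separated hex.
# (Hypothetical helper, named for this example only.)
sub bytes_written {
    my ($layer, $str) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory file: $!";
    print {$fh} $str;
    close $fh or die "Cannot close in-memory file: $!";
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

my $e_acute = "\N{LATIN SMALL LETTER E WITH ACUTE}";
print "no encoding layer: ", bytes_written(':raw', $e_acute), "\n";
print "UTF-8 layer:       ", bytes_written(':encoding(UTF-8)', $e_acute), "\n";
```

Without an encoding layer the character E9 goes out as the single byte e9; with `:encoding(UTF-8)` it goes out as c3 a9. For real output, the equivalent step would be something like `binmode(STDOUT, ':encoding(UTF-8)')` before printing.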

      Seeking work! You can reach me at ikegami@adaelis.com