in reply to Re^4: UTF8 versus \w in pattern matching (basic test)
in thread UTF8 versus \w in pattern matching

The Dumper output appears to be encoded in ISO-8859-1, not UTF-8. That's strange.

Greetings,
-jo

$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

Re^6: UTF8 versus \w in pattern matching (basic test)
by haj (Curate) on Jul 06, 2021 at 17:54 UTC

    That's not strange. You're seeing Unicode code points, which for the characters in question happen to be identical to their ISO-8859-1 encodings. Add "\N{EURO SIGN}" to the string and you get "\x{20ac}": that is again the code point, not a UTF-8 encoding.
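
To make the distinction between code points and encodings concrete, here is a minimal sketch. The variable names are illustrative, and it does not reproduce the original poster's test. It prints the code points of a two-character string, then the bytes that the string becomes under UTF-8 and (for the e-acute alone) under ISO-8859-1.

```perl
use strict;
use warnings;
use charnames ':full';          # needed for \N{...} on Perls older than 5.16
use Encode qw(encode);

# A decoded (character) string: e-acute followed by the euro sign.
my $e_acute = "\N{LATIN SMALL LETTER E WITH ACUTE}";
my $chars   = $e_acute . "\N{EURO SIGN}";

# Code points of each character; these are numbers, not an encoding.
my @codepoints = map { sprintf 'U+%04X', ord($_) } split //, $chars;

# The same string encoded to UTF-8 bytes.
my @utf8_bytes = map { sprintf '%02X', ord($_) } split //, encode('UTF-8', $chars);

# The e-acute alone encoded to ISO-8859-1: one byte, numerically equal to its code point.
my @latin1_bytes = map { sprintf '%02X', ord($_) } split //, encode('ISO-8859-1', $e_acute);

print "code points: @codepoints\n";    # U+00E9 U+20AC
print "UTF-8 bytes: @utf8_bytes\n";    # C3 A9 E2 82 AC
print "Latin-1:     @latin1_bytes\n";  # E9
```

The e-acute's code point (E9) matches its ISO-8859-1 byte but not its UTF-8 bytes, which is why a dump of code points can look like Latin-1. The euro sign, which has no single-byte code point, breaks the resemblance.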

    "Everything is UTF-8" is one of the most frequent false assumptions I encounter when dealing with non-ASCII characters.

      Thanks for the clarification.

      Greetings,
      -jo

Re^6: UTF8 versus \w in pattern matching (basic test)
by ikegami (Pope) on Jul 06, 2021 at 21:07 UTC

    You didn't tell Perl to encode the output, so it didn't. The characters are being output unencoded. For example, a character with a value of E9 is output as the byte E9. You are mistaking this lack of encoding for encoding using ISO-8859-1.
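
The effect described above can be sketched with two in-memory filehandles, used here only so the written bytes can be inspected. The handle and variable names are illustrative assumptions, not taken from the thread. One handle has no encoding layer, the other has an explicit `:encoding(UTF-8)` layer.

```perl
use strict;
use warnings;

my $char = "\x{E9}";   # one character with the value E9 (e-acute)

# Handle without an encoding layer: the character value is written as a single byte.
my $raw = '';
open my $plain, '>', \$raw or die "open: $!";
print {$plain} $char;
close $plain;

# Handle with an explicit UTF-8 layer: Perl encodes the character on output.
my $encoded = '';
open my $utf8, '>:encoding(UTF-8)', \$encoded or die "open: $!";
print {$utf8} $char;
close $utf8;

# Hex dump of what each handle actually received.
my $raw_hex     = join ' ', map { sprintf '%02X', ord($_) } split //, $raw;
my $encoded_hex = join ' ', map { sprintf '%02X', ord($_) } split //, $encoded;

print "no layer:        $raw_hex\n";      # E9
print ":encoding(UTF-8) $encoded_hex\n";  # C3 A9
```

For ordinary output, the same layer can be put on STDOUT with `binmode(STDOUT, ':encoding(UTF-8)')`. Whether UTF-8 is the right choice depends on what the terminal or the consuming program expects.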
