PerlMonks  

Re^3: UTF8 versus \w in pattern matching (basic test)

by LanX (Sage)
on Jul 06, 2021 at 12:56 UTC (#11134707)


in reply to Re^2: UTF8 versus \w in pattern matching (basic test)
in thread UTF8 versus \w in pattern matching

Please use Data::Dumper for basic debugging, as demonstrated.

Check your input, output and code.

We can't do this for you ...

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Re^4: UTF8 versus \w in pattern matching (basic test)
by mldvx4 (Monk) on Jul 06, 2021 at 13:03 UTC
    Using Data::Dumper in the following,
    use utf8;
    use Data::Dumper;
    use strict;
    use warnings;

    my $a;
    $a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
    print Dumper($a);
    I get this output:
    $VAR1 = "/i/\x{e1}\x{e9}\x{ed}\x{f3}\x{fa}z/pl";
      So that looks correct ...

      ...but as I said, check your input and output too.

      If you've not "fetched" the web data correctly, it will show in the dump.

      That's basic debugging!
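
A minimal sketch of what such a check might look like. The `$raw` value is a hypothetical stand-in for fetched data (the UTF-8 bytes of "é"), not anything from the original script; the point is only that undecoded input shows up in the dump as more "characters" than expected.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(decode);

$Data::Dumper::Useqq = 1;   # escape non-ASCII so it is visible in the dump

# Hypothetical stand-in for fetched data: the UTF-8 bytes of "é".
my $raw     = "\xC3\xA9";
my $decoded = decode('UTF-8', $raw);

print Dumper($raw);       # two bytes: the data is still encoded
print Dumper($decoded);   # one character, U+00E9

# The lengths differ because $raw holds bytes and $decoded holds characters.
printf "%d vs %d\n", length($raw), length($decoded);
```

If the dump of your input looks like the first case, the data was never decoded, and `\w` will be matching against bytes rather than characters.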

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      The Dumper output shows an encoding in ISO 8859-1, not UTF-8. That's strange.

      Greetings,
      -jo

      $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

        That's not strange. You're seeing Unicode codepoints, which for the characters in question happen to be identical to their ISO-8859-1 encodings. Add "\N{EURO SIGN}" to the string and you get "\x{20ac}". Again, that's the codepoint, not a UTF-8 encoding.
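
To make the difference concrete, here is a small sketch that prints the codepoints of a string next to its UTF-8 bytes. The helper names `codepoints_hex` and `utf8_bytes_hex` are made up for this illustration; `encode` comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Codepoints of a character string, as hex.
sub codepoints_hex {
    my ($str) = @_;
    return join ' ', map { sprintf '%x', ord($_) } split //, $str;
}

# UTF-8 bytes of the same string, as hex.
sub utf8_bytes_hex {
    my ($str) = @_;
    my $bytes = encode('UTF-8', $str);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $str = "\N{LATIN SMALL LETTER E WITH ACUTE}\N{EURO SIGN}";
print codepoints_hex($str), "\n";   # e9 20ac
print utf8_bytes_hex($str), "\n";   # c3 a9 e2 82 ac
```

The first line matches what Dumper shows; the second is what a UTF-8 encoding of the same string would look like.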

        "Everything is UTF-8" is one of the most frequent false assumptions I encounter when dealing with non-ASCII characters.

        You didn't tell Perl to encode the output, so it didn't. The chars are being output unencoded. For example, a character with a value of E9 is output as E9. You are mistaking this lack of encoding for encoding using iso-8859-1.
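
A rough sketch of that effect, writing to an in-memory filehandle so the bytes can be inspected. The helper `bytes_written` is hypothetical and exists only for this demonstration; in a real script you would more likely call `binmode(STDOUT, ':encoding(UTF-8)')` once near the top.

```perl
use strict;
use warnings;

# Print $str to an in-memory file through the given layer and
# return the bytes that were written, as hex.
sub bytes_written {
    my ($str, $layer) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open: $!";
    print {$fh} $str;
    close $fh;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buf;
}

my $e_acute = "\x{e9}";
print bytes_written($e_acute, ''), "\n";                  # e9
print bytes_written($e_acute, ':encoding(UTF-8)'), "\n";  # c3 a9
```

Without a layer the character E9 goes out as the single byte E9; only with an encoding layer does it become the two-byte UTF-8 sequence.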

        Seeking work! You can reach me at ikegami@adaelis.com
