PerlMonks  

Re^4: UTF8 versus \w in pattern matching (basic test)

by mldvx4 (Friar)
on Jul 06, 2021 at 13:03 UTC [id://11134708]


in reply to Re^3: UTF8 versus \w in pattern matching (basic test)
in thread UTF8 versus \w in pattern matching

Using Data::Dumper in the following,
use utf8;
use Data::Dumper;
use strict;
use warnings;

my $a;
$a = "/i/\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LETTER I WITH ACUTE}\N{LATIN SMALL LETTER O WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}z/pl";
print Dumper($a);
I get this output:
$VAR1 = "/i/\x{e1}\x{e9}\x{ed}\x{f3}\x{fa}z/pl";

Re^5: UTF8 versus \w in pattern matching (basic test)
by LanX (Saint) on Jul 06, 2021 at 15:23 UTC
    So that looks correct ...

    ...but, as I said, check your input and output too.

    If you've not "fetched" the web data correctly, it will show in the dump.

    That's basic debugging!

    Cheers Rolf
    (addicted to the Perl Programming Language :)

Re^5: UTF8 versus \w in pattern matching (basic test)
by jo37 (Deacon) on Jul 06, 2021 at 16:18 UTC

    The Dumper output shows an encoding in ISO 8859-1, not UTF-8. That's strange.

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

      That's not strange. You're seeing Unicode code points, which for the characters in question happen to be identical to their ISO-8859-1 byte values. Add "\N{EURO SIGN}" to the string and you get "\x{20ac}": again, that is the code point, not a UTF-8 encoding.
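
To make the distinction between code points and encodings concrete, here is a minimal sketch. It assumes the core Encode module and uses variable names that are purely illustrative (they are not from the original post). It prints the code points of a two-character string, then the bytes produced only after an explicit encode step.

```perl
use strict;
use warnings;
use charnames ':full';    # harmless on modern Perls, needed for \N{...} on old ones
use Encode qw(encode);

# A decoded (character) string: one character below U+0100, one above.
my $str = "\N{LATIN SMALL LETTER E WITH ACUTE}\N{EURO SIGN}";

# Code points of the characters; no encoding is involved here.
my @code_points = map { sprintf("U+%04X", ord($_)) } split //, $str;

# Bytes that exist only after explicitly encoding the string.
my @utf8_bytes = map { sprintf("%02X", ord($_)) } split //, encode("UTF-8", $str);

# ISO-8859-1 can represent the e-acute but not the euro sign,
# so only the first character is encoded here.
my @latin1_bytes = map { sprintf("%02X", ord($_)) }
    split //, encode("ISO-8859-1", "\N{LATIN SMALL LETTER E WITH ACUTE}");

print "code points:   @code_points\n";     # U+00E9 U+20AC
print "UTF-8 bytes:   @utf8_bytes\n";      # C3 A9 E2 82 AC
print "Latin-1 bytes: @latin1_bytes\n";    # E9
```

The Latin-1 byte E9 matches the code point U+00E9 only because ISO-8859-1 maps the first 256 code points to identical byte values; the euro sign has no such coincidence.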

      "Everything is UTF-8" is one of the most frequent false assumptions I encounter when dealing with non-ASCII characters.

        Thanks for the clarification.

        Greetings,
        -jo


      You didn't tell Perl to encode the output, so it didn't. The chars are being output unencoded. For example, a character with a value of E9 is output as E9. You are mistaking this lack of encoding for encoding using iso-8859-1.
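
The point about unencoded output can be sketched as follows. This is an illustration under stated assumptions, not the original poster's code: `bytes_written` is a hypothetical helper, and in-memory filehandles stand in for STDOUT so the written bytes can be inspected.

```perl
use strict;
use warnings;

my $char = "\x{E9}";    # e with acute, code point U+00E9

# Hypothetical helper: print $text to an in-memory handle opened with
# $mode and return the bytes that were written, as hex.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $buffer;
}

# No encoding layer: the character value E9 is written as the byte E9.
my $raw = bytes_written('>', $char);

# With an encoding layer, Perl encodes the character on output.
my $encoded = bytes_written('>:encoding(UTF-8)', $char);

print "no layer:    $raw\n";        # E9
print "UTF-8 layer: $encoded\n";    # C3 A9
```

For real output, the same layer can be applied to an existing handle, for example with `binmode(STDOUT, ':encoding(UTF-8)')`.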

      Seeking work! You can reach me at ikegami@adaelis.com
