Re: tr{}{} doesn't wanna work.. what am I doing wrong?

by moritz (Cardinal)
on Feb 24, 2012 at 13:31 UTC


in reply to tr{}{} doesn't wanna work.. what am I doing wrong?

In my experience, tr and Unicode don't mix well. Here's my approach (source code stored in UTF-8 encoding):
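use strict;
use warnings;
use utf8;
use 5.010;
use Unicode::Normalize qw/NFKD/;
binmode STDOUT, ':encoding(UTF-8)';

sub frob {
    # Decompose accented characters into base letter + combining marks
    my $str = NFKD(shift);
    # Remove the combining marks
    $str =~ s/\pM//g;
    # Replace everything else with "_". Note that the range A-z also
    # covers [ \ ] ^ _ and `, which is why "[]" survives in the output.
    $str =~ s/[^a-z-A-z0-9]/_/g;
    $str;
}

my $test = '&[]ÀÂÄàâäÇçÉÊÈËéêèëÏÌÎïìîÖÔÒöôòÜÛÙüûù?!;«»()" íóñÑáéóúÁÉÍÓÚ';
say frob $test;
__END__
[]AAAaaaCcEEEEeeeeIIIiiiOOOoooUUUuuu_________ionNaeouAEIOU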
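To make the normalize-then-strip step concrete, here is a minimal sketch of what NFKD does to individual characters. The `codepoints` helper is hypothetical, added only for illustration, and the characters are written as `\x{...}` escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use Unicode::Normalize qw/NFKD/;

# Hypothetical helper: list the code points of a string as U+XXXX.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

# U+00E9 (e with acute) decomposes into a base letter plus a combining mark.
my $e_acute = NFKD("\x{E9}");
print codepoints($e_acute), "\n";    # U+0065 U+0301

# Stripping marks (\pM) leaves only the base letter.
(my $stripped = $e_acute) =~ s/\pM//g;
print "$stripped\n";                 # e

# U+00D8 (O with stroke) has no decomposition, so NFKD leaves it unchanged
# and the final substitution in frob() would turn it into "_".
my $o_stroke = NFKD("\x{D8}");
print codepoints($o_stroke), "\n";   # U+00D8
```

So this approach maps a character to ASCII only when Unicode defines a decomposition for it; anything else ends up as an underscore.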

Update: Since several people misunderstood me, I feel I should clarify. I wrote that in my experience, Unicode and tr/// don't mix. Which is to say that tr/// isn't buggy, but I haven't encountered any code in the wild that correctly handles Unicode strings with tr///, because tr wasn't designed with Unicode in mind.

Re^2: tr{}{} doesn't wanna work.. what am I doing wrong?
by Eliya (Vicar) on Feb 24, 2012 at 14:19 UTC
    tr and Unicode don't mix well

    In what way?  Seems to work fine for me.  Could you provide an example that fails? (just curious)

    use Devel::Peek;
    my $test = "\x{2345}\x{3456}";
    Dump $test;
    $test =~ tr/\x{2345}\x{3456}/XY/;
    Dump $test;
    __END__
    SV = PV(0x768bc8) at 0x7907d8
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x782630 "\342\215\205\343\221\226"\0 [UTF8 "\x{2345}\x{3456}"]
      CUR = 6
      LEN = 8
    SV = PV(0x768bc8) at 0x7907d8
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x782630 "XY"\0 [UTF8 "XY"]
      CUR = 2
      LEN = 8

    (I'm only using \x{...} here because PM code sections don't support Unicode — it works the same way with a UTF-8 encoded source file when using "use utf8;")

      In what way?

      By not supporting Unicode-aware character classes. Listing all Unicode characters in a certain category by hand is usually a moot endeavor.

      The OP is the best example: it doesn't list all accented characters that could be ASCIIfied.

        By not supporting Unicode-aware character classes

        Well, tr/// doesn't support character classes in general (only certain kinds of ranges), so this is not specifically a Unicode problem, but a feature of tr///.   (I'd agree if you had said "tr and character classes don't mix well"...)

        What you're pointing out is kind of a different problem, i.e. doing sanitization based on picking out an incomplete list of individual characters as opposed to using a catch-all character class.
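The distinction drawn here can be sketched as follows: tr/// accepts explicit characters and code point ranges, including non-ASCII ones, while a catch-all class such as `\p{L}` is only available in a regex. The sample string is made up for illustration.

```perl
use strict;
use warnings;

# "abc", three Greek small letters (alpha, beta, gamma), and "123".
my $text = "abc \x{3B1}\x{3B2}\x{3B3} 123";

# tr/// with an explicit code point range (alpha .. omega).
# With an empty replacement list it only counts and leaves the string alone.
my $count_range = ($text =~ tr/\x{3B1}-\x{3C9}//);
print "$count_range\n";      # 3

# tr/// with an explicit character-to-character mapping.
(my $mapped = $text) =~ tr/\x{3B1}\x{3B2}\x{3B3}/abg/;
print "$mapped\n";           # abc abg 123

# A regex with a Unicode property matches letters of any script,
# without having to enumerate them.
my $count_letters = () = $text =~ /\p{L}/g;
print "$count_letters\n";    # 6
```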

        >>> tr and Unicode don't mix well
        >> In what way? Seems to work fine for me.
        > By not supporting Unicode-aware character classes,
        > and listing all Unicode characters in a certain category
        > is usually a moot endeavor.

        > The OP is the best example: it doesn't list all accented
        > characters that could be ASCIIfied.

        The original statement — that tr/// and Unicode don’t mix well — is FUD-raking nonsense. It’s baseless fear, uncertainty, and doubt, and we don’t need it.

        As for character classes, since tr/// never worked on character classes even back in the caveman-ASCII days, it is a strawman to complain that it doesn’t work on them now.

        Finally, the idea that there exists such a thing as an “accented character”, or that these can be meaningfully “ASCII-fied”, does not hold up.

        • How do you convert a £10-pound note or a 5¢-coin to ASCII?
        • How do you convert Ævar Arnfjörð Bjarmason to ASCII?
        • How do you convert φ ≠ π to ASCII?
        • How do you convert /ɪntɚˈnæʃənəl/ to ASCII?
        • How do you convert ♲ ♳ ♴ ♵ ♶ ♷ ♸ ♹ ♺ ♻ ♼ ♽ to ASCII?
        • How do you convert 👪 💗 🐪 to ASCII?
        • How do you convert my $ʇndʇno = uʍopəpᴉsdn($input) to ASCII?
        • How do you convert Allerød or ψ-ionone or 「文字化け」 to ASCII?
        • How do you convert ♀♂🜫⚩⚥ 🜭⚧🜥🜠⚨⚣🜤🜧🜦🜟⚤🜜⚦🜡⚢🜪 to ASCII?

        More importantly, why in the world do you want to? You can’t put the djinn back in the bottle and go back to a Beaver Cleaver world of a 52-character Latin alphabet that never existed in the first place. Even Gutenberg has 230 sorts, and he was the very first printer for heaven’s sake! If we cannot do at least as well as the very first printer from half a millennium ago, what does that say about us?

        I can only repeat the Bringhurst quote: The fact that such a character set was long considered adequate tells us something about the cultural narrowness of American civilization, or American technocracy, in the midst of the twentieth century.

        Guess what? Unlike Beaver Cleaver himself, we are no longer in the midst of the twentieth century, so why should we strive to recreate that Neverland that never was?

        I say that we’re better than that, and I’m proud of that fact. To see such obvious Ludditism amongst soi-disant technologists is very troubling. What sort of example are we setting for the future?

Re^2: tr{}{} doesn't wanna work.. what am I doing wrong?
by jhourcle (Prior) on Feb 24, 2012 at 14:25 UTC

    I'd recommend this approach because no matter what you put into your list, you're going to miss a character, especially as new characters are added (who cared about € 14 years ago?). What happens if someone inserts Å? á? Ø? Æ? ð? þ? —? „? Kanji? Mathematical symbols?

    You're *much* better off listing the characters that you want to keep, and removing all others, if only because it's less work to maintain in the long run, as you don't have to worry about adding new characters, or if someone's messed up the encoding of the script.
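A minimal sketch of the keep-list idea, assuming the goal is a filename-safe string. The `keep_listed` helper and the particular allowed set (ASCII letters, digits, dot, underscore, hyphen) are assumptions for illustration, not something specified in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only characters on an explicit allow-list
# and replace everything else with "_".
sub keep_listed {
    my ($str) = @_;
    # The trailing "-" in the class is a literal hyphen, not a range.
    (my $clean = $str) =~ s/[^A-Za-z0-9._-]/_/g;
    return $clean;
}

# "résumé €5.txt", written with escapes to stay independent of file encoding.
my $name = "r\x{E9}sum\x{E9} \x{20AC}5.txt";
print keep_listed($name), "\n";   # r_sum___5.txt
```

Because the list names what is kept, characters added to Unicode later need no code change; they are simply replaced like everything else outside the list.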

Re^2: tr{}{} doesn't wanna work.. what am I doing wrong?
by ultranerds (Hermit) on Feb 24, 2012 at 14:18 UTC
    Hi, Thanks for your suggestion :) For some reason it gives me a weird output, compared to yours?
    C:\Users\Andy>perl test2.pl
    Malformed UTF-8 character (unexpected non-continuation byte 0xc2, immediately after start byte 0xc0) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xc4, immediately after start byte 0xc2) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xe0, immediately after start byte 0xc4) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xe2, immediately after start byte 0xe0) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xe4, immediately after start byte 0xe2) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xc7, immediately after start byte 0xe4) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xe7, immediately after start byte 0xc7) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xc9, immediately after start byte 0xe7) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xca, immediately after start byte 0xc9) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xc8, immediately after start byte 0xca) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xcb, immediately after start byte 0xc8) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xe9, immediately after start byte 0xcb) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xea, immediately after start byte 0xe9) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xe8, immediately after start byte 0xea) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xeb, immediately after start byte 0xe8) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xcf, immediately after start byte 0xeb) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xcc, immediately after start byte 0xcf) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xce, immediately after start byte 0xcc) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xef, immediately after start byte 0xce) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xec, immediately after start byte 0xef) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xee, immediately after start byte 0xec) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xd6, immediately after start byte 0xee) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xd4, immediately after start byte 0xd6) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xd2, immediately after start byte 0xd4) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xf6, immediately after start byte 0xd2) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xf4, immediately after start byte 0xf6) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xf2, immediately after start byte 0xf4) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xdc, immediately after start byte 0xf2) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xdb, immediately after start byte 0xdc) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xd9, immediately after start byte 0xdb) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xfc, immediately after start byte 0xd9) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xfb, immediately after start byte 0xfc) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xf9, immediately after start byte 0xfb) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0x3f, immediately after start byte 0xf9) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected continuation byte 0xab, with no preceding start byte) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected continuation byte 0xbb, with no preceding start byte) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xf3, immediately after start byte 0xed) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xf1, immediately after start byte 0xf3) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xd1, immediately after start byte 0xf1) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xe1, immediately after start byte 0xd1) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xe9, immediately after start byte 0xe1) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xf3, immediately after start byte 0xe9) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xfa, immediately after start byte 0xf3) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xc1, immediately after start byte 0xfa) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xc9, immediately after start byte 0xc1) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xcd, immediately after start byte 0xc9) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0xd3, immediately after start byte 0xcd) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (unexpected non-continuation byte 0x0a, immediately after start byte 0xd3) in subroutine entry at test2.pl line 10.
    Malformed UTF-8 character (1 byte, need 2, after start byte 0xda) in subroutine entry at test2.pl line 10.
    _[]__________________________________________________________
    C:\Users\Andy>
    Any ideas? TIA! Andy

      Check the encoding of your input data. Decode all data (to unicode) before operating on it in any way.

        Thanks, that did the trick :) I needed to convert $filename:

        utf8::decode($filename);

        Working like a charm now - thanks! :)
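For readers hitting the same warnings, here is a small sketch of the bytes-versus-characters difference behind the fix. The sample byte strings are made up for illustration; `Encode::decode` is shown next to `utf8::decode` only as a commonly used alternative, not as what the poster used.

```perl
use strict;
use warnings;
use Encode qw/decode/;

# The two bytes 0xC3 0xA9 are the UTF-8 encoding of U+00E9 (e with acute).
my $bytes = "\xC3\xA9";
print length($bytes), "\n";       # 2: still undecoded bytes

# Encode::decode returns a new string of characters.
my $chars = decode('UTF-8', $bytes);
print length($chars), "\n";       # 1

# utf8::decode converts in place and reports whether the input was valid UTF-8.
my $in_place = "\xC3\xA9";
my $ok = utf8::decode($in_place);
print length($in_place), "\n";    # 1

# Bytes that are not valid UTF-8 (for example Latin-1 data) make it return
# false, which is a cheap way to notice a wrong guess about the encoding.
my $latin1 = "\xC0\xC2";
my $bad_ok = utf8::decode($latin1);
print $bad_ok ? "valid\n" : "not UTF-8\n";   # not UTF-8
```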
