PerlMonks  

Re: Safe string handling

by Your Mother (Chancellor)
on Aug 25, 2017 at 18:21 UTC (#1198029)


in reply to Safe string handling

Dealing with data that comes from webpages can be really complicated. There is likely to be a combination of ASCII, UTF-8, and wide characters in the data returned.

ASCII is valid UTF-8, so you cannot have a combination of UTF-8 and ASCII in a string. You just have UTF-8. "Wide characters" is ambiguous here. It seems to mean broken or unknown bytes that are putative character data. This doesn't happen much in the wild anymore. When it does, you see pages littered with �s. So I don't think this situation is "likely." I can't think of the last time I saw it.

This "Hello\x{26c4}".encode("utf-8","\x{26f0}")."\x{10102}\x{2fa1b}" is broken on purpose (a Perl character string concatenated with UTF-8 encoded bytes and then another Perl character string). This can only happen through incorrect handling of character data encodings, which, I assert, is fairly uncommon on the web today.

Perhaps I am misunderstanding. Can you give a live example of a site that your tool is meant to fix?
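To make the "broken on purpose" construction concrete, here is a minimal sketch using only the core Encode module. The variable names are mine, not the OP's, and the "clean" variant is just one plausible way to repair the string, assuming the middle piece really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string: each element is one Unicode code point.
my $chars = "Hello\x{26c4}";

# Byte string: U+26F0 encoded as three UTF-8 octets (E2 9B B0).
my $bytes = encode("UTF-8", "\x{26f0}");

# Mixing the two: Perl treats the three octets as three separate
# characters (U+00E2, U+009B, U+00B0) once they are concatenated.
my $mixed = $chars . $bytes . "\x{10102}\x{2fa1b}";

# Consistent handling: decode the octets before concatenating.
my $clean = $chars . decode("UTF-8", $bytes) . "\x{10102}\x{2fa1b}";

printf "mixed: %d characters\n", length $mixed;   # 11
printf "clean: %d characters\n", length $clean;   # 9
```

The two extra characters in the mixed string are the mojibake: one mountain became three Latin-1 range characters.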

Replies are listed 'Best First'.
Re^2: Safe string handling
by tdlewis77 (Sexton) on Aug 26, 2017 at 00:50 UTC

    I used "wide characters" here in the same way that Perl does when it says "Wide character in print". You can have two-byte ("\x{26c4}") and four-byte ("\x{2fa1b}") wide characters.

    https://en.wikipedia.org/wiki/Wide_character

      This is an output layer encoding problem though; no more, no less. I think you have probably evolved your practices based on an incomplete understanding of encoding issues. I encourage you to post an actual problem you think this solves so the monks can better advise.
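As a rough illustration of the output layer point, the sketch below writes a string containing code points above 0xFF in two ways: by encoding explicitly, and by attaching an :encoding(UTF-8) layer to the handle. The in-memory filehandle is only a stand-in for STDOUT or a file so the result can be inspected; the variable names are assumptions for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters, both above 0xFF. Printing this to a handle with no
# encoding layer is what triggers "Wide character in print".
my $text = "\x{26c4}\x{2fa1b}";

# Option 1: encode explicitly, then write the octets.
my $octets = encode("UTF-8", $text);

# Option 2: let an encoding layer on the handle do the work.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

printf "%d characters, %d octets\n", length($text), length($octets);
printf "layer wrote %d octets\n", length($buffer);
```

For STDOUT itself, the usual equivalent of option 2 is binmode(STDOUT, ':encoding(UTF-8)') before printing.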

Re^2: Safe string handling
by tdlewis77 (Sexton) on Aug 26, 2017 at 00:44 UTC
    This tool has been evolving over the course of several years. Every time I encounter some weirdness that breaks it, I've enhanced it. I recently rewrote it from scratch to incorporate everything I learned along the way. Offhand I can't tell you that there is a single site that has all the weirdness in my "broken on purpose" example; however, I can tell you that I've encountered websites that have mixed things up in ways that were never intended. At this point, I think my tool handles everything I've ever encountered and is ready for anything that I haven't yet encountered. Even if you've only encountered well-behaved websites, there still is way to tell Perl to give you the sixth UTF-8 character from a string as in the "$snowman" example.
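On fetching the sixth character: a minimal sketch, assuming the input arrives as UTF-8 octets and is decoded before any indexing. The sample string here is made up for illustration and is not the "$snowman" example from the root node.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Hello", then U+26C4 as UTF-8 octets (E2 9B 84), then "!".
my $octets = "Hello\xE2\x9B\x84!";

# After decoding, the string is a sequence of characters, not octets.
my $text = decode("UTF-8", $octets);

# substr counts characters on a decoded string; offset 5 is the sixth.
my $sixth = substr($text, 5, 1);

printf "sixth character: U+%04X\n", ord $sixth;   # U+26C4
printf "%d octets in, %d characters out\n", length($octets), length($text);
```

Run on the undecoded octets instead, the same substr call would return the single octet 0xE2.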

      Can you give us URLs to some example websites?

        Betcha the OP is decoding entity references without first decoding utf-8. That would produce the "mixed" encoding he's claiming to see.
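The ordering problem guessed at above can be reproduced in a few lines. This is only a sketch: expand_hex_entities is a hypothetical stand-in that handles nothing but hexadecimal numeric references, where a real program would more likely use a module such as HTML::Entities.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: expands only references of the form &#xHHHH;.
sub expand_hex_entities {
    my ($s) = @_;
    $s =~ s/&#x([0-9a-fA-F]+);/chr(hex($1))/ge;
    return $s;
}

# Raw octets as fetched: U+26F0 as UTF-8 (E2 9B B0), a space,
# and an entity reference for U+26C4.
my $raw = "\xE2\x9B\xB0 &#x26c4;";

# Wrong order: entities expanded while the rest is still octets.
# The result holds three undecoded octets next to a real character.
my $wrong = expand_hex_entities($raw);

# Right order: decode UTF-8 first, then expand entities.
my $right = expand_hex_entities(decode("UTF-8", $raw));

printf "wrong order: %d characters\n", length $wrong;   # 5
printf "right order: %d characters\n", length $right;   # 3
```

The wrong-order result has the same shape as the "broken on purpose" string: character data and UTF-8 octets side by side in one scalar.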
