Re: Safe string handling

by Your Mother (Bishop)
on Aug 25, 2017 at 18:21 UTC (#1198029)

in reply to Safe string handling

Dealing with data that comes from webpages can be really complicated. There is likely to be a combination of ASCII, UTF-8, and wide characters in the data returned.

ASCII is valid UTF-8, so you cannot have a combination of UTF-8 and ASCII in a string; you just have UTF-8. "Wide characters" is ambiguous here. It seems to mean broken or unknown bytes that are putative character data. This doesn't happen much in the wild anymore; when it does, you see pages littered with �s. So I don't think this situation is "likely." I can't think of the last time I saw it.
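A minimal sketch of that point, using the core Encode module. The sample strings are made up for illustration. Strict decoding of pure ASCII bytes as UTF-8 succeeds and changes nothing, and ASCII followed by the UTF-8 bytes for U+26C4 is still a single UTF-8 byte string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Die on malformed input, and leave the source string untouched
# (without LEAVE_SRC, decode() with a CHECK value empties its argument).
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Pure ASCII bytes: strict UTF-8 decoding succeeds and changes nothing.
my $ascii_bytes = "Hello, world";
my $decoded     = decode('UTF-8', $ascii_bytes, $check);
print "ASCII round-trips unchanged\n" if $decoded eq $ascii_bytes;

# ASCII text followed by the UTF-8 bytes for U+26C4 is still just one
# UTF-8 byte string, not a "mix" of two encodings.
my $bytes = "Hello" . "\xE2\x9B\x84";
my $chars = decode('UTF-8', $bytes, $check);
printf "%d bytes, %d characters\n", length($bytes), length($chars);
```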

This "Hello\x{26c4}".encode("utf-8","\x{26f0}")."\x{10102}\x{2fa1b}" is broken on purpose (a Perl character string, then UTF-8 encoded bytes, then another Perl character string, all concatenated). This can only happen through incorrect handling of character data encodings, which, I assert, is fairly uncommon on the web today.
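To see what that concatenation does, here is a rough sketch. The variable names are mine, not the OP's. Perl treats each byte of the encoded piece as a Latin-1 character when it joins it to character strings, so the one mountain character turns into three unrelated ones.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "Hello\x{26c4}";              # character string (decoded text)
my $bytes = encode('UTF-8', "\x{26f0}");  # byte string: E2 9B B0
my $tail  = "\x{10102}\x{2fa1b}";         # character string again

# Concatenation treats each byte of $bytes as a Latin-1 character, so the
# mountain becomes three characters (U+00E2, U+009B, U+00B0).
my $mixed = $chars . $bytes . $tail;

# The same text built from characters only.
my $clean = $chars . "\x{26f0}" . $tail;

printf "mixed: %d characters, clean: %d characters\n",
    length($mixed), length($clean);
```

The two lengths differ by two, which is the double-encoding damage in miniature.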

Perhaps I am misunderstanding. Can you give a live example of a site that your tool is meant to fix?

Update: s/ASII/ASCII/;

Replies are listed 'Best First'.
Re^2: Safe string handling
by tdlewis77 (Sexton) on Aug 26, 2017 at 00:50 UTC

    I used "wide characters" here in the same way that Perl does when it says "Wide character in print". You can have two-byte ("\x{26c4}") and four-byte ("\x{2fa1b}") wide characters.

      This is an output layer encoding problem though; no more, no less. I think you have probably evolved your practices based on an incomplete understanding of encoding issues. I encourage you to post an actual problem you think this solves so the monks can better advise.
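A small sketch of the output-layer fix, assuming the destination expects UTF-8. The in-memory filehandle is only there so the encoded bytes can be counted; it is not part of anyone's tool. Printing a character above 0xFF to a handle with no encoding layer is what triggers "Wide character in print"; declaring the layer once makes Perl encode on the way out.

```perl
use strict;
use warnings;

my $snowman = "\x{26c4}";   # one character, code point above 0xFF

# Declare the output encoding once; without it, printing $snowman to
# STDOUT would warn "Wide character in print".
binmode(STDOUT, ':encoding(UTF-8)');

# Same idea on an in-memory handle, so the encoded bytes can be inspected.
my $out = '';
open(my $fh, '>:encoding(UTF-8)', \$out)
    or die "cannot open in-memory handle: $!";
print {$fh} $snowman;
close($fh);

printf "%d character became %d bytes\n", length($snowman), length($out);
```

Note that U+26C4 takes three bytes once encoded as UTF-8.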

Re^2: Safe string handling
by tdlewis77 (Sexton) on Aug 26, 2017 at 00:44 UTC
    This tool has been evolving over the course of several years. Every time I've encountered some weirdness that breaks it, I've enhanced it. I recently rewrote it from scratch to incorporate everything I learned along the way. Offhand I can't tell you that there is a single site that has all the weirdness in my "broken on purpose" example. However, I can tell you that I've encountered websites that have mixed things up in ways that were never intended. At this point, I think my tool handles everything I've ever encountered and is ready for anything that I haven't yet encountered. Even if you've only encountered well-behaved websites, there still is no way to tell Perl to give you the sixth UTF-8 character from a string as in the "$snowman" example.
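For comparison, a minimal sketch of how the built-in substr behaves once the input has been decoded, assuming the page is well-formed UTF-8. The sample bytes are hypothetical. After decode(), offsets and lengths count characters, not bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they might arrive from a well-formed UTF-8 page (hypothetical data).
my $bytes = "Hello\xE2\x9B\x84!";

# Decode once at the input boundary; after that Perl counts characters.
my $text  = decode('UTF-8', $bytes);
my $sixth = substr($text, 5, 1);   # offsets are zero-based

printf "sixth character is U+%04X\n", ord($sixth);
```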

      Can you give us URLs to some example websites?

        Betcha the OP is decoding entity references without first decoding utf-8. That would produce the "mixed" encoding he's claiming to see.
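A sketch of how that ordering mistake could produce a "mixed" string. The decode_numeric_entities helper below is a hypothetical stand-in for a real entity decoder and only handles hex numeric references; the input bytes are invented.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for an HTML entity decoder (hypothetical helper); it only handles
# numeric hex references such as &#x26F0;.
sub decode_numeric_entities {
    my ($s) = @_;
    $s =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;
    return $s;
}

# Raw page bytes: a snowman as UTF-8 bytes, a space, a mountain as an entity.
my $raw = "\xE2\x9B\x84 &#x26F0;";

# Wrong order: entities expanded on undecoded bytes. The snowman stays as
# three byte-valued characters sitting next to a real U+26F0.
my $wrong = decode_numeric_entities($raw);

# Right order: decode UTF-8 first, then expand entities.
my $right = decode_numeric_entities(decode('UTF-8', $raw));

printf "wrong order: %d characters, right order: %d characters\n",
    length($wrong), length($right);
```

In the wrong-order string, encoding the whole thing as UTF-8 on output would double-encode the snowman, which matches the kind of damage described in this thread.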
