PerlMonks  

Re^3: Safely removing Unicode zero-width spaces and other non-printing characters

by haj (Vicar)
on Dec 04, 2019 at 10:37 UTC


in reply to Re^2: Safely removing Unicode zero-width spaces and other non-printing characters
in thread Safely removing Unicode zero-width spaces and other non-printing characters

It is input decoding that matters here. There is no way to convert incoming data to UTF-8 without handling the original encoding of each individual input. The issue with harvesting from different sites is that their encodings can be 1) different and 2) simply broken for a few of the sites.
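To make the per-input decoding concrete, here is a minimal sketch. The helper name decode_input and the sample byte strings are made up for illustration; it assumes you know (or can look up) the charset each site declares. Encode::FB_CROAK makes decode() die on malformed input, so broken sources get caught instead of silently producing garbage.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw bytes using the charset declared by the source.
# Returns the decoded character string, or undef if the bytes are not valid
# in that encoding (or the encoding name is unknown).
sub decode_input {
    my ($bytes, $charset) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { decode($charset, $copy, Encode::FB_CROAK) };
    return $text;
}

# The same character (NO-BREAK SPACE, U+00A0) arrives as different bytes:
my $from_latin1 = decode_input("a\xA0b",     'ISO-8859-1');
my $from_utf8   = decode_input("a\xC2\xA0b", 'UTF-8');

# A lone 0xA0 byte is not valid UTF-8, so this source counts as "broken":
my $broken      = decode_input("a\xA0b",     'UTF-8');
```

Both good inputs end up as the same three-character string, while $broken is undef and can be logged or skipped for that site.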

Your code snippet s/\x{00A0}/ /gm; just works once all input has been properly decoded into Perl's "character" semantics (I avoid calling it UTF-something because that is misleading), protected by the error handling of the Encode module.
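A small sketch of why the decoding step matters for that substitution. The sample bytes are invented for illustration: UTF-8 encoded text containing a NO-BREAK SPACE (U+00A0) and a ZERO WIDTH SPACE (U+200B).

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "cafe" (with e-acute), NBSP, "au", ZWSP, space, "lait"
my $bytes = "caf\xC3\xA9\xC2\xA0au\xE2\x80\x8B lait";

# On undecoded bytes the regex only sees the single byte 0xA0 and leaves
# the 0xC2 lead byte behind, so the result is no longer valid UTF-8.
(my $mangled = $bytes) =~ s/\x{00A0}/ /g;

# After decoding, each code point is one character and the regex does what
# it says. FB_CROAK makes decode() die on malformed input.
my $text = decode('UTF-8', $bytes, Encode::FB_CROAK);
$text =~ s/\x{00A0}/ /g;    # NO-BREAK SPACE becomes a plain space
$text =~ s/\x{200B}//g;     # ZERO WIDTH SPACE is dropped
```

Note that decode() with a CHECK argument may modify $bytes in place, which is why the byte-level substitution is done on a copy first.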

Of course, you need to encode your output, too. binmode(STDOUT, ":encoding(UTF-8)") converts Perl's characters into a valid UTF-8 stream.
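For the output side, a minimal sketch with a made-up sample string. It uses the strict spelling "UTF-8"; the lowercase "utf8" spelling is Perl's more permissive variant.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{E9} au lait";    # decoded characters, not bytes

# For a whole filehandle, an I/O layer encodes on every print:
binmode(STDOUT, ':encoding(UTF-8)');
print $text, "\n";

# For a single string (e.g. before handing it to an API that wants bytes):
my $bytes = encode('UTF-8', $text);
```

Without the layer, printing a string that contains characters above U+00FF triggers a "Wide character" warning, and characters in the Latin-1 range go out as single bytes rather than UTF-8.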
