Re^3: Safely removing Unicode zero-width spaces and other non-printing characters

by haukex (Archbishop)
on Dec 04, 2019 at 19:21 UTC


in reply to Re^2: Safely removing Unicode zero-width spaces and other non-printing characters
in thread Safely removing Unicode zero-width spaces and other non-printing characters

The source of the data is a large number of RSS feeds, which point to an even larger number of individual web pages.

Well, RSS is XML, and XML files should specify the encoding in the XML declaration, and XML parsers such as XML::LibXML do respect that declaration. However, it's possible that the XML declaration is missing or incorrect. In cases like that, one thing you might try is Encode::Guess, keeping in mind that it's just a guess. Or, if you're getting these feeds from web servers, you might look at the response headers for a hint.
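To illustrate the Encode::Guess suggestion, here is a minimal sketch using only core modules. The helper name decode_with_guess and the sample byte strings are made up for this example and are not from the thread. Note that guess_encoding returns an encoding object only when exactly one suspect fits; otherwise it returns a plain string describing the problem, so the return value has to be checked with ref.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding by default

# Hypothetical helper: try to decode raw bytes by guessing their encoding.
# Returns (decoded_text, encoding_name) on success,
# or (undef, reason) when the guess fails or is ambiguous.
sub decode_with_guess {
    my ($bytes, @extra_suspects) = @_;

    # The default suspects are ASCII, UTF-8 and BOM-marked UTF-16/32;
    # anything else (e.g. 'iso-8859-1') must be named explicitly.
    my $enc = guess_encoding($bytes, @extra_suspects);

    # On failure, guess_encoding returns a message string, not an object.
    return (undef, $enc) unless ref $enc;

    return ($enc->decode($bytes), $enc->name);
}

# Sample input (assumed for illustration): "café au lait" as UTF-8 bytes.
my ($text, $name) = decode_with_guess("caf\xc3\xa9 au lait");
print defined $text
    ? "decoded as $name\n"
    : "could not guess: $name\n";
```

Keep in mind that the more suspects you add, the more often the result is ambiguous: valid UTF-8 bytes are also valid Latin-1, so passing 'iso-8859-1' as a suspect for UTF-8 data yields an "X or Y" message rather than an answer.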

Re^4: Safely removing Unicode zero-width spaces and other non-printing characters
by mldvx4 (Friar) on Dec 05, 2019 at 05:33 UTC

    Yes, the RSS reads fine of course.

    The problem is with the pages that the RSS points to. HTML and XHTML are a hot mess. Even when a respectable CMS is used, the authors can still paste in something weird. It looks like I may have to treat each site individually, and making individual filters might not be worth the effort. However, I am hoping for an automated way to normalize incoming text.

      I am hoping for an automated way to normalize incoming text.

      Well, my suggestions for guessing encoding still apply, plus looking at the meta tags in the HTML might help (with the same caveat that it might be wrong). But again, for specific help with the specific issue that you wrote about in the root node, you'll have to show us some debug output.
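As a rough sketch of the "response headers and meta tags" hints, the following core-only code collects the charset labels a page declares and returns the first one that Encode recognizes. The helper name declared_charset, the 4096-byte look-ahead, and the regular expressions are assumptions made for this illustration. A regex is a crude stand-in for a real HTML parser, and a declared charset can still be wrong.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: return Encode's canonical name for the first declared
# charset it recognizes, checking the HTTP Content-Type header value first
# and <meta> tags second. Returns nothing if no usable label is found.
sub declared_charset {
    my ($content_type, $html_bytes) = @_;
    my @candidates;

    # Header value such as "text/html; charset=ISO-8859-1"
    push @candidates, $1
        if defined $content_type
        && $content_type =~ /charset\s*=\s*["']?([\w.:-]+)/i;

    # Only scan the start of the document, where <meta> tags should appear.
    my $head = substr($html_bytes, 0, 4096);

    # Covers both <meta charset="..."> and the older
    # <meta http-equiv="Content-Type" content="text/html; charset=...">
    while ($head =~ /<meta\b[^>]*?\bcharset\s*=\s*["']?([\w.:-]+)/gi) {
        push @candidates, $1;
    }

    for my $label (@candidates) {
        my $enc = find_encoding($label) or next;   # skip unknown labels
        return $enc->name;
    }
    return;
}

my $charset = declared_charset(
    'text/html; charset=ISO-8859-1',
    '<html><head><title>example</title></head></html>',
);
print "declared: ", (defined $charset ? $charset : 'none'), "\n";
```

When nothing usable is declared, falling back to Encode::Guess is one option. If the pages are fetched with LWP, the decoded_content method of HTTP::Response performs similar header and meta-tag sniffing, as far as I recall, so it may be worth trying before writing anything by hand.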
