PerlMonks  

Re^2: How to deal with malformed utf8 from XML parsing

by ribasushi (Monk)
on Jan 09, 2008 at 20:12 UTC (#661488)


in reply to Re: How to deal with malformed utf8 from XML parsing
in thread How to deal with malformed utf8 from XML parsing

print "$str\n\n"; is a mistake, though. You shouldn't print unicode text data without specifying an output :encoding on the filehandle, or encode()ing it manually.

I thought I could print Unicode to STDOUT. I will read more on that.


By the way, instead of the confusing, error-prone, and tedious process of figuring out the internal state of a variable using is_utf8 and a normal print, please use Devel::Peek instead.

If you pay closer attention you will see that I am using the validating capability of is_utf8 ($string, 'true_value').


I verified your claim, and indeed it seems that this is not malformed UTF-8. The reason I started digging is that Google complained this is not a valid character. I will troubleshoot more.

Thank you!


Re^3: How to deal with malformed utf8 from XML parsing
by Juerd (Abbot) on Jan 09, 2008 at 22:11 UTC

    I thought I could print Unicode to STDOUT. I will read more on that.

    See perlunitut. Filehandles work with bytes, not characters.
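
    A minimal sketch of the two options mentioned earlier in the thread (encode()ing manually, or setting an :encoding layer on the filehandle). The sample string and the choice of UTF-8 as the output encoding are assumptions made for illustration:

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # A character string holding one character, U+00E9 (e with acute accent).
    my $str = "\x{E9}";

    # Option 1: encode manually; the result is a byte string.
    my $bytes = encode('UTF-8', $str);
    printf "characters: %d, bytes: %d\n", length($str), length($bytes);

    # Option 2: put an encoding layer on the filehandle, then print characters.
    binmode(STDOUT, ':encoding(UTF-8)');
    print "$str\n";
    ```

    The printf line reports 1 character and 2 bytes, which is the character/byte distinction perlunitut describes.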

    If you pay closer attention you will see that I am using the validating capability of is_utf8 ($string, 'true_value').

    It checks the INTERNAL BYTE BUFFER of the Unicode string. It is an internal consistency check, and should only be used to verify Perl's internal functioning, not your own strings. Apparently the [INTERNAL] in the documentation is not clear enough, given the huge number of people who don't realise that it is an internal function. I'll see if I can get that changed.
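
    One possible alternative for validating your own data, sketched here under assumptions: if the input arrives as raw bytes, decoding it with Encode::decode and the FB_CROAK check dies on malformed UTF-8. The helper name is_valid_utf8_bytes is hypothetical, not something from this thread or from Encode itself:

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, else 0.
    sub is_valid_utf8_bytes {
        my ($bytes) = @_;
        # decode() may modify its input when a CHECK value is given,
        # so work on a copy.
        my $copy = $bytes;
        return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
    }

    # Well-formed encoding of U+00E9.
    print is_valid_utf8_bytes("\xC3\xA9") ? "valid\n" : "invalid\n";
    # A lead byte followed by a byte that is not a continuation byte.
    print is_valid_utf8_bytes("\xC3\x28") ? "valid\n" : "invalid\n";
    ```

    This checks the bytes you received, not the internal state of a Perl string, which is why it does not involve is_utf8 at all.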

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      Well, today was definitely a fruitful day. I learned about C0/C1 control codes, which were the reason Google complained. I also realized where this stuff actually comes from (someone pasting a mis-encoded chunk of text into a browser window). Finally, I now know not to use is_utf8 anymore :)
      Thank you for your comments.

      P.S. How I ended up fixing this:
      $_ =~ s/[\x{80}-\x{9F}]/\x{FFFD}/g;
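
      A small self-contained sketch of that substitution applied to a sample string. The sample text is made up for illustration, and the sketch assumes the string has already been decoded into characters; on a string of undecoded UTF-8 bytes the same character class would also match continuation bytes. Note that the range covers the C1 control codes (U+0080 to U+009F) only, not C0:

      ```perl
      use strict;
      use warnings;

      # Sample character string containing one C1 control code (U+0085).
      my $text = "foo\x{85}bar";

      # Replace every C1 control code with U+FFFD (REPLACEMENT CHARACTER),
      # leaving the original string untouched.
      (my $clean = $text) =~ s/[\x{80}-\x{9F}]/\x{FFFD}/g;

      # Show the code point that now sits where the control code was.
      printf "U+%04X\n", ord(substr($clean, 3, 1));
      ```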
