How to deal with malformed utf8 from XML parsing

by ribasushi (Monk)
on Jan 09, 2008 at 19:20 UTC
ribasushi has asked for the wisdom of the Perl Monks concerning the following question:

Greetings honorable monks!

I have a problem where XML parsing returns an apparently malformed utf8 string. I am attaching an example script, complete with the relevant data and the URL the data was taken from (the XML is generated by Yahoo). My questions are:
  • How can I detect that there is something wrong with the string? I expected Encode::is_utf8 ($str, 1) to return false, since the string should fail the well-formedness check.
  • If I detect a malformed string, how can I sanitize it? For instance, how can I replace the offending byte sequence with a single '?' or something similar?
Thank you for your help!

use warnings;
use strict;
use Encode;
use HTML::Entities;
use XML::Twig;

my $objinfo_parser = XML::Twig->new (
    twig_handlers => {
        'Product[@Id = "most3usb20on"]' => \&_test_handler
    }
);

$objinfo_parser->parse (decode_entities (join '', <DATA>) );
#$objinfo_parser->parseurl ('http://www.3btech.net/objinfo.xml');

sub _test_handler {
    my ($twig, $elt) = @_;
    my $str = substr ($elt->field ('Description'), -12);

    print "The string below passes a utf8 well-formedness test, why?\n"
        if Encode::is_utf8 ($str, 1);
    print "$str\n\n";

    for (unpack ('U*', $str)) {
        printf "0x%X\n", $_;
    }
    print "\n";
    exit;
}

# The xml snippet below is taken directly from the above url
# (the url will take about 15s to download and parse)
# HTML encoded to preserve offending characters
__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<Product Id="most3usb20on">
<Description>Moving Star 3.5" USB 2.0 One Button Backup Aluminum Hard Drive Enclosure &#xC2;&#x96; Black</Description>
</Product>

Re: How to deal with malformed utf8 from XML parsing
by Juerd (Abbot) on Jan 09, 2008 at 20:02 UTC

    I don't see the problem here. You have \xC2\x96 (2 bytes) in your XML data, which is the correct UTF-8 encoding for U+0096, and indeed your code's output properly shows that:

    0x73
    0x75
    0x72
    0x65
    0x20
    0x96 <-- there it is
    0x20
    0x42
    0x6C
    0x61
    0x63
    0x6B

    print "$str\n\n"; is a mistake, though. You shouldn't print unicode text data without specifying an output :encoding on the filehandle, or encode()ing it manually.
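
A minimal sketch of the two options named above, assuming UTF-8 is the encoding the output should use. The sample string is a hypothetical stand-in for $str, not data from the original script:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical character string containing U+2013 EN DASH.
my $str = "sure \x{2013} Black";

# Option 1: encode the characters to bytes yourself, then print the bytes.
my $bytes = encode('UTF-8', $str);
print $bytes, "\n";

# Option 2: put an encoding layer on the filehandle once, then print
# character strings directly. Do not also encode() by hand after this,
# or the text will be encoded twice.
binmode(STDOUT, ':encoding(UTF-8)');
print $str, "\n";
```

Either way the filehandle ends up receiving bytes; the only difference is who performs the encoding step.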

    By the way, instead of the confusing, error-prone, and tedious process of figuring out the internal state of a variable using is_utf8 and a normal print, please use Devel::Peek instead. Its Dump function, called with your $str, would output:

    SV = PV(0x8641d2c) at 0x82ea98c
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x8631050 "sure \302\226 Black"\0 [UTF8 "sure \x{96} Black"]
      CUR = 13
      LEN = 16

    As you can see, the "UTF8 flag" is on. The logical unicode string is within [UTF8 ... ], after the representation of the internal byte buffer.
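
For reference, a small sketch of how such a dump can be produced. The string here is built by decoding the same two bytes by hand, which is an assumption standing in for the value XML::Twig returns; addresses and some flags will differ between perl versions:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek;

# Decode the UTF-8 bytes C2 96 into the single character U+0096.
my $str = decode('UTF-8', "sure \xC2\x96 Black");

# Dump writes the internal representation of the scalar to STDERR.
Dump($str);
```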

    Now, of course, the usefulness of the character U+0096, "START OF GUARDED AREA" is a rather different story. It probably is the result of mis-interpreting Windows-1251 data as ISO-8859-1 data. Windows-1251's 0x96 character is U+2013, "EN DASH", not U+0096.
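
The effect can be reproduced in a few lines. This is only a sketch: it uses Encode's cp1252 table as an assumed source encoding (byte 0x96 is an en dash there as well), and fix_c1 is a hypothetical helper name, not something from the original script. It also covers the second question, since the same substitution could insert a plain '?' instead:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\x96";

# The same byte, interpreted under two different encodings.
my $as_latin1 = decode('ISO-8859-1', $byte);   # control character U+0096
my $as_cp1252 = decode('cp1252',     $byte);   # U+2013 EN DASH

# Hypothetical repair helper: treat every C1 control character
# (U+0080 to U+009F) as a byte that should have been read as cp1252.
sub fix_c1 {
    my ($text) = @_;
    $text =~ s/([\x{80}-\x{9F}])/decode('cp1252', pack('C', ord $1))/ge;
    return $text;
}

# Blunter alternative: replace each C1 control with a question mark.
sub blank_c1 {
    my ($text) = @_;
    $text =~ s/[\x{80}-\x{9F}]/?/g;
    return $text;
}

my $fixed   = fix_c1("sure \x{96} Black");
my $blanked = blank_c1("sure \x{96} Black");
```

Whether guessing cp1252 is appropriate depends on where the data really came from; the '?' replacement makes no such guess but loses the character.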

      print "$str\n\n"; is a mistake, though. You shouldn't print unicode text data without specifying an output :encoding on the filehandle, or encode()ing it manually.

      I thought I could print unicode to STDOUT. I will read more on that.


      By the way, instead of the confusing, error-prone, and tedious process of figuring out the internal state of a variable using is_utf8 and a normal print, please use Devel::Peek instead

      If you pay closer attention you will see that I am using the validating capability of is_utf8 ($string, 'true_value').


      I verified your claim, and indeed this does not seem to be malformed UTF-8. The reason I started digging is that Google complains that this is not a valid character. I will troubleshoot more.

      Thank you!

        I thought I could print unicode to STDOUT. I will read more on that.

        See perlunitut. Filehandles work with bytes, not characters.

        If you pay closer attention you will see that I am using the validating capability of is_utf8 ($string, 'true_value').

        It checks the INTERNAL BYTE BUFFER of the unicode string. It is an internal consistency check, and should only be used to verify Perl's internal functioning, not your own strings. Apparently the [INTERNAL] in the documentation is not clear enough, given the huge number of people who don't realise that it is an internal function. I'll see if I can get that changed.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
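
For checking bytes that arrive from outside the program, as opposed to Perl's internal buffer, one common approach is a strict decode. The sketch below assumes the input is a byte string; decode_strict is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded character string, or undef
# when $octets is not well-formed UTF-8.
sub decode_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps it
    # from modifying $octets while it works.
    my $chars = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars;
}

my $good = decode_strict("sure \xC2\x96 Black");   # C2 96 is valid UTF-8 for U+0096
my $bad  = decode_strict("sure \x96 Black");       # lone continuation byte

print defined $good ? "first input: well-formed\n"  : "first input: malformed\n";
print defined $bad  ? "second input: well-formed\n" : "second input: malformed\n";
```

Note that the string from the original question passes this check too: it is well-formed UTF-8 that happens to encode an unhelpful control character.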
