PerlMonks  

XML Parser not well-formed

by existem (Sexton)
on Nov 02, 2004 at 16:25 UTC (#404658, perlquestion)
existem has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I get this error when trying to load an XML file I've been provided by a website.

not well-formed (invalid token) at line 233, column 28, byte 11073 at C:/Perl/site/lib/XML/Parser.pm line 187

Now I've done a bit of playing around and I think this is because the XML contains a funny character. Namely, this one:

In my research to solve this problem I also came across this forum topic, which talks about the same problem when loading XML files with accented foreign characters (e.g. French, German): http://f2o.org/forum/index.php?showtopic=1829

So can anyone help me force XML::Parser not to crash and to accept these characters?

Ideally I would fix the XML stream, but that will probably take years to happen when relying on other people :-)

Thanks guys
Tom

Re: XML Parser not well-formed
by mirod (Canon) on Nov 02, 2004 at 17:13 UTC

    No, you can't force expat (the library used by XML::Parser) to accept this character. The XML spec is very clear: a parser must NOT accept anything that is not well-formed.

    What you can do, though, is figure out which encoding your data is in, and work from there. It is most likely not UTF-8 (or XML::Parser would have been happy with it), but probably ISO-8859-1 or one of the Windows encodings.

    Have a look at the Perl-XML FAQ and the Encode::Guess module.
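
    The "figure out the encoding and work from there" step could be sketched roughly as follows, using only the core Encode module. The sub names are made up for illustration, and treating non-UTF-8 input as Windows-1252 (cp1252) is an assumption about this particular feed, not something the thread confirms.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string decodes as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying its argument.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

# Hypothetical helper: return UTF-8 bytes. Input that is already valid
# UTF-8 is returned unchanged; anything else is ASSUMED to be cp1252
# and is re-encoded as UTF-8.
sub to_utf8_bytes {
    my ($bytes) = @_;
    return $bytes if is_valid_utf8($bytes);
    return encode('UTF-8', decode('cp1252', $bytes));
}
```

    The result can then be handed to XML::Parser, since it now matches the encoding="UTF-8" declaration. Note that this converts the whole input in one go, so it is only appropriate when the file is consistently in one encoding; a file that mixes valid UTF-8 with a few stray Windows bytes would be mangled by it.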

      Thanks for the reply.

      Interestingly, the XML does claim to be UTF-8. Here is the first element in the file:

      <?xml version="1.0" encoding="UTF-8"?>
      <OFFERS>
        <OFFER>
          <OUR_ID>2752</OUR_ID>
          <FTA>PTHREE81200013</FTA>
          <MERCHANTCATEGORY>LG</MERCHANTCATEGORY>
          <NAME>LG U8120</NAME>
          <BRAND>LG</BRAND>
          <MODEL>U8120</MODEL>
          <PROMOTIONTEXT>Only 15.00 Per Month!</PROMOTIONTEXT>
          <DESCRIPTION>This latest video phone has stylish looks, rotating video camera and photo address book. With a range of revolutionary features, the LG 8120 is &lt;b&gt; 'the'&lt;/b&gt; video mobile to be seen with.</DESCRIPTION>
          <NETWORK>Three</NETWORK>
          <RENTAL>15.00</RENTAL>
          <FREE_OFF_PEAK_MINS>0</FREE_OFF_PEAK_MINS>
          <FREE_CN_MINS>500</FREE_CN_MINS>
          <FREE_ANYTIME_MINS>0</FREE_ANYTIME_MINS>
          <FREE_SMS>100</FREE_SMS>
          <TAR>Talk &amp; Text 600 Special</TAR>
          <ADDINFO> FREE Promo 12 Months Half Price Line Rental, 3 Months FREE insurance and <b>12 Months Reduced Line Rental - Only GBP15 Per Month</b> </ADDINFO>
          <DEEPLINK>http://www3.fonetasticmobile.co.uk/1/showphones.php?item=2752</DEEPLINK>
          <SMALLIMAGE>http://www3.fonetasticmobile.co.uk/img/phones/bestdeal_u8120.jpg</SMALLIMAGE>
          <BIGIMAGE>http://www3.fonetasticmobile.co.uk/img/phones/big_u8120.jpg</BIGIMAGE>
          <PRICE>0</PRICE>
        </OFFER>
      </OFFERS>

      I suspect the feed is just not as good as it should be.

      I have actually got around the problem by processing the file manually before loading it with XML::Parser, and removing the dodgy characters.

      Thanks,
      Tom.

        "I have actually got around the problem by processing the file manually before loading it with XML::Parser, and removing the dodgy characters."

        That's one way to do it. You could probably figure out which encoding they are in and replace them with the proper UTF-8 characters. My guess is that some of the text, like the DESCRIPTION, is entered through either a web form or a word processor, so it should be possible to find out what encoding is used.

Re: XML Parser not well-formed
by existem (Sexton) on Nov 02, 2004 at 18:05 UTC

    Along the same lines, I seem to be having a problem with loading very large XML files into memory. The entire file contains in the region of 500 elements like the one described above, but my script seems to hang halfway through loading it with XMLin.

    I have tried to dump out the result after loading it, but I get a truncated result.

    Is there a limit to the size of the XML file I can load up?

    Thanks,
    Tom

      The limit is a function of the amount of memory you have on your system.

      <plug type="shameless">If the whole document doesn't fit in memory, you can play with XML::Twig to load parts of it, process them and then free the memory before processing the next chunk. And of course, if you are used to the XML::Simple interface, you can use the simplify method on any element to get the same structure that XMLin would have given you.</plug>
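
      A minimal sketch of that chunk-by-chunk approach, assuming XML::Twig is installed from CPAN. The sub name collect_offers and the choice to keep only OUR_ID and NAME are illustrative; the element names come from the sample posted above.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse an XML string one <OFFER> at a time and
# keep only a small hash per offer instead of the whole tree.
sub collect_offers {
    my ($xml) = @_;
    my @offers;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called each time a complete <OFFER> element has been parsed.
            OFFER => sub {
                my ($t, $offer) = @_;
                push @offers, {
                    id   => $offer->first_child_text('OUR_ID'),
                    name => $offer->first_child_text('NAME'),
                };
                # Free the memory used by everything parsed so far.
                $t->purge;
            },
        },
    );
    $twig->parse($xml);
    return \@offers;
}
```

      For a file on disk, $twig->parsefile($path) can be used in place of parse. Because each OFFER is purged once handled, memory use should stay roughly constant regardless of how many offers the feed contains.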

Re: XML Parser not well-formed
by gmpassos (Priest) on Nov 02, 2004 at 22:35 UTC
    Take a look at XML::Smart and its parser for wild XML, XML::Smart::HTMLParser, which has the same interface as XML::Parser.

    Graciliano M. P.
    "Creativity is the expression of liberty".

Re: XML Parser not well-formed
by bart (Canon) on Nov 04, 2004 at 14:13 UTC
    That character looks like one of Microsoft's additions to ISO-Latin-1, somewhere in the range 128-159.

    Don't do that. Your XML is not well-formed because of it. Please don't try to patch the XML parser to accept it; you're making life harder for everybody. XML parsers mercilessly rejecting malformed XML is a feature, forcing people to produce proper XML. Guesswork isn't doing anybody any good.

    Instead, replace it in the XML file with the proper Unicode character in the proper character encoding (UTF-8?), or with a numeric character reference. It ought to work then.

    You can find the equivalent character code (in hex) in that table I linked to, and it would seem to me that this is the one:

    0x92	0x2019	#RIGHT SINGLE QUOTATION MARK
    

    So "&#8217;" ought to do it. Test: "’"

    P.S. Actually, you should get the source of the data to fix it; they did not do a proper job.
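
    The replacement described above could be automated along these lines. This is only a sketch: the sub name is made up, the UTF-8 pattern is deliberately loose, and it assumes that any byte which is not part of a well-formed UTF-8 sequence is a Windows-1252 character such as the 0x92 shown in the table row.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Loose pattern for one well-formed UTF-8 sequence. It does not reject
# every overlong or surrogate form, which is acceptable for a sketch.
my $UTF8_SEQ = qr/[\x00-\x7F]
                 |[\xC2-\xDF][\x80-\xBF]
                 |[\xE0-\xEF][\x80-\xBF]{2}
                 |[\xF0-\xF4][\x80-\xBF]{3}/x;

# Hypothetical helper: leave valid UTF-8 sequences alone and turn each
# stray high byte into a numeric character reference, ASSUMING the
# stray byte is cp1252.
sub stray_bytes_to_refs {
    my ($bytes) = @_;
    $bytes =~ s{($UTF8_SEQ)|([\x80-\xFF])}{
        defined $1 ? $1 : sprintf('&#%d;', ord(decode('cp1252', "$2")))
    }gex;
    return $bytes;
}
```

    With this, a stray 0x92 byte becomes "&#8217;", while text that is already valid UTF-8 passes through untouched. It is still a workaround on the consuming side; the feed itself remains broken until its producer fixes it.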
