PerlMonks  

XML::RSS

by alexg (Beadle)
on Mar 21, 2003 at 10:58 UTC ( [id://244827]=perlquestion )

alexg has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm writing a quick and dirty RSS feed aggregator and I'm getting very frustrated with one tiny problem. Occasionally the RSS XML docs contain characters which cause XML::Parser to choke.

The XML::Parser error message generated is:
"not well-formed (invalid token) at line 15, column 19, byte 530"

When I look at byte 530 it turns out to be the 'é' in Nescafé. Other exotic characters also cause XML::Parser to stop dead. I've tried the nice_string function from the Unicode man page:

sub nice_string {
    join("",
      map { $_ > 255 ?                  # if wide character...
            sprintf("\\x{%04X}", $_) :  # \x{...}
            chr($_) =~ /[[:cntrl:]]/ ?  # if control character
            sprintf("\\x%02X", $_) :    # \x..
            chr($_)                     # else as themselves
      } unpack("U*", $_[0]));           # unpack Unicode
}

but this enrages XML::Parser even further and it fails at the first end-of-line character. I'm using LWP::Simple to grab the XML so my script essentially looks like this:

my $rss = new XML::RSS;
eval { $rss->parse(nice_string(get($url))); };

Can anyone recommend a module/function that will reliably sanitise the string that get() returns, in an encoding suitable for XML::Parser?

PS I've looked at the 'encoding' option for XML::Parser but it doesn't seem to change the result :(
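
The end-of-line failure described above probably comes from nice_string itself: it is a display helper, and it rewrites every control character, newline included, as a literal escape sequence. A minimal sketch of the effect, assuming a two-line document (the sample string is made up for illustration):

```perl
use strict;
use warnings;

# nice_string as quoted in the question, laid out more compactly.
sub nice_string {
    join("",
        map { $_ > 255 ? sprintf("\\x{%04X}", $_)
            : chr($_) =~ /[[:cntrl:]]/ ? sprintf("\\x%02X", $_)
            : chr($_)
        } unpack("U*", $_[0]));
}

# Made-up sample: an XML declaration, a newline, then a root element.
my $xml = qq{<?xml version="1.0"?>\n<rss/>};
print nice_string($xml), "\n";
# prints <?xml version="1.0"?>\x0A<rss/>
```

The newline has become the four literal characters \x0A sitting outside the root element, where only whitespace, comments and processing instructions are allowed, so the parser stops there.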

Replies are listed 'Best First'.
•Re: XML::RSS
by merlyn (Sage) on Mar 21, 2003 at 11:36 UTC
    Sounds to me like the RSS is incorrectly claiming that it is UTF-8 when in fact it is Latin-1. That's a common problem when people are writing RSS with hand-rolled ad-hoc tools instead of formal DOM tools.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.
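
If the diagnosis above is right (bytes are Latin-1 but the feed declares UTF-8 or declares nothing), one possible workaround is to re-encode before parsing. This is only a sketch: to_utf8_bytes is a hypothetical helper name, and it assumes that anything which is not valid UTF-8 is really ISO-8859-1. It uses the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return UTF-8 bytes whatever the feed really used.
# Assumption: input that is not valid UTF-8 is treated as ISO-8859-1.
sub to_utf8_bytes {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    $text = decode('ISO-8859-1', $bytes) unless defined $text;
    return encode('UTF-8', $text);
}

my $fixed = to_utf8_bytes("Nescaf\xE9");    # Latin-1 e-acute becomes \xC3\xA9
```

This only helps when the declared encoding is UTF-8 (or absent); a feed that declares ISO-8859-1 would then disagree with its own bytes.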

      Ok, I've done some more reading - let me see if I've got this right:

      LWP::Simple returns HTML which is encoded in ISO-8859-1 (Latin-1).
      XML::Parser defaults to UTF-8 (unicode) if the XML does not specify an encoding.

      So using:
      my $rss = new XML::RSS( 'encoding' => 'ISO-8859-1' );

      should work.

      It doesn't.

        Specifying the encoding like that is only used when creating an RSS stream using the library. It won't affect a parsed input stream's encoding.

        --rjray

Re: XML::RSS
by zby (Vicar) on Mar 21, 2003 at 11:29 UTC
    Have you tried to just set the correct encoding in the XML? Something like this:  <?xml version='1.0' encoding='ISO-8859-1'?>.
      Yeah the XML docs all claim to be ISO-8859-1, even though they don't seem to be...
        E acute is in ISO-8859-1, so perhaps it is not encoded correctly? From the XML::Parser documentation: "The built-in encodings are: UTF-8, ISO-8859-1, UTF-16, and US-ASCII." So it is not the case that the parser cannot handle ISO-8859-1.

        By the way, when you make it UTF-8 and it still claims to be ISO-8859-1, you don't fix anything.
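
The suggestion in this sub-thread, making the XML declaration state the encoding the bytes are really in, could be sketched as below. declare_encoding is a hypothetical helper, not part of XML::RSS or XML::Parser; it only changes what the document claims, so the bytes must already be in the named encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: make the XML declaration claim $encoding.
# Assumption: any declaration present sits at the very start of $xml.
sub declare_encoding {
    my ($xml, $encoding) = @_;

    # Declaration already has an encoding: replace its value.
    return $xml
        if $xml =~ s/\A(<\?xml\b[^?]*?\bencoding\s*=\s*)(["'])[^"']*\2/$1$2$encoding$2/;

    # Declaration has only a version: add the encoding after it.
    return $xml
        if $xml =~ s/\A(<\?xml\s+version\s*=\s*(["'])[^"']*\2)/$1 encoding="$encoding"/;

    # No declaration at all: prepend one.
    return qq{<?xml version="1.0" encoding="$encoding"?>\n} . $xml;
}
```

For example, declare_encoding($xml, 'ISO-8859-1') on a feed that wrongly declares UTF-8 leaves the content untouched and changes only the first line.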

Re: XML::RSS
by AnthonyLewis (Novice) on Mar 23, 2003 at 07:40 UTC
    I'm currently working on a content management system that allows users to enter text that is stored in XML files. Everything worked great until a Spanish speaking user tried it.

    I stole this snippet from the Perl-XML FAQ at http://perl-xml.sourceforge.net/faq/

    $str =~ s/([^\x0A-\x7F])/'&#' . ord($1) . ';'/gse;
    Run your data through this before sending it to the parser and things might work better...
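
To see what that substitution does, here is a small sketch wrapping it in a hypothetical function, escape_non_ascii. It assumes the input is raw ISO-8859-1 bytes, so that ord() of each byte is also its Unicode code point.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
# Assumption: $str holds ISO-8859-1 bytes, one byte per character.
sub escape_non_ascii {
    my ($str) = @_;
    # Replace everything outside \x0A-\x7F with a numeric character reference.
    $str =~ s/([^\x0A-\x7F])/'&#' . ord($1) . ';'/gse;
    return $str;
}

print escape_non_ascii("Nescaf\xE9"), "\n";    # prints Nescaf&#233;
```

The result is pure ASCII, so it parses the same whichever encoding the feed declares. If the bytes are actually UTF-8, each byte of a multi-byte character is escaped separately and the text comes out garbled, so the assumption matters.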
Re: XML::RSS
by ajt (Prior) on Mar 24, 2003 at 21:40 UTC

    alexg,

    Malformed XML is the bane of RSS. According to Mark Pilgrim about 10% of typical RSS feeds are malformed*; indeed, the UK IT publication The Register has usable XML for only a few days in a given month.

    You will find a wide range of problems that will cause XML::Parser, the core of XML::RSS, to explode:

    • Data encoded in one format, but declared in another (or in default UTF-8).
    • Junk before the XML declaration. The CMS Vignette tends to do this, and it's popular with big companies.
    • Badly nested tags. The CMS is sloppy about well-formedness checking, so the markup comes out broken and goes into the RSS feed that way.
    • Improperly escaped ampersands and entities are a very common problem too.
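
Two of the problems in the list above (junk before the declaration and bare ampersands) can be patched with plain substitutions before parsing. The following is a rough sketch only; tidy_feed is a hypothetical name and this is not the code used by rssmirror or XML::RSS::Tools. It does nothing about encoding mismatches or badly nested tags.

```perl
use strict;
use warnings;

# Hypothetical pre-parse cleanup for a feed held in $xml.
sub tidy_feed {
    my ($xml) = @_;

    # Drop anything before the first '<' (stray whitespace, CMS debris).
    $xml =~ s/\A[^<]+//;

    # Escape any '&' that does not start a character or entity reference.
    # Named references such as &eacute; are left alone; the parser will
    # still reject them unless the document declares them.
    $xml =~ s/&(?!(?:#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_][A-Za-z0-9_.-]*);)/&amp;/g;

    return $xml;
}
```

Usage would be along the lines of $rss->parse(tidy_feed(get($url))), inside the same eval as before.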

    In this node "How do I clean RSS feeds to make them usable?", Matts suggested his rssmirror, the guts of which are now included in both XML::RSS and XML::RSS::Tools.

    I became so annoyed with bad XML in RSS feeds that I wrote XML::RSS::Tools to deal with the problems I found, which led to brian d foy taking over XML::RSS, fixing a lot of its problems and, with time, designing a whole new version.

    See also:

    Good Luck!

    * Parsing RSS At All Costs


    --
    ajt

Node Type: perlquestion [id://244827]
Approved by adrianh
Front-paged by data64