PerlMonks  

Why XML not well formed?

by nan (Novice)
on Jun 30, 2005 at 12:54 UTC ( [id://471291]=perlquestion )

nan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I encountered a really weird error while using a Perl script to search a big XML/RDF file (225MB) from a CGI program.

I don't know what's wrong with it, as the script works perfectly with a sample XML/RDF file (only 4K) that has the same format as the big one; as far as I can tell, the only difference is the file size. I attached the error message below for your reference.

Software error:
not well-formed (invalid token) at line 221, column 97, byte 12020 at C:/Perl/site/lib/XML/Parser.pm line 187

I have searched through the perl-xml Q&As, and it seems that I don't have any of the problems they list, such as double quotation marks or incorrect encoding declarations. So could anyone please offer some suggestions?

many many thanks,

<?xml version='1.0' encoding='UTF-8' ?>
<RDF>
<Topic r:id="Top/Arts/Movies/Titles/1/10_Rillington_Place">
<link r:resource="http://www.britishhorrorfilms.co.uk/rillington.shtml"/>
<link r:resource="http://www.shoestring.org/mmi_revs/10-rillington-place.html"/>
<link r:resource="http://www.tvguide.com/movies/database/ShowMovie.asp?MI=22983"/>
<link r:resource="http://us.imdb.com/title/tt0066730/"/>
</Topic>
</RDF>

nan

Replies are listed 'Best First'.
Re: Why XML not well formed?
by davorg (Chancellor) on Jun 30, 2005 at 13:07 UTC

    Have you taken a close look at line 221, column 97? That's where your problem is.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      And since you have the byte offset in the file you can explicitly print the offending portion of the input with some context thusly:

      perl -le 'open(X,shift())or die "$!";seek(X, 12000, 0)or die "$!";read X, $b, 40;print $b, "\n", " " x 19, "^\n"' foo.xml

      Update: Oops, I see by the paths mentioned you're on Wintendo; you'll probably want to adjust the quotes on that or store it into a file and run that (or just get a real shell and/or OS . . . :)
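For readers who would rather store this in a file, as suggested above, here is a minimal sketch of the same idea using only core Perl. The sub name `context_at`, the script name `context.pl`, and the 20-byte window are illustrative choices, not part of the original one-liner. Depending on how the parser counts bytes, the caret may land one position off.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the bytes around byte $offset of $file (up to $before bytes
# before it and $after bytes from it onward), plus the position of
# $offset inside the returned text. Control characters, including
# newlines, are shown as '.' so the caret line stays aligned.
sub context_at {
    my ($file, $offset, $before, $after) = @_;
    my $start = $offset > $before ? $offset - $before : 0;
    my $len   = ($offset - $start) + $after;

    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    seek($fh, $start, 0)        or die "Cannot seek in $file: $!";
    my $got = read($fh, my $buf, $len);
    defined $got                or die "Cannot read $file: $!";
    close $fh;

    $buf =~ tr/\x00-\x1f/./;
    return ($buf, $offset - $start);
}

# Run only when invoked directly, e.g.: perl context.pl foo.xml 12020
if (!caller() && @ARGV == 2) {
    my ($file, $offset) = @ARGV;
    my ($text, $caret) = context_at($file, $offset, 20, 20);
    print "$text\n", ' ' x $caret, "^\n";
}
```

Because the file is opened in raw mode and only a small window is read, this should behave the same on Windows and on a 225MB file as on a small sample.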

      --
      We're looking for people in ATL

      Hi guys,

      Thank you for your quick replies. I think I found what's wrong inside the XML document. It seems that the parser reports an error only when a link contains the character '&'.

      For example, <link r:resource="http://www.urbancinefile.com.au/home/article_view.asp?Article_ID=3801&Section=Reviews"/>

      As I need to read the <link/> elements one by one and compare each attribute value with the user's input, my new question is: how can I overcome this '&' problem? I have tried putting '\' before '&', but it doesn't work.

      Thanks again,

      Nan

        If you are being passed data that contains a raw '&' character that hasn't been converted to '&amp;' then you aren't being passed valid XML and no XML parser will be able to deal with it.

        You should ask your data provider to fix their processes so that they _do_ send you valid XML.
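As a rough illustration of why the escaped form is what you want: once the source contains '&amp;', XML::Parser hands the attribute value to your handler with a plain '&' again, so it can be compared directly with the user's input. The sketch below assumes XML::Parser is installed; the sub name `link_resources` and the example.com URL are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Collect the r:resource attribute of every <link/> element in an XML
# string. Entities such as &amp; are already decoded by the parser when
# the Start handler sees the attribute values.
sub link_resources {
    my ($xml) = @_;
    my @found;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element, %attr) = @_;
                push @found, $attr{'r:resource'}
                    if $element eq 'link' && defined $attr{'r:resource'};
            },
        },
    );
    $parser->parse($xml);
    return @found;
}

# Hypothetical data: note the &amp; in the source text.
my $xml = '<RDF><Topic r:id="Top/Example">'
        . '<link r:resource="http://example.com/view.asp?Article_ID=3801&amp;Section=Reviews"/>'
        . '</Topic></RDF>';

my $wanted = 'http://example.com/view.asp?Article_ID=3801&Section=Reviews';
print "match\n" if grep { $_ eq $wanted } link_resources($xml);
```

For a file rather than a string, the same parser object can be given the file name through `parsefile` instead of `parse`; the handlers are called as the file is read, which suits a large input.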

        --
        <http://www.dave.org.uk>

        "The first rule of Perl club is you do not talk about Perl club."
        -- Chip Salzenberg

Re: Why XML not well formed?
by mirod (Canon) on Jun 30, 2005 at 13:34 UTC

    As mentioned before, there is probably nothing wrong with the script, just something wrong with the data. Try looking at line 221, column 97, or using the ErrorContext => 1 argument when you create the XML::Parser object, which will display the faulty line.

    A not well-formed (invalid token) error is often found when an ampersand (&) or an opening angle bracket (<) is not escaped in the XML.
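A minimal sketch of the ErrorContext suggestion, assuming XML::Parser is installed. The sub name `xml_error` and the sample strings are illustrative only; the point is that the parser dies on malformed input, so the call is wrapped in eval and the message (with context) is kept.

```perl
use strict;
use warnings;
use XML::Parser;

# Parse an XML string and return the parser's error message, or an
# empty string if the document is well-formed. ErrorContext => 1 asks
# the parser to include the offending line and one line of context on
# either side in the message.
sub xml_error {
    my ($xml) = @_;
    my $parser = XML::Parser->new(ErrorContext => 1);
    return '' if eval { $parser->parse($xml); 1 };
    return $@;
}

# Hypothetical input with a raw ampersand in an attribute value.
my $bad = '<RDF><link r:resource="http://example.com/a.asp?id=1&s=2"/></RDF>';
print xml_error($bad);
```

With a real file, `$parser->parsefile('foo.xml')` can be used inside the eval in place of `parse`.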

      Mirod,

      Yes, you are absolutely right. I think '&' is the key, but I don't know how to overcome it, as I'm new to Perl. Please let me know if you have any ideas. I've tried putting a '\' before '&', but it doesn't work.

      Many thanks, Nan

      You fixed my issue as well. I had "&" characters in my data. Thank you very much.
Re: Why XML not well formed?
by BaldPenguin (Friar) on Jun 30, 2005 at 15:40 UTC
    I would pay special attention to those resource links. As mirod mentioned, the & is a great way to get that error, and it is very common in URLs with parameters.

    Don
    WHITEPAGES.COM | INC

    Edit by castaway: Closed small tag in signature

      Hi Don,

      Yes, you are right, but how can I overcome this '&' problem? I've tried to put '\' before '&', but it doesn't work.

      many thanks!

      Nan

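        You could regex the &, skipping any & that already starts an entity so that an existing &amp; is not escaped twice:
        $line =~ s/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)/&amp;/g;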

        Don
        WHITEPAGES.COM | INC

        Edit by castaway: Closed small tag in signature
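If the data provider cannot fix the file, a one-off cleanup pass before parsing could be sketched as follows. This is a heuristic, not a real XML repair tool: the sub names `escape_bare_ampersands` and `fix_file` are made up here, the pattern leaves alone anything that already looks like an entity or character reference, and it would also rewrite ampersands inside comments or CDATA sections if the file had any.

```perl
use strict;
use warnings;

# Escape every '&' that does not already start something shaped like an
# entity or character reference (&name; &#123; &#x1F;).
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

# Copy $in_file to $out_file one line at a time, so a large file never
# has to fit in memory.
sub fix_file {
    my ($in_file, $out_file) = @_;
    open my $in,  '<', $in_file  or die "Cannot read $in_file: $!";
    open my $out, '>', $out_file or die "Cannot write $out_file: $!";
    while (my $line = <$in>) {
        print {$out} escape_bare_ampersands($line);
    }
    close $in;
    close $out or die "Cannot close $out_file: $!";
}

# Example call with hypothetical file names:
# fix_file('big.rdf', 'big-fixed.rdf');
```

Writing to a new file rather than editing in place keeps the original around in case the heuristic changes something it should not have.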

Node Type: perlquestion [id://471291]
Approved by polettix