http://www.perlmonks.org?node_id=457265

nan has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I'm trying to process a 2GB XML file with XML::Twig (as some people suggested), because XML::Simple can't handle such a huge file. My Perl code runs perfectly on a smaller sample file.

However, when I try to read the 2GB XML file, after several minutes it always shows "The instruction at 0xaddress referenced memory at 0xaddress. The memory could not be written."

So far I'm not sure whether this is because the file is too big to load or because I used too many foreach loops to search for XML elements and their attributes.

Could anyone offer me some help?

Thanks,

Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by dbwiz (Curate) on May 15, 2005 at 19:08 UTC

    2GB XML files are beyond my direct experience with XML, but I can point you to a few places where you may be able to find the right answer:

    • There is an article written by one of the creators of XML, dealing with this kind of problem.
    • Additionally, there was a discussion here on PerlMonks (is XML too hard?) that raised several interesting points.
    • Finally, you may try your hand with XML::TokeParser, a module that came out of the above-mentioned discussion.

    Good luck.

      Hi,

      Thank you so much for the help. I'll try XML::TokeParser later and let you know the result.

      Thanks again,
Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by mirod (Canon) on May 15, 2005 at 20:07 UTC

    You don't give us much to work with. Some code and perhaps the amount of RAM on your system would help.

    If you do a straight XML::Twig->new->parsefile( 'my_big_fat_xml_file.xml');, then the resulting data structure should need somewhere around 20GB. That's why XML::Twig lets you process a file one chunk at a time, and purge the memory when you're done with it.

    The README for the module (at least for the latest version) includes links to lots of resources about the module. You could start by looking at xmltwig.com.
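
    As a minimal sketch of that chunk-at-a-time pattern: the `Topic` element name comes from the sample described elsewhere in this thread, while the inline document, the `id` attribute and the `$count` variable are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Tiny stand-in document (hypothetical content).
my $xml = <<'XML';
<RDF>
  <Topic id="Top/Arts"><Title>Arts</Title></Topic>
  <Topic id="Top/Science"><Title>Science</Title></Topic>
</RDF>
XML

my $count = 0;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <Topic> element has been parsed.
        Topic => sub {
            my ($t, $topic) = @_;
            $count++;
            print $topic->att('id'), "\n";
            $t->purge;    # free everything parsed so far
        },
    },
);

$twig->parse($xml);
print "Topics seen: $count\n";
```

    For the real file, parsefile() with the file name would take the place of parse($xml). Because each handler call ends with purge, memory use should depend on the size of one chunk rather than on the size of the whole file.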

      Hi,

      If you don't mind, please refer to the code and XML segment in my replies to other people. I have 1GB of RAM and my Perl version is the latest. Initially I had a virtual memory error too, but that was solved after I raised virtual memory to the maximum (4GB); now I only get the "memory could not be written" error. Do you think it would be better if I tried Linux?

      Thanks again,
Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by Zaxo (Archbishop) on May 15, 2005 at 18:52 UTC

    Not much info, but the 2GB limit suggests that your perl or OS lacks large file support. With OS support, perl can be recompiled to provide that. Run perl -V on the command line to check.
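
    The same build settings that perl -V prints can also be read from the core Config module. A small sketch, assuming that `uselargefiles` and `lseeksize` are the entries of interest here:

```perl
use strict;
use warnings;
use Config;

# %Config exposes the build-time settings shown by `perl -V`.
my $uselargefiles = $Config{uselargefiles};   # 'define' when large file support is compiled in
my $lseeksize     = $Config{lseeksize};       # size of a file offset in bytes (8 allows files over 2GB)

my $report = sprintf "uselargefiles=%s lseeksize=%s",
    defined $uselargefiles ? $uselargefiles : 'undef',
    defined $lseeksize     ? $lseeksize     : 'undef';

print "$report\n";
```

    On the command line, perl -V:uselargefiles prints just that one setting.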

    After Compline,
    Zaxo

      Hi Zaxo,

      Thank you for the advice. My Perl is v5.8.6 built for MSWin32-x86-multi-thread.

      My XML sample file is shown below:

      Basically, the XML file has two key parallel nodes: <Topic/> and <ExternalPage/>. If a <Topic/> has a <link/> child, there will be an <ExternalPage/> node giving more detailed information about the content of that <link/>, such as <d:Title/> and <d:Description/>.

      However, not every <Topic/> node has one or more <link/> children, so I need a loop to find out whether <link/> is a child of each <Topic/> node. If some <link/> nodes exist, I check each <ExternalPage/> to output more information.

      My code is shown below; it is quite straightforward:

      Thanks again,
Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by tlm (Prior) on May 15, 2005 at 18:55 UTC

    Without seeing some code it is impossible to give any concrete advice. How have you taken advantage of XML::Twig to process only chunks of the XML tree? Have you considered using a purely event-driven parser like XML::Parser?

    the lowliest monk

      Hi,

      Thank you for the advice. Actually, as the file is huge, some nice people suggested I try XML::Twig as it is more efficient. My XML case is a little bit funny; if you don't mind, please have a look at the details below:

      My XML sample file is shown below:

      Basically, the XML file has two key parallel nodes: <Topic/> and <ExternalPage/>. If a <Topic/> has a <link/> child, there will be an <ExternalPage/> node giving more detailed information about the content of that <link/>, such as <d:Title/> and <d:Description/>.

      However, not every <Topic/> node has one or more <link/> children, so I need a loop to find out whether <link/> is a child of each <Topic/> node. If some <link/> nodes exist, I check each <ExternalPage/> to output more information.

      My code is shown below; it is quite straightforward:

      Thanks again,

        It looks like you are using XML::Twig in "tree mode" as opposed to "stream mode", which is what I suspected. It means that your code tries to read the entire tree into memory, instead of processing it one chunk at a time, which is what stream mode is good for.

        As I understand it, there are two basic approaches to parsing a tree like this. You can first build a tree object that your program can then traverse up and down as it pleases, and manipulate like any other data structure. Alternatively, you can define handlers (aka callbacks) that the parser will invoke whenever it encounters a particular condition (e.g. when it finds a particular tag) as it parses the tree. The latter ("event-driven") approach has the advantage that the parser does not need to read the whole tree into memory; the parsing and whatever you want to do with the parsed text go hand in hand. The downside is that your program cannot backtrack to examine parts of the tree that have already been parsed.

        I'm not very familiar with XML::Twig, but it appears to be a bit of a hybrid of these two basic approaches, in that it lets you install handlers that are triggered by parsing events, but it also lets your program access subtrees of the partially parsed tree. These subtrees can be manipulated as a whole, and then purged from memory. This makes it possible to keep only a small part of the tree in memory, just like with any other event-driven parser, such as XML::Parser, but manipulate entire chunks of the tree (possibly the whole tree) like you could with a tree-building parser.
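
        To illustrate the purely event-driven style, here is a minimal XML::Parser sketch. The inline document and the %seen hash are invented for the example; it only counts start tags and never builds a tree.

```perl
use strict;
use warnings;
use XML::Parser;

my %seen;    # element name => number of start tags encountered

my $parser = XML::Parser->new(
    Handlers => {
        # The Start handler receives the expat object, the element name,
        # and then the attributes as a flat list of name/value pairs.
        Start => sub {
            my ($expat, $name, %attr) = @_;
            $seen{$name}++;
        },
    },
);

# Hypothetical stand-in document.
$parser->parse('<RDF><Topic><link/></Topic><Topic/></RDF>');

print "$_: $seen{$_}\n" for sort keys %seen;
```

        Nothing is retained between callbacks except what the handler chooses to store, which is why this style can cope with very large inputs, and also why it cannot look back at earlier elements.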

        Anyway, be that as it may, below is my attempt to re-cast your code in terms of XML::Twig handlers. See the docs for XML::Twig for more details. I could not understand the point of changing directories, so that part of the code may be messed up; I commented it out for the purpose of running the code. The program installs two handlers at the time of invoking the constructor for the parser, one for Topic elements and one for ExternalPage elements. The handlers communicate via a shared variable, the %links hash.
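
        A minimal sketch of that two-handler design, not the code as originally posted. The attribute names `r:id`, `r:resource` and `about`, the namespace URIs, and the inline sample are all assumptions based on the description in this thread; @report is a hypothetical stand-in for whatever output is wanted.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for the real file.
my $xml = <<'XML';
<RDF xmlns:r="http://example.com/r" xmlns:d="http://example.com/d">
  <Topic r:id="Top/Arts">
    <link r:resource="http://example.com/a"/>
  </Topic>
  <ExternalPage about="http://example.com/a">
    <d:Title>Example A</d:Title>
    <d:Description>First example page</d:Description>
  </ExternalPage>
  <Topic r:id="Top/Empty"/>
</RDF>
XML

my %links;     # link URL => id of the Topic that listed it
my @report;    # one tab-separated line per matched ExternalPage

my $twig = XML::Twig->new(
    twig_handlers => {
        # Remember every link found under a Topic, then drop the Topic.
        Topic => sub {
            my ($t, $topic) = @_;
            for my $link ($topic->children('link')) {
                my $url = $link->att('r:resource');
                next unless defined $url;
                $links{$url} = $topic->att('r:id');
            }
            $t->purge;
        },
        # Report pages whose URL was listed by an earlier Topic.
        ExternalPage => sub {
            my ($t, $page) = @_;
            my $url = $page->att('about');
            if (defined $url && exists $links{$url}) {
                push @report, join "\t",
                    $links{$url}, $url,
                    $page->first_child_text('d:Title'),
                    $page->first_child_text('d:Description');
                delete $links{$url};    # keep the shared hash small
            }
            $t->purge;
        },
    },
);

$twig->parse($xml);
print "$_\n" for @report;
```

        This sketch assumes that each ExternalPage appears after the Topic that links to it; if the real file orders them differently, the lookup in %links would have to be arranged the other way round.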

        Let me know how it goes.

        the lowliest monk

Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by Molt (Chaplain) on May 16, 2005 at 15:25 UTC

    Okay, for a start I'm not mentioning the 2GB file size issue, that's been covered well enough already. I'm just touching XML::Twig itself.

    Looking at the docs, XML::Twig appears capable of handling very large XML files by not reading them into memory in one go. Unfortunately, I don't think your code does this: you don't set up any handlers, so it tries to load the entire XML tree into memory. Boom, that'd need 20GB of memory.

    Reread the docs on XML::Twig, in particular the section "Processing an XML document chunk by chunk". You need to guarantee that you never have too much in memory at any one time. I hope this document is built up of lots of small chunks, or you're in for an even larger challenge.

    I'll admit that personally I'd be using a full SAX parser at this point in any case; from my cursory look at XML::Twig, it doesn't look much simpler than doing it that way. It's all just handlers and callbacks at the end of the day.

    As for which SAX parser I'd use, I really don't know. I'd normally use XML::LibXML, but I'm not sure how well that works on Windows, so I can't comment there.
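
    For comparison, a rough sketch of the SAX style using XML::SAX, which lets XML::SAX::ParserFactory pick whichever SAX parser is installed. The TopicCounter package name and the inline document are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical handler class: counts start tags by element name.
package TopicCounter;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $el) = @_;
    # $el->{Name} is the element name as written in the document.
    $self->{count}{ $el->{Name} }++;
}

package main;
use XML::SAX::ParserFactory;

my $handler = TopicCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

# Hypothetical stand-in document.
$parser->parse_string('<RDF><Topic/><Topic/><ExternalPage/></RDF>');

for my $name (sort keys %{ $handler->{count} }) {
    print "$name: $handler->{count}{$name}\n";
}
```

    As with the XML::Twig handlers, the parser only hands over one event at a time, so any state that has to survive between elements must be kept by the handler itself.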