Parsing extremely large XML files

by Spitters (Initiate)
on Jul 22, 2002 at 16:12 UTC

Spitters has asked for the wisdom of the Perl Monks concerning the following question:

Hey. I need to parse an extremely large XML database. First I want to generate simple ASCII lists of all values of certain elements at various levels of the XML structure. That's a one-time action, so I don't really mind if it takes a while to finish. Later, though, I also want to be able to retrieve elements from the database really fast, or look up element values in real time. I don't have any experience with XML modules for Perl, so my questions are: 1) What's the most appropriate XML parser for processing large XML files? 2) Can anyone provide some code for basic operations? Thanks a lot, Martijn

Replies are listed 'Best First'.
Re: Parsing extremely large XML files
by samtregar (Abbot) on Jul 22, 2002 at 17:24 UTC
    You'll need to use a SAX parser (XML::SAX) or a SAX-like stream parser (XML::Parser) to keep from loading the whole file into memory when you parse. Also, you'll need to think carefully about what you do with the data as you parse it; it's no good using a stream parser if you just build a giant hash with it! I had success parsing 500MB of XML into a MySQL database with XML::Parser, but these days I would probably go the shiny new XML::SAX route.
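
    For instance, a minimal stream-parsing sketch with XML::Parser might look like this (the element name 'title' and the file name 'large.xml' are just placeholders for whatever your data actually uses):

        use strict;
        use warnings;
        use XML::Parser;

        my $in_title = 0;

        # The handlers fire as the parser streams through the file, so
        # the whole document is never held in memory at once.
        my $parser = XML::Parser->new(
            Handlers => {
                Start => sub {
                    my ($expat, $element) = @_;
                    $in_title = 1 if $element eq 'title';
                },
                Char => sub {
                    my ($expat, $text) = @_;
                    print "$text\n" if $in_title;
                },
                End => sub {
                    my ($expat, $element) = @_;
                    $in_title = 0 if $element eq 'title';
                },
            },
        );

        $parser->parsefile('large.xml');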

    -sam

Re: Parsing extremely large XML files
by simeon2000 (Monk) on Jul 22, 2002 at 16:49 UTC
    Although I haven't done much XML parsing in my day, I am positive that SAX parsers are better suited to working through extremely large XML files than DOM parsers. The reason is that SAX processes the document as a stream, firing your event-handler code as each element is encountered, whereas DOM loads the whole document into memory in one go.

    Although there seems to be a plethora of SAX-related modules on CPAN, I would recommend starting with XML::SAX.
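
    As a bare-bones sketch, an XML::SAX handler that prints the text of every <title> element might look like this (the element name and file name are placeholders):

        use strict;
        use warnings;

        package TitleDumper;
        use base qw(XML::SAX::Base);

        # The parser calls these event methods as it streams through
        # the document, so memory use stays flat.
        sub start_element {
            my ($self, $el) = @_;
            $self->{in_title} = 1 if $el->{Name} eq 'title';
        }

        sub characters {
            my ($self, $data) = @_;
            print $data->{Data}, "\n" if $self->{in_title};
        }

        sub end_element {
            my ($self, $el) = @_;
            $self->{in_title} = 0 if $el->{Name} eq 'title';
        }

        package main;
        use XML::SAX::ParserFactory;

        my $parser = XML::SAX::ParserFactory->parser(
            Handler => TitleDumper->new,
        );
        $parser->parse_uri('large.xml');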

    "Falling in love with map, one block at a time." - simeon2000

Database vs. XML document
by cebrown (Pilgrim) on Jul 22, 2002 at 19:02 UTC
    Others have correctly commented that a SAX parser is the way to go. I'll focus on something else -- you mention that you want to get elements from the XML really fast.

    If that's the case, you ought to think about parsing the document once into a database, then using the database for all future access. Databases are really not all that scary; just grab and install Postgres or something similar and you'll be on your way.

    Databases will be good for you today because you can define indexes for searching, and they will be good for you tomorrow because odds are your "really fast" searching will become "really fast, and pretty complex, and I want to make updates too..." as time goes on.

    If you don't use a database, you will need to SAX-parse your whole XML document on every query, and although SAX is really fast, the I/O alone will kill you.
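
    As a rough sketch of the parse-once-then-query approach using DBI (the database name, credentials, and the two-column elements table are all made up for illustration):

        use strict;
        use warnings;
        use DBI;

        # Connect to a local Postgres database (connection details are
        # placeholders).
        my $dbh = DBI->connect('dbi:Pg:dbname=xmldata', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 0 });

        # One-time load: call this from your SAX handlers as each
        # element of interest completes.
        my $insert = $dbh->prepare(
            'INSERT INTO elements (name, value) VALUES (?, ?)');

        sub store_element {
            my ($name, $value) = @_;
            $insert->execute($name, $value);
        }

        # ... run the parse, then:
        $dbh->commit;

        # Later lookups hit the database instead of re-parsing the XML:
        my $values = $dbh->selectcol_arrayref(
            'SELECT value FROM elements WHERE name = ?', undef, 'title');
        print "$_\n" for @$values;

    An index on the name column is what makes those later lookups fast, so define one as part of your schema.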

Re: Parsing extremely large XML files
by mp (Deacon) on Jul 22, 2002 at 22:36 UTC
