Best way to Download and Process a XML file

perl_gog has asked for the wisdom of the Perl Monks concerning the following question:

Re: Best way to Download and Process a XML file by tobyink (Canon) on Sep 24, 2012 at 22:30 UTC
150 GB? Ouch. AnyEvent::HTTP should allow you to issue an HTTP request and process the response a chunk at a time as it arrives, without having to save it anywhere. And XML::Twig can parse XML chunk by chunk. Pairing the two, you ought to be able to do this without temporary files. Exactly how to do it, I can't tell you: I have limited experience with AnyEvent::HTTP and virtually none with XML::Twig. `perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`
Re: Best way to Download and Process a XML file by remiah (Hermit) on Sep 25, 2012 at 01:50 UTC
XML::Twig has a parseurl() method, and I guess it is what you are looking for. See the Twig reference on CPAN and the Twig site, which includes a tutorial. I don't have experience with processing huge XML files, but purge() or flush() should have a great effect on a 150 GB XML file. I would like to hear your impressions, if I may... regards
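A minimal sketch of the handler-plus-purge() pattern mentioned above. The feed layout (a `<top>` root wrapping many small `<sub name="...">` records with a `<subsub>` child) is an assumption borrowed from the example further down the thread, and the `@seen` array merely stands in for real per-record processing. The sketch parses an in-memory string so that it runs offline; for the real feed the same twig object would be driven with `$twig->parseurl($url)` instead of `parse()`.

```perl
use strict;
use warnings;
use XML::Twig;

# Assumed feed layout: one root element wrapping many small records.
my $xml = <<'XML';
<top>
  <sub name="1"><subsub>some stuff</subsub></sub>
  <sub name="2"><subsub>more stuff</subsub></sub>
</top>
XML

my @seen;    # stands in for real per-record processing

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <sub> element has been parsed.
        'sub' => sub {
            my ($t, $elt) = @_;
            push @seen, $elt->att('name') . ': ' . $elt->first_child_text('subsub');
            $t->purge;    # discard everything parsed so far to keep memory flat
        },
    },
);

# For the real feed: $twig->parseurl($url);
$twig->parse($xml);

print "$_\n" for @seen;
```

purge() simply throws the processed part of the tree away; flush() would print it first, which is the one to reach for when the goal is to write a transformed copy of the feed.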
Re: Best way to Download and Process a XML file by dHarry (Abbot) on Sep 25, 2012 at 12:05 UTC
Sanity check: a 150 GB XML file??? Maybe it's time to rethink the problem?! Assuming enough disk space and patience, option 1 will work. Option 2 also has its drawbacks, e.g. "finally save it" sounds to me like keeping the file in memory... Or do you want to edit the file "in place"? Anyway, with XML files this big you probably don't want a pure Perl implementation. XML::LibXML jumps to mind. I have had good experiences parsing big XML files (10s of GB) with Xerces. Cheers Harry
Re^2: Best way to Download and Process a XML file by Jenda (Abbot) on Sep 25, 2012 at 13:59 UTC
I do hope you meant XML::LibXML::SAX. The thing is that what's normally meant by XML::LibXML is a DOM-style parser, that is, something that slurps the whole XML into memory and creates a maze of objects. In the case of XML::LibXML the objects reside in C land, so they do not waste as much space as they would if they were plain Perl objects, but with a huge XML file this is still not a good candidate. Even if the docs make some sense to you. If perl_gog can convince some HTTP library to give him a filehandle from which he can read the decoded data of the response, he could use XML::Rules in the filter mode and print the transformed XML directly into a file, with just some buffers and a twig of the XML kept in memory. Of course he'd have to make sure he doesn't add a rule for the root tag, as that would force the module to attempt to build a data structure for the whole document before writing anything! Feeding chunks of the file to XML::Rules is not (yet) supported. It seems it would not be hard to do though; XML::Parser::Expat has support for that. Update 2012-09-27: Right, adding the chunk processing support was not hard. I have not released the new version yet, as I did not have time to write proper tests for this and one more change, but if you are interested you can find the new version in the CPAN RT tracker. The code would then look something like this: `... $parser->filter_chunk('', "the_filtered.xml"); $ua->get($url, ':content_cb' => sub { my($data, $response, $protocol) = @_; $parser->filter_chunk($data); return 1 }); $parser->last_chunk();` Jenda Enoch was right! Enjoy the last years of Rome.
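The chunk-feeding support in the underlying parser that this reply refers to can be sketched with plain XML::Parser, whose parse_start() returns a non-blocking XML::Parser::ExpatNB object that accepts data piece by piece. This is not the XML::Rules interface from the post, only an illustration of the mechanism; the `<top>`/`<sub>` layout and the hand-split `@chunks` are assumptions standing in for the pieces an LWP `:content_cb` callback would deliver.

```perl
use strict;
use warnings;
use XML::Parser;

my $count = 0;

my $parser = XML::Parser->new(
    Handlers => {
        # Fires for every start tag, as soon as enough data has arrived.
        Start => sub {
            my ($expat, $element, %attr) = @_;
            $count++ if $element eq 'sub';
        },
    },
);

# parse_start returns an XML::Parser::ExpatNB object for incremental parsing.
my $nb = $parser->parse_start;

# Stand-in for network chunks; note that they are split in mid-tag.
my @chunks = ('<top><sub na', 'me="1">a</sub><s', 'ub name="2">b</sub></top>');

$nb->parse_more($_) for @chunks;
$nb->parse_done;

print "saw $count <sub> elements\n";
```

With LWP::UserAgent, the loop over `@chunks` would be replaced by a `:content_cb` callback that calls `$nb->parse_more($data)` on each piece, in the same shape as the snippet in the post.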
Re^3: Best way to Download and Process a XML file by choroba (Cardinal) on Sep 25, 2012 at 17:57 UTC
I prefer and recommend XML::LibXML::Reader. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
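A minimal sketch of what the pull-parser approach could look like, assuming the same hypothetical `<top>`/`<sub>`/`<subsub>` layout used elsewhere in this thread. The reader walks the document node by node, and only the one record copied out with copyCurrentNode(1) is held as a DOM fragment at any time. The sketch reads from a string so that it runs offline; for a file or stream the constructor would take `location => $path` or `IO => $fh` instead.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Assumed feed layout, inlined so the example is self-contained.
my $xml = '<top>'
        . '<sub name="1"><subsub>some stuff</subsub></sub>'
        . '<sub name="2"><subsub>more stuff</subsub></sub>'
        . '</top>';

my $reader = XML::LibXML::Reader->new(string => $xml);

my @records;    # stands in for real per-record processing

# Skip forward to each <sub> start tag in document order.
while ($reader->nextElement('sub')) {
    my $name = $reader->getAttribute('name');
    my $node = $reader->copyCurrentNode(1);    # deep copy of this record only
    push @records, "$name: " . $node->findvalue('subsub');
}

print "$_\n" for @records;
```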
Re^3: Best way to Download and Process a XML file by dHarry (Abbot) on Sep 25, 2012 at 14:27 UTC
Of course! Building a tree of 150 GB in memory... I still think Xerces is the best choice (available in multiple languages). I have parsed files up to 10-ish GB with it and it performed well.
Re: Best way to Download and Process a XML file by BrowserUk (Patriarch) on Sep 25, 2012 at 15:02 UTC
"The xml feed can be quite huge, (~150Gb max.)" Often as not with XML files that big, the feed consists of one top level tag that contains a raft of much smaller, identical (except an ID) substructures: `<top> <sub name=1> ... </sub> <sub name=2> ... </sub> ... </top>` It therefore becomes quite simple to do a preliminary parse of the datastream and break the huge dataset down into manageable chunks for processing: [code in spoiler] Produces: `C:\test>\perl64-10\bin\perl 995446.pl { "sub" => { name => 1, subsub => "\n some stuff\n " }, } { "sub" => { name => 2, subsub => "\n some stuff\n " }, } { "sub" => { name => 3, subsub => "\n some stuff\n " }, } { "sub" => { name => 4, subsub => "\n some stuff\n " }, } { "sub" => { name => 5, subsub => "\n some stuff\n " }, }` Of course, this 'breaks the rules' of XML processing, and requires you to assume some knowledge of the details of the XML you will be processing. But then the kinds of details required are usually, a) easily discovered; b) rarely change; c) easily catered for when and if they do change. So if you favour the pragmatism of getting the job done over more esoteric -- and revenue sink -- criteria such as 'being correct', bending the rules a little can save you a lot of time, effort and expense. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP Neil Armstrong
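The code from this post is not shown in the thread, so the following is only a guess at the general idea, not the author's script: read the stream in fixed-size blocks, accumulate them in a buffer, and hand each complete record to a callback as soon as its closing tag has arrived. The sub name `for_each_record`, the block size, and the quoted `name="..."` attributes are all hypothetical. It assumes records are never nested inside records of the same name and are never written as self-closing tags.

```perl
use strict;
use warnings;

# Call $callback with every complete <$tag ...>...</$tag> found on $fh.
sub for_each_record {
    my ($fh, $tag, $callback, $block_size) = @_;
    $block_size ||= 64 * 1024;
    my $buffer = '';
    while (read($fh, my $block, $block_size)) {
        $buffer .= $block;
        # Strip leading junk plus one complete record, as often as possible.
        # A record still missing its closing tag stays in the buffer.
        while ($buffer =~ s{\A.*?(<\Q$tag\E\b[^>]*>.*?</\Q$tag\E>)}{}s) {
            $callback->($1);
        }
    }
    return;
}

# Demo on an in-memory filehandle, with a tiny block size so that
# records are split across several reads.
my $xml = '<top>'
        . '<sub name="1"><subsub>some stuff</subsub></sub>'
        . '<sub name="2"><subsub>more stuff</subsub></sub>'
        . '</top>';
open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";

my @names;
for_each_record($fh, 'sub', sub {
    my ($record) = @_;
    my ($name) = $record =~ /name="(\d+)"/;
    push @names, $name;    # a real script would parse $record properly here
}, 16);
close $fh;

print "record $_\n" for @names;
```

Each extracted record is small enough to be passed to any ordinary in-memory XML parser, which is what keeps the memory footprint independent of the size of the feed.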

