Best way to Download and Process an XML file

by perl_gog (Initiate)
on Sep 24, 2012 at 22:04 UTC

perl_gog has asked for the wisdom of the Perl Monks concerning the following question:

Hello perl-monks,
I am trying to:
a) download an XML file,
b) do some basic processing on it, and
c) finally save it.

I was wondering which is the best option for doing this:

Option 1:
system("wget -O - 'http://host/getFeed.xml' > /tmp/myfeed.xml"); //d +ownload + save as tmp file.. open FH, "</tmp/myfeed.xml"; while(<FH>) { //manipulation steps here.. } close FH;

Option 2:
Is there a way to process the file as it's being downloaded (instead of having to save it temporarily and then read it back to process it)?

I don't know how to implement option #2. Do you have any suggestions?
The XML feed can be quite huge (~150 GB max). Because the feed is so large, I am hoping there's a better option than option #1, since writing the file to a tmp file first and then saving it again would mean more disk activity.

thanks!

Replies are listed 'Best First'.
Re: Best way to Download and Process an XML file
by tobyink (Canon) on Sep 24, 2012 at 22:30 UTC

    150 GB? Ouch.

    AnyEvent::HTTP should allow you to issue an HTTP request, and process it a chunk at a time, while it arrives, without having to save it anywhere.

    And XML::Twig can parse XML chunk by chunk.

    Pairing the two, you ought to be able to do this without temporary files. Exactly how to do it, I can't tell you: I have limited experience with AnyEvent::HTTP and virtually none with XML::Twig.
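
    For illustration, a minimal sketch of how the streaming half might be wired up, using AnyEvent::HTTP's on_body callback. It feeds each chunk to XML::Parser's push interface (parse_start / parse_more) as a stand-in for whichever chunk-capable parser you settle on; the URL and the <sub> tag name are made up:

    use strict;
    use warnings;
    use AnyEvent;
    use AnyEvent::HTTP;
    use XML::Parser;

    # Count <sub> records as a stand-in for real processing.
    my $count  = 0;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { my ($expat, $tag) = @_; $count++ if $tag eq 'sub' },
        },
    );
    my $nb = $parser->parse_start;    # non-blocking (push) parser

    my $cv = AnyEvent->condvar;
    http_get 'http://host/getFeed.xml',
        on_body => sub {
            my ($chunk, $headers) = @_;
            $nb->parse_more($chunk);  # parse each chunk as it arrives
            return 1;                 # keep downloading
        },
        sub {                         # completion callback
            $nb->parse_done;
            $cv->send;
        };
    $cv->recv;

    print "saw $count <sub> records\n";

    Writing the processed output as you go (the "finally save it" part) would then just be a print to an output filehandle from inside the handlers.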

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: Best way to Download and Process an XML file
by remiah (Hermit) on Sep 25, 2012 at 01:50 UTC

    XML::Twig has a parseurl() method, and I guess it is what you are looking for.

    Twig reference on CPAN
    twig site, includes tutorial

    I don't have experience with huge XML processing. purge() or flush() should have a great effect on 150 GB of XML. I would like to hear your impressions, if I could...
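
    A minimal sketch of that combination, assuming the feed is one big root element wrapping repeated <sub> records (the tag name and URL are made up):

    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            'sub' => sub {
                my ($t, $elt) = @_;
                # ... process one record here ...
                print $elt->att('name'), "\n";
                $t->purge;    # release everything parsed so far
            },
        },
    );

    # parseurl() streams the document into the parser chunk by chunk
    $twig->parseurl('http://host/getFeed.xml');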

    regards

Re: Best way to Download and Process an XML file
by dHarry (Abbot) on Sep 25, 2012 at 12:05 UTC

    Sanity check: 150GB XML file??? Maybe it's time to rethink the problem?!

    Assuming enough disk space and patience, option 1 will work.

    Option 2 also has its drawbacks, e.g. "finally save it" sounds to me like keeping the file in memory... Or do you want to edit the file "in place"? Anyway, with XML files this big you probably don't want a pure Perl implementation. XML::LibXML jumps to mind. I have had good experience parsing big XML files (tens of GB) with Xerces.

    Cheers

    Harry

      I do hope you meant XML::LibXML::SAX. The thing is that what's normally meant by XML::LibXML is a DOM-style parser, that is, something that slurps the whole XML into memory and creates a maze of objects. In the case of XML::LibXML the objects reside in C land, so they do not waste as much space as they would if they were plain Perl objects, but with a huge XML document it is still not a good candidate. Even if the docs make some sense to you.

      If perl_gog can convince some HTTP library to give him a filehandle from which he can read the decoded data of the response, he could use XML::Rules in filter mode and print the transformed XML directly into a file, with just some buffers and a twig of the XML kept in memory. Of course, he'd have to make sure he doesn't add a rule for the root tag, as that would force the module to attempt to build a data structure for the whole document before writing anything! Feeding chunks of the file to XML::Rules is not (yet) supported. It seems it would not be hard to add, though; XML::Parser::Expat has support for that.

      Update 2012-09-27: Right, adding the chunk processing support was not hard. I have not released the new version yet, as I have not had time to write proper tests for this and one more change, but if you are interested you can find the new version in the CPAN RT tracker. The code would then look something like this:

      ...
      $parser->filter_chunk('', "the_filtered.xml");
      $ua->get($url, ':content_cb' => sub {
          my ($data, $response, $protocol) = @_;
          $parser->filter_chunk($data);
          return 1;
      });
      $parser->last_chunk();

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

        I prefer and recommend XML::LibXML::Reader.
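
        For example, a minimal sketch with XML::LibXML::Reader, again assuming repeated <sub> records (the tag name and file path are made up):

        use strict;
        use warnings;
        use XML::LibXML::Reader;

        # Pull-parse the file; only the current <sub> subtree is ever
        # expanded into a DOM fragment, so memory use stays small.
        my $reader = XML::LibXML::Reader->new(location => '/tmp/myfeed.xml')
            or die "cannot open /tmp/myfeed.xml\n";

        while ($reader->nextElement('sub')) {
            my $elt = $reader->copyCurrentNode(1);   # deep copy of this record only
            # ... process $elt (a regular XML::LibXML::Element) ...
            print $elt->getAttribute('name'), "\n";
        }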
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Of course! Building a tree of 150 GB in memory...

        I still think Xerces is the best choice (available in multiple languages). I have parsed files up to 10-ish GB with it and it performed well.

Re: Best way to Download and Process an XML file
by BrowserUk (Patriarch) on Sep 25, 2012 at 15:02 UTC
    The XML feed can be quite huge (~150 GB max).

    As often as not with XML files that big, the feed consists of one top-level tag that contains a raft of much smaller, identical (except for an ID) substructures:

    <top>
        <sub name=1> ... </sub>
        <sub name=2> ... </sub>
        ...
    </top>

    It therefore becomes quite simple to do a preliminary parse of the datastream and break the huge dataset down into manageable chunks for processing:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    use XML::Simple;

    # open DATA, '<', 'stream';

    my $enc = <DATA>;
    my $bot = my $top = <DATA>;
    $bot =~ s[^<(\w+).*][</$1>]s;

    my $section = '';
    until( ( my $line = <DATA> ) =~ m[$bot] ) {
        my( $tag ) = $line =~ m[<(\w+)];
        my $end = "</$tag>";

        $section .= $line;
        $section .= <DATA> until $section =~ m[$end\s*$];

        my $ref = XMLin( $enc . $top . $section . $bot );

        ## do something with this section
        pp $ref;

        $section = '';
    }

    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <top>
        <sub name='1'>
            <subsub>
                some stuff
            </subsub>
        </sub>
        <sub name='2'>
            <subsub>
                some stuff
            </subsub>
        </sub>
        <sub name='3'>
            <subsub>
                some stuff
            </subsub>
        </sub>
        <sub name='4'>
            <subsub>
                some stuff
            </subsub>
        </sub>
        <sub name='5'>
            <subsub>
                some stuff
            </subsub>
        </sub>
    </top>

    Produces:

    C:\test>\perl64-10\bin\perl 995446.pl
    {
      "sub" => { name => 1, subsub => "\n            some stuff\n        " },
    }
    {
      "sub" => { name => 2, subsub => "\n            some stuff\n        " },
    }
    {
      "sub" => { name => 3, subsub => "\n            some stuff\n        " },
    }
    {
      "sub" => { name => 4, subsub => "\n            some stuff\n        " },
    }
    {
      "sub" => { name => 5, subsub => "\n            some stuff\n        " },
    }

    Of course, this 'breaks the rules' of XML processing, and requires you to assume some knowledge of the details of the XML you will be processing. But the kinds of details required are usually a) easily discovered; b) rarely change; c) easily catered for when and if they do change.

    So if you favour the pragmatism of getting the job done over more esoteric -- and revenue sink -- criteria such as 'being correct', bending the rules a little can save you a lot of time, effort and expense.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong
