PerlMonks  

Re^4: XML::Twig and threads

by BrowserUk (Pope)
on Nov 29, 2012 at 00:36 UTC (#1006133)


in reply to Re^3: XML::Twig and threads
in thread XML::Twig and threads [solved]

I think the real problem is even worse: with the real 10MB and 100MB XML files, he is pushing his process into swapping, so everything slows to a crawl.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


Re^5: XML::Twig and threads
by remiah (Hermit) on Nov 29, 2012 at 02:18 UTC

    Hello, BrowserUK.

    I see. The example XML is only 613KB.

    Copying the Twig object is terribly slow. I guess Data::Dumper or Storable's dclone would not do any good either, because the object is just huge.

    for ( $someData->children_copy( 'managedObject') ){ handle_managedObject($t, $_); }
    Without the copy, it is very fast.
    for ( $someData->children( 'managedObject') ){ handle_managedObject($t, $_); }
    So, I vaguely imagined rewriting the managedObject sub using a regex, for example ...
    my ($t, $element)=@_;
    
    # create rewrite rules using Twig
    # (attribute text to find => new element content)
    my %rewrite_rules =(
        q/name="name"/ => "some value",
    );
    
    
    # replace with regex
    my $buffer=$element->sprint; # get plain text of element
    for my $attr (keys %rewrite_rules){
        # \Q..\E quotes the key; $1 keeps the start tag through its closing '>'
        $buffer =~ s/ (\Q$attr\E .*? >) .*? (?=<)
                    /$1$rewrite_rules{$attr}/sx;
    }
    
    # just print out without changing $element
    print $fh $buffer;
    
    
    I would do it like this, if I were the OP.
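    The substitution can be tried without Twig on a plain string. This is only a minimal sketch: the sample element and the single rule are made up for illustration, and the string stands in for what $element->sprint would return.

```perl
use strict;
use warnings;

# Hypothetical rule and sample element, standing in for $element->sprint.
my %rewrite_rules = ( q/name="name"/ => 'some value' );
my $buffer = '<managedObject><p name="name">old</p><p name="id">7</p></managedObject>';

for my $attr ( keys %rewrite_rules ) {
    # $1 holds everything from the attribute through the closing '>'
    # of the start tag; only the text up to the next '<' is replaced.
    $buffer =~ s/ (\Q$attr\E .*? >) .*? (?=<) /$1$rewrite_rules{$attr}/sx;
}

print "$buffer\n";
```

    Capturing the start tag and putting it back is what keeps the '>' in the output; the element with name="id" is left alone because no rule mentions it.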

    Regards and thanks for your response.

      I vaguely imagined rewriting managedObject sub using regex,

      Given the size of the OP's target files -- 100MB -- I'd drop the use of an XML parser entirely and process the file as text.

      That statement is tantamount to blasphemy around here, but it would be quicker, simpler and would work. And nobody would think twice if the files weren't labeled "XML".
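      One way to process the file as text, sketched under assumptions: every record ends with a literal </managedObject> closing tag, records are not nested, and the rule and sample XML below are invented for the demo. Setting the input record separator $/ to that closing tag makes Perl hand back one managedObject per read, so only one record is held in memory at a time.

```perl
use strict;
use warnings;

# Hypothetical rules: attribute text to look for => new element content.
my %rewrite_rules = ( q/name="name"/ => 'some value' );

# Read one managedObject at a time by using its closing tag as the
# input record separator, so memory use stays flat for large files.
sub rewrite_stream {
    my ( $in, $out ) = @_;
    local $/ = '</managedObject>';
    while ( my $record = <$in> ) {
        for my $attr ( keys %rewrite_rules ) {
            # Replace the text of the first matching element in this record.
            $record =~ s/ (\Q$attr\E [^>]* >) [^<]* (?=<) /$1$rewrite_rules{$attr}/x;
        }
        print {$out} $record;
    }
    return;
}

# Demo on in-memory filehandles; the sample XML is made up.
my $xml = <<'XML';
<raml>
<managedObject class="A"><p name="name">old1</p><p name="id">1</p></managedObject>
<managedObject class="B"><p name="name">old2</p><p name="id">2</p></managedObject>
</raml>
XML

open my $in, '<', \$xml or die "open in: $!";
my $result = '';
open my $out, '>', \$result or die "open out: $!";
rewrite_stream( $in, $out );
close $out;
print $result;
```

      For a real run the two in-memory handles would be replaced by handles on the input and output files. The trailing text after the last record (here the closing </raml>) comes back as a final short read and is passed through unchanged.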



        Me too. :)

        - tye        

        Haha, that was a good one. I tried the replacement mentioned above and it takes under 2 minutes for a 90MB XML file, so I would expect about 0.5h for the maximum data set. Thanks for the hints!
