PerlMonks  

Libxml parser consuming 100% CPU

by geek2882 (Initiate)
on Aug 11, 2018 at 06:50 UTC [id://1220224]

geek2882 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, how can I optimize this parser? It takes 2.5 minutes to finish the job, but CPU usage is 100% until the script finishes. I have posted the design and a small view of the code:

    # XML file opened and stored in the array @lines (the file contains record blocks)
    foreach $line (@lines) {       # 10 lakh (1,000,000) lines
        $stringparse .= $line;     # $stringparse accumulates one record
                                   # (maybe 100 lines, which go to the if block for parsing)
        if (Endtag of block) {     # executes 10,000 times, once per record ($stringparse = 100 lines of XML)
            $XML::LibXML::skipXMLDeclaration = 1;
            our $dom = XML::LibXML->load_xml(string => $stringparse);
            our $xml = $dom->documentElement;
            # ...
            # some of the API calls in my code:
            $method = $xml->getChildrenByTagName('Method')->to_literal;
            $Value  = $xml->getChildrenByTagName("$Check")->to_literal;
            $bune   = $xml->getChildrenByTagName('Number')->to_literal;
            if ( $xml->findnodes('./Indi/Lost/true') ) { }
            if (    $xml->findnodes('./Indi/Lost/true')
                 || $xml->findnodes('./Indi/Losgshshsht/false') ) { }
            if ( ($xml->findnodes('./nike'))[0]->firstChild ) {
                ($xml->findnodes('./nike'))[0]->firstChild->setData($localS);
            }
            my $Tag = $dom->createElement('Nike_identifier');
            $Tag->appendText($localS);
            $xml->addChild($Tag);
            $org = $xml->findnodes('./NikkooIdentifiers/abc')->to_literal;
        }
    }

Replies are listed 'Best First'.
Re: Libxml parser consuming 100% CPU
by haukex (Archbishop) on Aug 11, 2018 at 07:38 UTC

    If you could provide something we can actually run to see the performance issues ourselves, we could provide specific suggestions - see Short, Self-Contained, Correct Example. Until then, some general optimization tips:

    • Use Devel::NYTProf to locate the places where your code is spending the most time.
    • Consider using a different XML parser, since XML::LibXML will load the entire DOM structure into memory by default, while XML::Twig is geared more towards processing a large file piece by piece, without loading the whole file into memory.
    • I see you doing foreach over an array of apparently 1 million (?) lines - this means you're keeping those lines in memory. It would most likely be more efficient to use a while(<$filehandle>) loop to read the file line-by-line, without loading it all into memory.
    • I see you repeating some findnodes calls twice, causing that effort to be doubled. Consider storing the results in a local variable so you can use it more than once.

    If you have an input file of 1 million lines, consider that maybe it'll just take some time to process, and that 100% CPU consumption during that time is normal.
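    For illustration, here is a minimal sketch of the line-by-line approach from the list above. It assumes, hypothetically, that each record ends with a literal </Record> tag, and process_record() is a stand-in for the real XML::LibXML work:

```perl
use strict;
use warnings;

my $count = 0;

# Stand-in for the real work: this is where the OP would call
# XML::LibXML->load_xml(string => $record) and process the DOM.
sub process_record {
    my ($record) = @_;
    $count++;
}

open my $fh, '<', 'input.xml' or die "Can't open input.xml: $!";
my $buf = '';
while ( my $line = <$fh> ) {        # only one line read into memory at a time
    $buf .= $line;
    if ( $line =~ m{</Record>} ) {  # hypothetical end tag of one record block
        process_record($buf);
        $buf = '';                  # reset the buffer so records aren't re-parsed
    }
}
close $fh;
print "processed $count records\n";
```

    This keeps at most one record in memory instead of the whole million-line file.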

      Thanks. (First point) But I think if I use XML::Twig then the execution time will be more than 2.5 min. (Second point) I get it that XML::LibXML stores the doc in memory, but it's only a small record of around 100 lines? Do you think XML::Twig would reduce the CPU usage and speed up the execution? (Third point) Which findnodes are you talking about?
      I have some more questions. Which one is faster in each of these two options?

          1). $method = $xml->getChildrenByTagName('Method')->to_literal;
              $method = $xml->findnodes('./Method')->to_literal;

          2). $xml->findnodes('./Indi/Lost/true')
              $xml->exists('./Indi/Lost/true')
        But i think if i use XML::Twig then execution time will be more than 2.5min.

        When it comes to optimizations, I'd suggest not "guessing" which might be faster, but measuring and testing!

        i got it XML::LibXML store the doc in memory but its a small record of 100 Lines around ? do you think XML::Twig reduce the cpu usage and fast the execution.

        If I understand correctly that you're splitting your ~1 million line file into chunks of 100 lines and then processing those one at a time, then I would agree with the guess that XML::Twig might not give you a big speed boost (unless you have ridiculously long lines).

        On the other hand, if your ~1 million line file is one big, well-formed XML file, then you would be able to get rid of your custom "splitting" code and use XML::Twig to process the entire file, one "record" at a time. If you need help with that, you'd have to show us some sample input (see How do I post a question effectively? and Short, Self-Contained, Correct Example).
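        If the file really is one well-formed XML document, a sketch of that whole-file XML::Twig approach might look like this (assuming, hypothetically, that the records are <Record> elements with a <Method> child; substitute the real tag names):

```perl
use strict;
use warnings;
use XML::Twig;

my @methods;    # collect one value per record, just to show the flow

my $twig = XML::Twig->new(
    twig_handlers => {
        # called once per <Record> element, as soon as it has been fully parsed
        Record => sub {
            my ( $t, $rec ) = @_;
            push @methods, $rec->first_child_text('Method');
            # ... the rest of the per-record processing goes here ...
            $t->purge;    # free the memory of the record just handled
        },
    },
);
$twig->parsefile('input.xml');
print "processed ", scalar(@methods), " records\n";
```

        The purge call is what keeps memory flat: each record is discarded as soon as its handler returns, so the whole file is never in memory at once.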

        Which findnodes you are talking about.

        I was talking about this:

        if ( $xml->findnodes('./Indi/Lost/true') ) { }
        if (    $xml->findnodes('./Indi/Lost/true')
             || $xml->findnodes('./Indi/Losgshshsht/false') ) { }

        Which is better written like this, to avoid the doubling of the findnodes call (Update: unless of course the first if block makes modifications to the document that would require the second if to re-run the findnodes):

        my $result = $xml->findnodes('./Indi/Lost/true');
        if ( $result ) { }
        if ( $result || $xml->findnodes('./Indi/Losgshshsht/false') ) { }

        And a similar thing with ($xml->findnodes('./nike'))[0]->firstChild.

        Which one is fast there in each two option

        I don't have the time to test right now, but the go-to module for this kind of comparison is Benchmark. But as I said before, measure where your code is spending the most time with Devel::NYTProf, and then optimize those places, instead of guessing and doing what might turn out to be an unnecessary micro-optimization.
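        For what it's worth, a Benchmark skeleton for those two comparisons might look like this (the sample document and its tags are made up, not the OP's real data):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use XML::LibXML;

# A made-up one-record document, shaped roughly like the OP's tags:
my $xml = XML::LibXML->load_xml( string =>
    '<Rec><Method>foo</Method><Indi><Lost><true/></Lost></Indi></Rec>'
)->documentElement;

cmpthese( 10_000, {    # run each candidate 10_000 times and compare rates
    children_method  => sub { my $m = $xml->getChildrenByTagName('Method')->to_literal },
    findnodes_method => sub { my $m = $xml->findnodes('./Method')->to_literal },
    findnodes_bool   => sub { my $b = $xml->findnodes('./Indi/Lost/true') ? 1 : 0 },
    exists_bool      => sub { my $b = $xml->exists('./Indi/Lost/true') },
} );
```

        cmpthese prints a table of iterations per second and relative speed, which answers "which is faster" directly for your own data and perl build.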

        If you're saying your XML doc has only ~100 nodes, then the problem isn't memory, and using a pull parser like XML::Twig (or XML::LibXML::Reader) won't help.

Re: Libxml parser consuming 100% CPU
by tobyink (Canon) on Aug 11, 2018 at 08:17 UTC

    Consider using your operating system's ability to run processes at a lower priority. If you're using Linux, check out the nice command.
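    For example (a sketch; the perl one-liner here is a stand-in for the actual parser script):

```shell
# Run the whole job at the lowest scheduling priority (niceness 19) so that
# interactive processes still get the CPU first.
nice -n 19 perl -e 'print "parsing...\n"'

# An already-running process can also be reniced by its PID, e.g.:
#   renice -n 19 -p <pid>
```

    The job still uses 100% of otherwise-idle CPU, but it yields immediately to anything more important.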

      If you're using Linux

      or *BSD or Solaris or in fact anything which is POSIX compliant.

      The OP wants their program to run faster, not slower.

        Personally I read it as an X/Y problem thing. It taking 2.5 minutes to run is bad because during those 2.5 minutes, the CPU is so busy that other programs freeze up. If the script didn't freeze up the computer, it taking 5 or even 10 minutes might be more acceptable.

        run faster, not slower

        Well, you could read it as: make the rest of the programmes run nicer so that the important one runs faster.

        However, the nice(1) command may modify niceness both ways, so you can indeed use it to prioritize a single process.

        Still, with the process in question already clogging all available cpu, I wonder if it is any help at all.

        Cheers, Sören

        Créateur des bugs mobiles - let loose once, run everywhere.
        (hooked on the Perl Programming language)

Re: Libxml parser consuming 100% CPU
by ikegami (Patriarch) on Aug 11, 2018 at 12:54 UTC

    So you're saying it takes 2.5 minutes to parse and extract information from 10,000 XML documents. That's only 15 milliseconds per document! I'm thinking "Holy shit that's fast!"

      Actually I have XML files of 10 lakh tags, and my script takes 2.5 min to complete its job. I simply store the file's lines into an array and run a loop; each time I give a 100-line string to the libxml parser, and this continues until the loop finishes. Now the problem is CPU usage, which is 100% until the job finishes. My point is: how can I reduce it without a sleep command?

        You bring up 100% CPU as if it's a bad thing again, but 100% CPU is a good thing. It means no time is being wasted waiting for I/O.

        Think of it this way: Would you rather have an employee that works at 100% of the time they are at work, or 50%?

        From your code so far, you read 1 million lines and then use a very inefficient method (a lot of CPU and a lot of memory) to make a string variable of those lines.
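        One low-cost alternative is to let perl do the chunking via the input record separator $/, instead of concatenating a million lines yourself (this sketch assumes, hypothetically, that each record ends with a literal </Record> tag):

```perl
use strict;
use warnings;

open my $fh, '<', 'input.xml' or die "Can't open input.xml: $!";
local $/ = '</Record>';                  # read one whole record per <$fh>
my $count = 0;
while ( my $chunk = <$fh> ) {
    next unless $chunk =~ /<Record\b/;   # skip any trailing non-record tail
    # $chunk now holds exactly one record, end tag included;
    # this is where XML::LibXML->load_xml(string => $chunk) would go
    $count++;
}
close $fh;
print "read $count records\n";
```

        This removes both the @lines array and the repeated string concatenation from the hot path.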

        I don't understand exactly what you mean by 100% CPU. My Windows machine has 4 cores, which essentially means 4 CPUs that share a common big memory space.

        Unix is a time sharing O/S. Other processes will get CPU time even if one process is completely compute bound.

        I am not sure about these various XML Perl libs, but every time your program runs an I/O operation, the O/S scheduler will run. Maybe use something that takes less memory and does more I/O?

Re: Libxml parser consuming 100% CPU
by Jenda (Abbot) on Aug 11, 2018 at 15:04 UTC

    What the fsck?!?

    You choose to use a gulp-everything-and-transform-into-insane-maze-of-objects style XML parser, find out it's gonna choke on your document so you devise a "solution" that attempts to guess where the chunks end and then you fire the parser a thousand times to somehow handle it and quite possibly forget to clean the buffer so you end up parsing the same stuff over and over and over again? And just so that you waste more memory you first load the document into an array of lines? Seriously?

    Do yourself a favour, scratch the array and the loop, forget about XML::LibXML and use a parser that will let you handle the file in chunks. Say XML::Twig or XML::Rules.

    getChildrenByTagName, sweet jeesus!

    By the way, it's cute that you have stovky tisíc ("hundreds of thousands" in Czech) lines.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Libxml parser consuming 100% CPU
by Anonymous Monk on Aug 12, 2018 at 12:39 UTC
    use less!
    use less 'CPU';
    if (less->of('CPU')) {
        # 10 lakh lines
    }

Node Type: perlquestion [id://1220224]
Approved by marto
Front-paged by Corion