PerlMonks  

Libxml parser consuming 100% CPU

by geek2882 (Initiate)
on Aug 11, 2018 at 06:50 UTC [id://1220224]

geek2882 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, how can I optimize this parser? It takes 2.5 minutes to finish the job, but CPU usage is 100% until the script finishes. I have posted the design and a small view of the code:

    # XML file opened and stored in the array @lines (the file contains record blocks)
    foreach $line (@lines) {       # 10 lakh (1,000,000) lines
        $stringparse .= $line;     # $stringparse accumulates one record
                                   # (maybe 100 lines, which go to the if block for parsing)
        if (Endtag of block) {     # executes 10,000 times, once per record ($stringparse = 100 lines of XML)
            $XML::LibXML::skipXMLDeclaration = 1;
            our $dom = XML::LibXML->load_xml(string => $stringparse);
            our $xml = $dom->documentElement;
            # ...
            # some of the API calls in my code:
            $method = $xml->getChildrenByTagName('Method')->to_literal;
            $Value  = $xml->getChildrenByTagName("$Check")->to_literal;
            $bune   = $xml->getChildrenByTagName('Number')->to_literal;
            if ( $xml->findnodes('./Indi/Lost/true') ) { }
            if (    $xml->findnodes('./Indi/Lost/true')
                 || $xml->findnodes('./Indi/Losgshshsht/false') ) { }
            if ( ($xml->findnodes('./nike'))[0]->firstChild ) {
                ($xml->findnodes('./nike'))[0]->firstChild->setData($localS);
            }
            my $Tag = $dom->createElement('Nike_identifier');
            $Tag->appendText($localS);
            $xml->addChild($Tag);
            $org = $xml->findnodes('./NikkooIdentifiers/abc')->to_literal;
        }
    }

Replies are listed 'Best First'.
Re: Libxml parser consuming 100% CPU
by haukex (Archbishop) on Aug 11, 2018 at 07:38 UTC

    If you could provide something we can actually run to see the performance issues ourselves, we could provide specific suggestions - see Short, Self-Contained, Correct Example. Until then, some general optimization tips:

    • Use Devel::NYTProf to locate the places where your code is spending the most time.
    • Consider using a different XML parser, since XML::LibXML will load the entire DOM structure into memory by default, while XML::Twig is geared more towards processing a large file piece by piece, without loading the whole file into memory.
    • I see you doing foreach over an array of apparently 1 million (?) lines - this means you're keeping those lines in memory. It would most likely be more efficient to use a while(<$filehandle>) loop to read the file line-by-line, without loading it all into memory.
    • I see you repeating some findnodes calls twice, causing that effort to be doubled. Consider storing the results in a local variable so you can use it more than once.

    If you have an input file of 1 million lines, consider that maybe it'll just take some time to process, and that 100% CPU consumption during that time is normal.
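    For illustration, here is a minimal sketch of the line-by-line approach from the list above. It assumes, hypothetically, that each record ends with a literal </Record> tag, and process_record() is a stand-in for the real XML::LibXML work:

```perl
use strict;
use warnings;

my $count = 0;

# Stand-in for the real work: this is where the OP would call
# XML::LibXML->load_xml(string => $record) and process the DOM.
sub process_record {
    my ($record) = @_;
    $count++;
}

open my $fh, '<', 'input.xml' or die "Can't open input.xml: $!";
my $buf = '';
while ( my $line = <$fh> ) {        # only one line read into memory at a time
    $buf .= $line;
    if ( $line =~ m{</Record>} ) {  # hypothetical end tag of one record block
        process_record($buf);
        $buf = '';                  # reset the buffer so records aren't re-parsed
    }
}
close $fh;
print "processed $count records\n";
```

    This keeps at most one record in memory instead of the whole million-line file.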

      Thanks. (First point) But I think if I use XML::Twig then the execution time will be more than 2.5 min. (Second point) I get it that XML::LibXML stores the doc in memory, but it's only a small record of around 100 lines? Do you think XML::Twig would reduce the CPU usage and speed up the execution? (Third point) Which findnodes are you talking about?
      I have some more questions. Which one is faster in each of these two options?

          1). $method = $xml->getChildrenByTagName('Method')->to_literal;
              $method = $xml->findnodes('./Method')->to_literal;

          2). $xml->findnodes('./Indi/Lost/true')
              $xml->exists('./Indi/Lost/true')
        But i think if i use XML::Twig then execution time will be more than 2.5min.

        When it comes to optimizations, I'd suggest not "guessing" which might be faster, but measuring and testing!

        i got it XML::LibXML store the doc in memory but its a small record of 100 Lines around ? do you think XML::Twig reduce the cpu usage and fast the execution.

        If I understand correctly that you're splitting your ~1 million line file into chunks of 100 lines and then processing those one at a time, then I would agree with the guess that XML::Twig might not give you a big speed boost (unless you have ridiculously long lines).

        On the other hand, if your ~1 million line file is one big, well-formed XML file, then you would be able to get rid of your custom "splitting" code and use XML::Twig to process the entire file, one "record" at a time. If you need help with that, you'd have to show us some sample input (see How do I post a question effectively? and Short, Self-Contained, Correct Example).
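        If the file really is one well-formed XML document, a sketch of that whole-file XML::Twig approach might look like this (assuming, hypothetically, that the records are <Record> elements with a <Method> child; substitute the real tag names):

```perl
use strict;
use warnings;
use XML::Twig;

my @methods;    # collect one value per record, just to show the flow

my $twig = XML::Twig->new(
    twig_handlers => {
        # called once per <Record> element, as soon as it has been fully parsed
        Record => sub {
            my ( $t, $rec ) = @_;
            push @methods, $rec->first_child_text('Method');
            # ... the rest of the per-record processing goes here ...
            $t->purge;    # free the memory of the record just handled
        },
    },
);
$twig->parsefile('input.xml');
print "processed ", scalar(@methods), " records\n";
```

        The purge call is what keeps memory flat: each record is discarded as soon as its handler returns, so the whole file is never in memory at once.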

        Which findnodes you are talking about.

        I was talking about this:

        if ( $xml->findnodes('./Indi/Lost/true') ) { }
        if (    $xml->findnodes('./Indi/Lost/true')
             || $xml->findnodes('./Indi/Losgshshsht/false') ) { }

        Which is better written like this, to avoid the doubling of the findnodes call (Update: unless of course the first if block makes modifications to the document that would require the second if to re-run the findnodes):

        my $result = $xml->findnodes('./Indi/Lost/true');
        if ( $result ) { }
        if ( $result || $xml->findnodes('./Indi/Losgshshsht/false') ) { }

        And a similar thing with ($xml->findnodes('./nike'))[0]->firstChild.

        Which one is fast there in each two option

        I don't have the time to test right now, but the go-to module for this kind of comparison is Benchmark. But as I said before, measure where your code is spending the most time with Devel::NYTProf, and then optimize those places, instead of guessing and doing what might turn out to be an unnecessary micro-optimization.
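        For what it's worth, a Benchmark skeleton for those two comparisons might look like this (the sample document and its tags are made up, not the OP's real data):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use XML::LibXML;

# A made-up one-record document, shaped roughly like the OP's tags:
my $xml = XML::LibXML->load_xml( string =>
    '<Rec><Method>foo</Method><Indi><Lost><true/></Lost></Indi></Rec>'
)->documentElement;

cmpthese( 10_000, {    # run each candidate 10_000 times and compare rates
    children_method  => sub { my $m = $xml->getChildrenByTagName('Method')->to_literal },
    findnodes_method => sub { my $m = $xml->findnodes('./Method')->to_literal },
    findnodes_bool   => sub { my $b = $xml->findnodes('./Indi/Lost/true') ? 1 : 0 },
    exists_bool      => sub { my $b = $xml->exists('./Indi/Lost/true') },
} );
```

        cmpthese prints a table of iterations per second and relative speed, which answers "which is faster" directly for your own data and perl build.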

        If you're saying your XML doc has only ~100 nodes, then the problem isn't memory, and using a pull parser like XML::Twig (or XML::LibXML::Reader) won't help.

Re: Libxml parser consuming 100% CPU
by tobyink (Canon) on Aug 11, 2018 at 08:17 UTC

    Consider using your operating system's ability to run processes at a lower priority. If you're using Linux, check out the nice command.
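    For example (a sketch; the perl one-liner here is a stand-in for the actual parser script):

```shell
# Run the whole job at the lowest scheduling priority (niceness 19) so that
# interactive processes still get the CPU first.
nice -n 19 perl -e 'print "parsing...\n"'

# An already-running process can also be reniced by its PID, e.g.:
#   renice -n 19 -p <pid>
```

    The job still uses 100% of otherwise-idle CPU, but it yields immediately to anything more important.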

      If you're using Linux

      or *BSD or Solaris or in fact anything which is POSIX compliant.

      The OP wants their program to run faster, not slower.

        Personally I read it as an X/Y problem thing. It taking 2.5 minutes to run is bad because during those 2.5 minutes, the CPU is so busy that other programs freeze up. If the script didn't freeze up the computer, it taking 5 or even 10 minutes might be more acceptable.

        run faster, not slower

        Well, you could read it as: make the rest of the programmes run nicer so that the important one runs faster.

        However, the nice(1) command may modify niceness both ways, so you can indeed use it to prioritize a single process.

        Still, with the process in question already clogging all available cpu, I wonder if it is any help at all.

        Cheers, Sören

        Créateur des bugs mobiles - let loose once, run everywhere.
        (hooked on the Perl Programming language)

Re: Libxml parser consuming 100% CPU
by ikegami (Patriarch) on Aug 11, 2018 at 12:54 UTC

    So you're saying it takes 2.5 minutes to parse and extract information from 10,000 XML documents. That's only 15 milliseconds per document! I'm thinking "Holy shit that's fast!"

      Actually I have XML files of 10 lakh tags, and my script takes 2.5 min to complete its job. I simply store the file's lines into an array and run a loop; each time I give a 100-line string to the libxml parser, and this continues until the loop finishes. Now the problem is CPU usage, which is 100% until the job finishes. My point is: how can I reduce it without a sleep command?

        You bring up 100% CPU as if it's a bad thing again, but 100% CPU is a good thing. It means no time is being wasted waiting for I/O.

        Think of it this way: Would you rather have an employee that works at 100% of the time they are at work, or 50%?

        From your code so far, you read 1 million lines and then use a very inefficient method (a lot of CPU and a lot of memory) to make a string variable of those lines.
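        One low-cost alternative is to let perl do the chunking via the input record separator $/, instead of concatenating a million lines yourself (this sketch assumes, hypothetically, that each record ends with a literal </Record> tag):

```perl
use strict;
use warnings;

open my $fh, '<', 'input.xml' or die "Can't open input.xml: $!";
local $/ = '</Record>';                  # read one whole record per <$fh>
my $count = 0;
while ( my $chunk = <$fh> ) {
    next unless $chunk =~ /<Record\b/;   # skip any trailing non-record tail
    # $chunk now holds exactly one record, end tag included;
    # this is where XML::LibXML->load_xml(string => $chunk) would go
    $count++;
}
close $fh;
print "read $count records\n";
```

        This removes both the @lines array and the repeated string concatenation from the hot path.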

        I don't understand exactly what you mean by 100% CPU. My Windows machine has 4 cores, which essentially means 4 CPUs that share a common big memory space.

        Unix is a time sharing O/S. Other processes will get CPU time even if one process is completely compute bound.

        I am not sure about these various XML Perl libs, but every time your program runs an I/O operation, the O/S scheduler will run. Maybe use something that takes less memory and does more I/O?

Re: Libxml parser consuming 100% CPU
by Jenda (Abbot) on Aug 11, 2018 at 15:04 UTC

    What the fsck?!?

    You choose to use a gulp-everything-and-transform-into-insane-maze-of-objects style XML parser, find out it's gonna choke on your document so you devise a "solution" that attempts to guess where the chunks end and then you fire the parser a thousand times to somehow handle it and quite possibly forget to clean the buffer so you end up parsing the same stuff over and over and over again? And just so that you waste more memory you first load the document into an array of lines? Seriously?

    Do yourself a favour, scratch the array and the loop, forget about XML::LibXML and use a parser that will let you handle the file in chunks. Say XML::Twig or XML::Rules.

    getChildrenByTagName, sweet jeesus!

    By the way, it's cute that you have stovky tisíc ("hundreds of thousands" in Czech) lines.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Libxml parser consuming 100% CPU
by Anonymous Monk on Aug 12, 2018 at 12:39 UTC
    use less!
    use less 'CPU';
    if (less->of('CPU')) {
        # 10 lakh lines
    }

Node Type: perlquestion [id://1220224]
Approved by marto
Front-paged by Corion