Re: Libxml parser consuming 100% cpu
by haukex (Archbishop) on Aug 11, 2018 at 07:38 UTC
If you could provide something we can actually run to see the performance issues ourselves, we could provide specific suggestions - see Short, Self-Contained, Correct Example. Until then, some general optimization tips:
- Use Devel::NYTProf to locate the places where your code is spending the most time.
- Consider using a different XML parser, since XML::LibXML will load the entire DOM structure into memory by default, while XML::Twig is geared more towards processing a large file piece by piece, without loading the whole file into memory.
- I see you doing foreach over an array of apparently 1 million (?) lines - this means you're keeping those lines in memory. It would most likely be more efficient to use a while(<$filehandle>) loop to read the file line-by-line, without loading it all into memory.
- I see you making some of the same findnodes calls twice, doubling that effort. Consider storing the result in a variable so you can use it more than once.
If you have an input file of 1 million lines, consider that maybe it'll just take some time to process, and that 100% CPU consumption during that time is normal.
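The line-by-line approach from the list above can be sketched like this (a minimal illustration, not the OP's actual code; the filename and the process_chunk() helper are placeholders):

```perl
use strict;
use warnings;

# Placeholder filename; substitute the real input file.
open my $fh, '<', 'input.xml' or die "Can't open input.xml: $!";

my @chunk;
while ( my $line = <$fh> ) {    # reads one line at a time, low memory
    push @chunk, $line;
    if ( @chunk == 100 ) {      # hand 100-line chunks to the parser
        process_chunk( join '', @chunk );    # hypothetical helper
        @chunk = ();
    }
}
process_chunk( join '', @chunk ) if @chunk;    # any leftover lines
close $fh;
```

This keeps at most 100 lines in memory at a time, instead of holding the whole file in an array.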
Thanks,
(first point)
But I think if I use XML::Twig, the execution time will be more than 2.5 min. Second thing: I understand that XML::LibXML stores the doc in memory, but each chunk is only a small record of around 100 lines. Do you think XML::Twig would reduce the CPU usage and speed up the execution?
(third point)
Which findnodes calls are you talking about?
I have some more questions. Which one is faster in each of these two options:
1).
$method = $xml->getChildrenByTagName('Method')->to_literal;
$method = $xml->findnodes('./Method')->to_literal;
2).
$xml->findnodes('./Indi/Lost/true')
$xml->exists('./Indi/Lost/true')
But i think if i use XML::Twig then execution time will be more than 2.5min.
When it comes to optimizations, I'd suggest not "guessing" which might be faster, but measuring and testing!
I understand that XML::LibXML stores the doc in memory, but each chunk is only a small record of around 100 lines. Do you think XML::Twig would reduce the CPU usage and speed up the execution?
If I understand correctly that you're splitting your ~1 million line file into chunks of 100 lines and then processing those one at a time, then I would agree with the guess that XML::Twig might not give you a big speed boost (unless you have ridiculously long lines).
On the other hand, if your ~1 million line file is one big, well-formed XML file, then you would be able to get rid of your custom "splitting" code and use XML::Twig to process the entire file, one "record" at a time. If you need help with that, you'd have to show us some sample input (see How do I post a question effectively? and Short, Self-Contained, Correct Example).
Which findnodes you are talking about.
I was talking about this:
if ($xml->findnodes('./Indi/Lost/true') ) {
}
if ( $xml->findnodes('./Indi/Lost/true') || $xml->findnodes('./Indi/Losgshshsht/false') ) {
}
Which is better written like this, to avoid calling findnodes twice with the same expression (Update: unless of course the first if block modifies the document in a way that would require the second if to re-run the findnodes):
my $result = $xml->findnodes('./Indi/Lost/true');
if ( $result ) {
}
if ( $result || $xml->findnodes('./Indi/Losgshshsht/false') ) {
}
And a similar thing applies to ($xml->findnodes('./nike'))[0]->firstChild.
Which one is faster in each of the two options
I don't have the time to test right now, but the go-to module for this kind of comparison is Benchmark. But as I said before, measure where your code is spending the most time with Devel::NYTProf, and then optimize those places, instead of guessing and doing what might turn out to be an unnecessary micro-optimization.
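A minimal Benchmark sketch for the comparisons above, using a tiny stand-in document (the element names are guesses based on the snippets in this thread, not the OP's real data):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use XML::LibXML;

# Hypothetical stand-in for one of the OP's 100-line chunks.
my $doc = XML::LibXML->load_xml( string =>
    '<Rec><Method>foo</Method><Indi><Lost><true/></Lost></Indi></Rec>' );
my $xml = $doc->documentElement;

# Run each variant for about 1 CPU second and compare rates.
cmpthese( -1, {
    getchildren => sub { my $m = $xml->getChildrenByTagName('Method')->to_literal },
    findnodes   => sub { my $m = $xml->findnodes('./Method')->to_literal },
    fn_exists   => sub { my $e = $xml->exists('./Indi/Lost/true') },
    fn_find     => sub { my $e = $xml->findnodes('./Indi/Lost/true') ? 1 : 0 },
} );
```

cmpthese prints a table of iterations per second, which answers the "which is faster" question directly for your own data.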
Re: Libxml parser consuming 100% cpu
by tobyink (Canon) on Aug 11, 2018 at 08:17 UTC
Consider using your operating system's ability to run processes at a lower priority. If you're using Linux, check out the nice command.
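For example (the script name below is a placeholder, not taken from the thread):

```shell
# nice(1) lowers a process's scheduling priority (niceness 19 = lowest),
# so interactive programs stay responsive while the parser churns away:
#   nice -n 19 perl parse_xml.pl
# Quick demonstration that nice simply runs the given command:
nice -n 10 sh -c 'echo started at lower priority'
```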
Personally I read it as an X/Y problem thing. It taking 2.5 minutes to run is bad because during those 2.5 minutes, the CPU is so busy that other programs freeze up. If the script didn't freeze up the computer, it taking 5 or even 10 minutes might be more acceptable.
run faster, not slower
Well, you could read it as: make the rest of the programmes run nicer so that the important one runs faster.
However, the nice(1) command can adjust niceness in both directions (raising a process's priority requires root), so you can indeed use it to prioritise a single process.
Still, with the process in question already using all available CPU, I wonder whether it helps at all.
Cheers, Sören
Créateur des bugs mobiles - let loose once, run everywhere.
(hooked on the Perl Programming language)
Re: Libxml parser consuming 100% cpu
by ikegami (Patriarch) on Aug 11, 2018 at 12:54 UTC
So you're saying it takes 2.5 minutes to parse and extract information from 10,000 XML documents. That's only 15 milliseconds per document! I'm thinking "Holy shit that's fast!"
Actually, I have XML files of 10 lakh (1 million) tags, and my script takes 2.5 min to complete its job. Simply put, I store the file's lines in an array and run a loop; I give 100 lines to the libxml parser, so each time the parser processes a 100-line string, and this continues until the loop finishes. Now the problem is the CPU usage, which is 100% until the job finishes. My point is: how can I reduce it without a sleep command?
You bring up 100% CPU as if it's a bad thing again, but 100% CPU is a good thing. It means no time is being wasted waiting for I/O.
Think of it this way: Would you rather have an employee that works at 100% of the time they are at work, or 50%?
From your code so far, you read 1 million lines and then use a very inefficient method (lots of CPU and lots of memory) to build a string variable from those lines.
I don't understand exactly what you mean by 100% CPU. My Windows machine has 4 cores, which essentially means 4 CPUs sharing one big common memory space.
Unix is a time sharing O/S. Other processes will get CPU time even if one process is completely compute bound.
I am not sure about these various XML Perl libs, but every time your program performs an I/O operation, the O/S scheduler gets a chance to run. Maybe use something that takes less memory and does more I/O?
Re: Libxml parser consuming 100% cpu
by Jenda (Abbot) on Aug 11, 2018 at 15:04 UTC
What the fsck?!?
You choose to use a gulp-everything-and-transform-into-insane-maze-of-objects style XML parser, find out it's gonna choke on your document so you devise a "solution" that attempts to guess where the chunks end and then you fire the parser a thousand times to somehow handle it and quite possibly forget to clean the buffer so you end up parsing the same stuff over and over and over again? And just so that you waste more memory you first load the document into an array of lines? Seriously?
Do yourself a favour, scratch the array and the loop, forget about XML::LibXML and use a parser that will let you handle the file in chunks. Say XML::Twig or XML::Rules.
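A hedged sketch of the XML::Twig approach, assuming the big file is well-formed XML made of repeated record elements (the element name 'Rec' and the filename are guesses, not the OP's real format):

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # 'Rec' is a hypothetical record element name; adjust to the input.
        Rec => sub {
            my ( $t, $rec ) = @_;
            my $method = $rec->first_child_text('Method');
            # ... process one record here ...
            $t->purge;    # release the memory used by this record
        },
    },
);
$twig->parsefile('big.xml');    # placeholder filename
```

Because each record is purged after its handler runs, memory stays bounded no matter how large the file is, and there is no need for hand-rolled splitting code.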
getChildrenByTagName, sweet Jesus!
By the way, it's cute that you have stovky tisíc (Czech for "hundreds of thousands") lines.
Jenda
Enoch was right!
Enjoy the last years of Rome.
Re: Libxml parser consuming 100% cpu
by Anonymous Monk on Aug 12, 2018 at 12:39 UTC
use less 'CPU';
if (less->of('CPU')) {
    # 10 lakh lines
}