
XML data extraction

by (Beadle)
on Oct 11, 2017 at 11:31 UTC ( #1201152=perlquestion ) has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a complex XML file (around 1000 of records) and need to retrieve a few elements based on the following conditions:

1. I want to read node only where ciType="application"
2. and get the value of entry where key="Last Status Change" (expected output: 10/2/2017 10:16 PM)

    <nodes>
      <node name="ABC" id="1" ciType="null">
        <children>
          <node name="Business Systems" classification="null" ciType="ci_collection">
            <dimension name="Unresolved Events" status="-2" id="3232">
            </dimension>
            <dimension name="Availability" status="10" id="12313">
            </dimension>
            <children>
              <node name="3YP (421)" classification="null" ciType="application" >
                <dimension name="Availability" status="20" id="2312">
                  <body>
                    <entry key="Status">
                      <![CDATA[ OK ]]>
                    </entry>
                    <entry key="Business Rule">
                      <![CDATA[ Percentage Rule ]]>
                    </entry>
                    <entry key="Last Status Change">
                      <![CDATA[ 10/2/2017 10:16 PM ]]>
                    </entry>
                  </body>
                </dimension>
              </node>
            </children>
          </node>
        </children>
      </node>
    </nodes>

sample code

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::XPath;
    use XML::LibXML::NodeList;
    use Data::Dumper;

    my @records;
    my $bamxml = 'OpenApi.xml';
    my $bamxp = XML::XPath->new(filename => $bamxml);
    my $bamxpath = $bamxp->findnodes('//nodes/node/children/node/children/node');
    my $pattern = shift;
    my $matches = XML::LibXML::NodeList->new;
    foreach my $bamnode ($bamxpath->get_nodelist) {
        my $name   = $bamxp->find('./@name',$bamnode)->string_value;
        my $citype = $bamxp->find('./@ciType',$bamnode)->string_value;
        my $status = $bamxp->find('./dimension/@status',$bamnode)->string_value;
        my $time   = $bamxp->find("./dimension/body/entry",$bamnode)->string_value;
        s/^\s+|\s+$//g for $name,$citype,$status,$time;
        push @records, {
            name   => $name,
            citype => $citype,
            status => $status,
            Time   => $time
        };
    }

Required Output

    $VAR1 = [
              {
                'status' => '20',
                'citype' => 'application',
                'name' => '3YP (421)',
                'Time' => '10/2/2017 10:16 PM'
              }
            ];

Replies are listed 'Best First'.
Re: XML data extraction (updated x2)
by haukex (Abbot) on Oct 11, 2017 at 11:58 UTC

    Whenever I hear "big XML file" I think XML::Twig, as this can efficiently process the XML file record by record without loading the whole thing into memory. The following gives you the desired output. As for your example code, I don't think you can mix XML::XPath with XML::LibXML - I think it'd be best if you only tried to use the operations provided by XML::XPath.

        use warnings;
        use strict;
        use XML::Twig;
        use Data::Dumper;

        my $file = 'OpenApi.xml';
        my @records;
        XML::Twig->new(
            twig_roots => {
                '/nodes/node/children/node/children/node' => sub {
                    my ($t, $elt) = @_;
                    my $dim = $elt->first_child('dimension');
                    push @records, {
                        name   => $elt->att('name'),
                        citype => $elt->att('ciType'),
                        status => $dim->att('status'),
                        Time   => $dim->first_child('body')
                            ->first_child('entry[@key="Last Status Change"]')
                            ->text
                    };
                    $t->purge;
                },
            },
        )->parsefile($file);
        print Dumper(\@records);

    Update: As for your code, it's just a matter of getting the XPath expression right. This also gives the desired output:

        use strict;
        use warnings;
        use XML::XPath;
        use Data::Dumper;

        my $bamxml = 'OpenApi.xml';
        my $bamxp = XML::XPath->new(filename => $bamxml);
        my $bamxpath = $bamxp->findnodes('//nodes/node/children/node/children/node');
        my @records;
        foreach my $bamnode ($bamxpath->get_nodelist) {
            my $name   = $bamxp->find('./@name',$bamnode)->string_value;
            my $citype = $bamxp->find('./@ciType',$bamnode)->string_value;
            my $status = $bamxp->find('./dimension/@status',$bamnode)->string_value;
            my $time   = $bamxp->find('./dimension/body/entry[@key="Last Status Change"]',$bamnode)->string_value;
            s/^\s+|\s+$//g for $name,$citype,$status,$time;
            push @records, {
                name   => $name,
                citype => $citype,
                status => $status,
                Time   => $time
            };
        }
        print Dumper(\@records);

    Update 2: Oops, missed your requirement "want to read node only where ciType='application'". The same XPath that choroba showed works in my code samples: '/nodes/node/children/node/children/node[@ciType="application"]'

      Thanks haukex for help.

      I would be grateful if you could help to correct the two queries below:

      I want to calculate the time difference between the date now and timereceived.

      I am getting an error that the pattern does not match. Can you share the correct pattern for "10/10/2017 11:35 PM"?

          #Begin###Calculate the time difference
          my $dtnow = DateTime->now;
          my $timereceived = "10/10/2017 11:35 PM";
          my $strp = DateTime::Format::Strptime->new(on_error=>'croak',pattern => '%m/%d/%Y %H:%M %t', time_zone=>'UTC');
          my $dtevent = $strp->parse_datetime($timereceived);
          my $diff_sec = $dtnow->subtract_datetime_absolute($dtevent)->in_units('seconds');
          my $diff_hours = sprintf("%.0f" , $diff_sec/(60*60));
          #End###Calculate the time difference

      Another query, on regular expression formatting:

          my $name = 'greenfield (Glossary) (100)';
          foreach ( $name =~ /\((.*?)\)/ ) {
              $appID = $1;
          }

      The variable $name has two values in two different brackets (Glossary) and (100). With the regular expression above I am getting this output:

      'appid' => 'Glossary'

      But I want 'appid' => '100'

      It should skip the value in the first bracket (Glossary) and pick only the value in the last bracket (100).


        Since these are new questions unrelated to the rest of the thread, it would be best to post them in a new SoPW thread (but please don't re-post now).

        pattern for "10/10/2017 11:35 PM" ... '%m/%d/%Y %H:%M %t'

        Have a look at the DateTime::Format::Strptime docs - instead of %t you need to use the pattern that matches AM/PM, and instead of %H for 24-hour time you need to use the pattern which matches 12-hour times.
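
        A minimal sketch of the pattern change described above, assuming DateTime and DateTime::Format::Strptime are installed. In the Strptime docs, %I is the hour on a 12-hour clock and %p matches AM/PM. The fixed $dtnow value is made up here so the result is reproducible (the original snippet uses DateTime->now), and the difference is taken from epoch seconds rather than subtract_datetime_absolute to keep the sketch short.

```perl
use strict;
use warnings;
use DateTime;
use DateTime::Format::Strptime;

my $timereceived = "10/10/2017 11:35 PM";

# %I = hour on a 12-hour clock, %p = AM/PM marker
my $strp = DateTime::Format::Strptime->new(
    pattern   => '%m/%d/%Y %I:%M %p',
    time_zone => 'UTC',
    on_error  => 'croak',
);
my $dtevent = $strp->parse_datetime($timereceived);

# Hypothetical fixed "now" so the example is reproducible;
# use DateTime->now in real code.
my $dtnow = DateTime->new(
    year => 2017, month  => 10, day => 11,
    hour => 11,   minute => 35, time_zone => 'UTC',
);

# Difference in whole seconds, then rounded to hours as in the original snippet.
my $diff_sec   = $dtnow->epoch - $dtevent->epoch;
my $diff_hours = sprintf("%.0f", $diff_sec / (60 * 60));

print $dtevent->iso8601, "\n";
print "$diff_hours hours\n";
```

        With these inputs the parsed hour is 23, so the event falls on the evening of Oct 10 and is 12 hours before the made-up "now".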

        expression formatting ... two different brackets (Glossary) and (100)

        Sorry but a single example is not enough to help with a regular expression. For example, can you be sure there will always be exactly two sets of parens in the string? Might there be characters after the second set of parens? Might there even be nested parens? And what strings shouldn't match the regex? Please see How to ask better questions using Test::More and sample data as well as my post here.

        Since this question is relatively basic, now might be a good time to review perlrequick and/or perlretut. You might find anchors (like ^ and $) to be useful, but again, that depends on what the various strings you're matching against look like. Also, regex101 can be a useful tool - note that it is not compatible with some of Perl's more advanced features, but for basic things can be very useful.
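
        As a sketch of the anchor idea only: this assumes the wanted value is always the last parenthesized group, that it sits at the end of the string (optionally followed by whitespace), and that parens are never nested. None of these assumptions is confirmed by the single sample string, so test against real data first.

```perl
use strict;
use warnings;

my $name = 'greenfield (Glossary) (100)';

# [^()]* keeps the capture inside a single pair of parens;
# \s*$ anchors that pair to the end of the string, so earlier
# groups such as (Glossary) cannot satisfy the match.
my ($app_id) = $name =~ /\(([^()]*)\)\s*$/;

print defined $app_id ? "appid => $app_id\n" : "no match\n";
```

        For the sample string this prints "appid => 100". A string with trailing text after the last paren would print "no match" under these assumptions.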

        Minor edits for clarity.

Re: XML data extraction
by choroba (Bishop) on Oct 11, 2017 at 12:29 UTC
    This shows how you can get the structure in XML::LibXML:

        #!/usr/bin/perl
        use warnings;
        use strict;

        use XML::LibXML;
        use Data::Dumper;

        my $bamxml = 'file.xml';
        my $dom = 'XML::LibXML'->load_xml(location => $bamxml);
        my $apps = $dom->findnodes('/nodes/node/children/node/children/node[@ciType="application"]');
        my @records;
        for my $bamnode (@$apps) {
            my $name   = $bamnode->findvalue('@name');
            my $citype = $bamnode->findvalue('@ciType');
            my $status = $bamnode->findvalue('dimension/@status');
            my $time   = $bamnode->findvalue('normalize-space(dimension/body/entry[@key="Last Status Change"])');
            push @records, {
                name   => $name,
                citype => $citype,
                status => $status,
                Time   => $time
            };
        }
        print Dumper \@records;

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: XML data extraction
by thanos1983 (Vicar) on Oct 11, 2017 at 11:40 UTC


    Is there any sample code? Did you try something that did not work?

    Are you getting an error from your code?

    Update: When I execute your code including the line (print Dumper \@records) I am getting:

        $VAR1 = [
                  {
                    'Time' => 'OK',
                    'name' => '3YP (421)',
                    'status' => '20',
                    'citype' => 'application'
                  }
                ];

    Based on your desired output, the only difference that I see is the time. Is this what you are having a problem retrieving?

    Looking forward to your update. BR / Thanos

    Seeking for Perl wisdom...on the process of learning...not there...yet!
      I have shared the sample code, but it needs updating based on the 2 given conditions to produce the required output. Thank you for the help.
          #!/usr/bin/perl
          use strict;
          use warnings;
          use XML::XPath;
          use XML::LibXML::NodeList;
          use Data::Dumper;

          my @records;
          my $bamxml = 'BAMOpenApi.xml';
          my $bamxp = XML::XPath->new(filename => $bamxml);
          my $bamxpath = $bamxp->findnodes('//nodes/node/children/node/children/node');
          my $pattern = shift;
          my $matches = XML::LibXML::NodeList->new;
          foreach my $bamnode ($bamxpath->get_nodelist) {
              my $name   = $bamxp->find('./@name',$bamnode)->string_value;
              my $citype = $bamxp->find('./@ciType',$bamnode)->string_value;
              my $status = $bamxp->find('./dimension/@status',$bamnode)->string_value;
              my $time   = $bamxp->find("./dimension/body/entry",$bamnode)->string_value;
              s/^\s+|\s+$//g for $name,$citype,$status,$time;
              push @records, {
                  name   => $name,
                  citype => $citype,
                  status => $status,
                  Time   => $time
              };
          }
Re: XML data extraction
by Anonymous Monk on Oct 11, 2017 at 17:51 UTC

    Insofar as possible, do not write Perl code that must match the structure of an XML construct: use XPath for its intended purpose. XML::LibXML includes complete XPath support, thanks to the libxml2 binary library which is an industry standard used by many, many toolsets. Even if you cannot construct an XPath expression that exactly matches what you are looking for (or if you simply do not want to take the time to try ...), XPath can certainly hand you a simple list through which your Perl code can now simply iterate.

    (Also bear in mind that most spreadsheet(!) tools also know about XML and XPath, such that sometimes you can avoid the actual business need for "a custom (Perl or otherwise) program" ... altogether. The very best program is the one that you actually didn't have to write, and this is often the case with XML.)

      1. Using XML::LibXML and XPath means you have to load the whole document into memory as a huge maze of interconnected objects. Good luck doing that with a file that actually is big. Not that it would not be a huge waste of resources even if you are able to fit it in memory.
      2. XPath is just another language to write a (part of a) program with. As soon as you are writing XPath, you are programming so the blurb about not having to write a program is nonsense. Yeah, you do not write it in Perl and use instead XPath combined with whatever expression and scripting language your spreadsheet provides. Big difference.

      Enoch was right!
      Enjoy the last years of Rome.

        > you have to load the whole document into memory

        That's not true. You can use XML::LibXML::Reader which is a pull parser, kind of like a SAX parser with the whole power of XML::LibXML available on request.
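
        A rough sketch of that pull-parser approach, assuming XML::LibXML (with XML::LibXML::Reader) is installed. The XML below is a trimmed copy of the OP's sample, inlined so the example is self-contained; for a real file the constructor would take location => $file instead of string => $xml. Unlike the fixed path used elsewhere in the thread, this sketch matches any <node> element with ciType="application" at any depth, which is an assumption about what the OP wants.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;
use Data::Dumper;

# Trimmed copy of the OP's sample data.
my $xml = <<'XML';
<nodes>
  <node name="ABC" id="1" ciType="null">
    <children>
      <node name="Business Systems" ciType="ci_collection">
        <children>
          <node name="3YP (421)" ciType="application">
            <dimension name="Availability" status="20" id="2312">
              <body>
                <entry key="Status"><![CDATA[ OK ]]></entry>
                <entry key="Last Status Change"><![CDATA[ 10/2/2017 10:16 PM ]]></entry>
              </body>
            </dimension>
          </node>
        </children>
      </node>
    </children>
  </node>
</nodes>
XML

my $reader = XML::LibXML::Reader->new(string => $xml);
my @records;

# Walk the document node by node; nothing is kept in memory
# except the subtrees we explicitly copy.
while ($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'node';
    next unless ($reader->getAttribute('ciType') // '') eq 'application';

    # Build a DOM copy of this one element (deep = 1) and query it with XPath.
    my $node = $reader->copyCurrentNode(1);
    push @records, {
        name   => $node->findvalue('@name'),
        citype => $node->findvalue('@ciType'),
        status => $node->findvalue('dimension/@status'),
        Time   => $node->findvalue(
            'normalize-space(dimension/body/entry[@key="Last Status Change"])'),
    };
}

print Dumper(\@records);
```

        Only the matching <node> subtrees are turned into DOM objects, so memory use depends on the size of one record rather than on the size of the whole file.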

        > huge

        The OP mentions "1000 of records". That doesn't sound really huge by today's standards.

Re: XML data extraction
by holli (Monsignor) on Oct 11, 2017 at 12:10 UTC
    This is the worst kind of question: the "here is what i copy-pastad from slashdot 4 years ago and now i need someone to fix it for free" - kind.
    While I can understand the need for quick fixes when deadlines loom and the desire to spend the evening with friends instead of homework, I will not help an OP like that who wants to outsource to us as free developers.


    You can lead your users to water, but alas, you cannot drown them.

      Yeah, we get lots of bad questions here, so I completely understand the general frustration. However, having replied to this poster before, I think this is one of the better wisdom seekers: usually provides enough information so the questions can be answered in a fairly straightforward manner, and AFAICT takes our advice to write their own code (example) - unless I've missed something and you've found where the code was copied from? Even without that background, the root node is well-formed: sample input, runnable, mostly well-formatted code with only two unnecessary lines, expected output - all the stuff we love to complain about loudly when it's missing ;-) So all that taken together is why I had no problem providing code in this case. (And in the end the fix to the original code turned out to be a two-line patch.)

      Sorry, my intent was not like that.

      Boy holli, you sure brought a ray of sunshine when you decided to come back around. What is your purpose here?

      The way forward always starts with a minimal test.
        To learn Perl and seek help if anyone can give it. Thank you, sir.

Node Type: perlquestion [id://1201152]
Front-paged by Corion