XPATH DOM traverse html/xml

gsiglet has asked for the wisdom of the Perl Monks concerning the following question:

Re: XPATH DOM traverse html/xml (build your own) by ikegami (Patriarch) on Dec 07, 2009 at 19:02 UTC
```perl
use strict;
use warnings;

use XML::LibXML qw( XML_ELEMENT_NODE );

sub visit {
    my ($node, $path) = @_;
    $path = '' if !defined($path);

    print("$path: ", $node->nodeName(), "\n");

    # Add a separator unless the path is empty or already ends in one.
    $path .= '/' if length($path) && $path !~ m{/\z};

    # Only element children get a path segment; text, comments, etc. are skipped.
    my @children = grep $_->nodeType() == XML_ELEMENT_NODE, $node->childNodes();
    visit($children[$_], "$path*[".($_+1).']') for 0..$#children;
}

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_fh(*STDIN);
my $root   = $doc->documentElement();
visit($root, '/*');
```

```
/*: OTA_AirSeatMapRS
/*/*[1]: Success
/*/*[2]: SeatMapResponses
/*/*[2]/*[1]: SeatMapResponse
/*/*[2]/*[1]/*[1]: FlightSegmentInfo
/*/*[2]/*[1]/*[1]/*[1]: DepartureAirport
/*/*[2]/*[1]/*[1]/*[2]: ArrivalAirport
/*/*[2]/*[1]/*[1]/*[3]: OperatingAirline
/*/*[2]/*[1]/*[1]/*[4]: MarketingAirline
/*/*[2]/*[1]/*[2]: SeatMapDetails
/*/*[2]/*[1]/*[2]/*[1]: CabinClass
/*/*[2]/*[1]/*[2]/*[1]/*[1]: AirRows
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[1]: AirRow
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[1]/*[1]: AirSeats
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[1]/*[1]/*[1]: AirSeat
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[1]/*[1]/*[2]: AirSeat
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[1]/*[1]/*[3]: AirSeat
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[1]/*[1]/*[4]: AirSeat
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[1]/*[1]/*[5]: AirSeat
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[1]/*[1]/*[6]: AirSeat
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[1]/*[2]: AirRowCharacteristics
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[2]: AirRow
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[2]/*[1]: AirSeats
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[2]/*[1]/*[1]: AirSeat
/*/*[2]/*[1]/*[2]/*[1]/*[1]/*[2]/*[1]/*[2]: AirSeat
...
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[21]/*[1]/*[5]: AirSeat
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[21]/*[1]/*[6]: AirSeat
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[21]/*[2]: AirRowCharacteristics
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[22]: AirRow
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[22]/*[1]: AirSeats
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[22]/*[1]/*[1]: AirSeat
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[22]/*[1]/*[2]: AirSeat
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[22]/*[1]/*[3]: AirSeat
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[22]/*[1]/*[4]: AirSeat
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[22]/*[1]/*[5]: AirSeat
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[22]/*[1]/*[6]: AirSeat
/*/*[2]/*[2]/*[2]/*[1]/*[1]/*[22]/*[2]: AirRowCharacteristics
/*/*[2]/*[2]/*[3]: BookingReferenceID
/*/*[2]/*[3]: AirTravelers
/*/*[2]/*[3]/*[1]: AirTraveler
/*/*[2]/*[3]/*[1]/*[1]: PersonName
/*/*[2]/*[3]/*[1]/*[1]/*[1]: GivenName
/*/*[2]/*[3]/*[1]/*[1]/*[2]: MiddleName
/*/*[2]/*[3]/*[1]/*[1]/*[3]: Surname
/*/*[2]/*[3]/*[1]/*[1]/*[4]: NameTitle
/*/*[2]/*[3]/*[1]/*[2]: TravelerRefNumber
/*/*[2]/*[3]/*[2]: AirTraveler
/*/*[2]/*[3]/*[2]/*[1]: PersonName
/*/*[2]/*[3]/*[2]/*[1]/*[1]: GivenName
/*/*[2]/*[3]/*[2]/*[1]/*[2]: MiddleName
/*/*[2]/*[3]/*[2]/*[1]/*[3]: Surname
/*/*[2]/*[3]/*[2]/*[1]/*[4]: NameTitle
/*/*[2]/*[3]/*[2]/*[2]: TravelerRefNumber
/*/*[2]/*[3]/*[3]: AirTraveler
/*/*[2]/*[3]/*[3]/*[1]: PersonName
/*/*[2]/*[3]/*[3]/*[1]/*[1]: GivenName
/*/*[2]/*[3]/*[3]/*[1]/*[2]: MiddleName
/*/*[2]/*[3]/*[3]/*[1]/*[3]: Surname
/*/*[2]/*[3]/*[3]/*[1]/*[4]: NameTitle
/*/*[2]/*[3]/*[3]/*[2]: TravelerRefNumber
```

You can use other expressions for the path segments if you want; I kept it simple. And of course, you can do things other than printing the path and element name.

Update: Fixed bugs.
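As a rough sketch of "doing other things" with each node, the same recursive walk can take a callback instead of printing. The sub name `walk`, the `%name_for` hash and the sample document are made up for illustration; only the positional `*[n]` segments come from the post above.

```perl
use strict;
use warnings;

use XML::LibXML qw( XML_ELEMENT_NODE );

# Walk the element tree, calling $cb->($path, $node) for every element.
# Path segments are positional (*[n]), as in the post above.
sub walk {
    my ($node, $path, $cb) = @_;
    $cb->($path, $node);

    my @children = grep { $_->nodeType() == XML_ELEMENT_NODE } $node->childNodes();
    for my $i (0..$#children) {
        walk($children[$i], "$path/*[" . ($i+1) . ']', $cb);
    }
}

# Hypothetical sample document, used only for illustration.
my $xml = '<root><foo>a</foo><bar><baz/></bar></root>';
my $doc = XML::LibXML->load_xml(string => $xml);

# Collect path => element name instead of printing during the walk.
my %name_for;
walk($doc->documentElement(), '/*', sub {
    my ($path, $node) = @_;
    $name_for{$path} = $node->nodeName();
});

print("$_: $name_for{$_}\n") for sort keys(%name_for);
```

Passing a code reference keeps the traversal separate from whatever is done per element, so the same `walk` could feed a report, a lookup table or a validation step.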
Re: XPATH DOM traverse html/xml (->nodePath) by ikegami (Patriarch) on Dec 07, 2009 at 19:16 UTC
Turns out that XML::LibXML can do the work for you at the cost of flexibility:

```perl
use strict;
use warnings;

use XML::LibXML qw( );

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_fh(*STDIN);
my $root   = $doc->documentElement();

# '//*' selects every element in the document, in document order.
for my $node ($root->findnodes('//*')) {
    print($node->nodePath(), ': ', $node->nodeName(), "\n");
}
```

```
/*: OTA_AirSeatMapRS
/*/*[1]: Success
/*/*[2]: SeatMapResponses
/*/*[2]/*[1]: SeatMapResponse
/*/*[2]/*[1]/*[1]: FlightSegmentInfo
/*/*[2]/*[1]/*[1]/*[1]: DepartureAirport
/*/*[2]/*[1]/*[1]/*[2]: ArrivalAirport
/*/*[2]/*[1]/*[1]/*[3]: OperatingAirline
/*/*[2]/*[1]/*[1]/*[4]: MarketingAirline
/*/*[2]/*[1]/*[2]: SeatMapDetails
/*/*[2]/*[1]/*[2]/*: CabinClass
/*/*[2]/*[1]/*[2]/*/*: AirRows
/*/*[2]/*[1]/*[2]/*/*/*[1]: AirRow
/*/*[2]/*[1]/*[2]/*/*/*[1]/*[1]: AirSeats
/*/*[2]/*[1]/*[2]/*/*/*[1]/*[1]/*[1]: AirSeat
/*/*[2]/*[1]/*[2]/*/*/*[1]/*[1]/*[2]: AirSeat
/*/*[2]/*[1]/*[2]/*/*/*[1]/*[1]/*[3]: AirSeat
/*/*[2]/*[1]/*[2]/*/*/*[1]/*[1]/*[4]: AirSeat
/*/*[2]/*[1]/*[2]/*/*/*[1]/*[1]/*[5]: AirSeat
/*/*[2]/*[1]/*[2]/*/*/*[1]/*[1]/*[6]: AirSeat
/*/*[2]/*[1]/*[2]/*/*/*[1]/*[2]: AirRowCharacteristics
/*/*[2]/*[1]/*[2]/*/*/*[2]: AirRow
/*/*[2]/*[1]/*[2]/*/*/*[2]/*[1]: AirSeats
/*/*[2]/*[1]/*[2]/*/*/*[2]/*[1]/*[1]: AirSeat
/*/*[2]/*[1]/*[2]/*/*/*[2]/*[1]/*[2]: AirSeat
...
/*/*[2]/*[2]/*[2]/*/*/*[21]/*[1]/*[5]: AirSeat
/*/*[2]/*[2]/*[2]/*/*/*[21]/*[1]/*[6]: AirSeat
/*/*[2]/*[2]/*[2]/*/*/*[21]/*[2]: AirRowCharacteristics
/*/*[2]/*[2]/*[2]/*/*/*[22]: AirRow
/*/*[2]/*[2]/*[2]/*/*/*[22]/*[1]: AirSeats
/*/*[2]/*[2]/*[2]/*/*/*[22]/*[1]/*[1]: AirSeat
/*/*[2]/*[2]/*[2]/*/*/*[22]/*[1]/*[2]: AirSeat
/*/*[2]/*[2]/*[2]/*/*/*[22]/*[1]/*[3]: AirSeat
/*/*[2]/*[2]/*[2]/*/*/*[22]/*[1]/*[4]: AirSeat
/*/*[2]/*[2]/*[2]/*/*/*[22]/*[1]/*[5]: AirSeat
/*/*[2]/*[2]/*[2]/*/*/*[22]/*[1]/*[6]: AirSeat
/*/*[2]/*[2]/*[2]/*/*/*[22]/*[2]: AirRowCharacteristics
/*/*[2]/*[2]/*[3]: BookingReferenceID
/*/*[2]/*[3]: AirTravelers
/*/*[2]/*[3]/*[1]: AirTraveler
/*/*[2]/*[3]/*[1]/*[1]: PersonName
/*/*[2]/*[3]/*[1]/*[1]/*[1]: GivenName
/*/*[2]/*[3]/*[1]/*[1]/*[2]: MiddleName
/*/*[2]/*[3]/*[1]/*[1]/*[3]: Surname
/*/*[2]/*[3]/*[1]/*[1]/*[4]: NameTitle
/*/*[2]/*[3]/*[1]/*[2]: TravelerRefNumber
/*/*[2]/*[3]/*[2]: AirTraveler
/*/*[2]/*[3]/*[2]/*[1]: PersonName
/*/*[2]/*[3]/*[2]/*[1]/*[1]: GivenName
/*/*[2]/*[3]/*[2]/*[1]/*[2]: MiddleName
/*/*[2]/*[3]/*[2]/*[1]/*[3]: Surname
/*/*[2]/*[3]/*[2]/*[1]/*[4]: NameTitle
/*/*[2]/*[3]/*[2]/*[2]: TravelerRefNumber
/*/*[2]/*[3]/*[3]: AirTraveler
/*/*[2]/*[3]/*[3]/*[1]: PersonName
/*/*[2]/*[3]/*[3]/*[1]/*[1]: GivenName
/*/*[2]/*[3]/*[3]/*[1]/*[2]: MiddleName
/*/*[2]/*[3]/*[3]/*[1]/*[3]: Surname
/*/*[2]/*[3]/*[3]/*[1]/*[4]: NameTitle
/*/*[2]/*[3]/*[3]/*[2]: TravelerRefNumber
```
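One way to sanity-check paths produced by `nodePath()` is to feed each one back into `findnodes()` and confirm it selects exactly the node it came from. This is only a sketch: the sample document and the `$checked` counter are invented for illustration, and it assumes a document with a default namespace, like the one in the output above.

```perl
use strict;
use warnings;

use XML::LibXML qw( );

# Hypothetical sample document with a default namespace.
my $xml = '<root xmlns="http://www.example.org/tooty"><foo/><foo/><bar/></root>';
my $doc = XML::LibXML->load_xml(string => $xml);

my $checked = 0;
for my $node ($doc->documentElement()->findnodes('descendant-or-self::*')) {
    my $path  = $node->nodePath();
    my @found = $doc->findnodes($path);

    # The path should select one node, and it should be the node we started from.
    die("$path did not round-trip\n")
        if @found != 1 || !$found[0]->isSameNode($node);

    print("$path: ", $node->nodeName(), "\n");
    ++$checked;
}
```

Because the generated segments use `*` rather than element names here, they can be evaluated without registering any namespace prefix.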
Re: XPATH DOM traverse html/xml (fancy) by ikegami (Patriarch) on Dec 07, 2009 at 20:20 UTC
OK, I can't leave this alone :) To get

```
/p1:root
/p1:root/p1:foo[1]
/p1:root/p1:foo[2]
/p1:root/p1:bar[1]
/p1:root/p1:bar[2]

Prefix definitions:
p1: http://www.example.org/tooty
```

instead of

```
/*
/*/*[1]
/*/*[2]
/*/*[3]
/*/*[4]
```

requires a lot more work.

```perl
use strict;
use warnings;

use XML::LibXML qw( XML_ELEMENT_NODE );

# $cx is the shared context:
#   [ prefix counter, { namespace URI => prefix }, XML::LibXML::XPathContext ]

# Return the element's name, qualified with a generated prefix (p1, p2, ...)
# if the element is in a namespace.
sub _qname {
    my ($cx, $node) = @_;
    my $name = $node->nodeName();
    my $uri  = $node->namespaceURI();
    return $name if !defined($uri);

    my $prefix = $cx->[1]{$uri};
    if (!defined($prefix)) {
        $cx->[1]{$uri} = $prefix = 'p' . ++($cx->[0]);
        $cx->[2]->registerNs($prefix, $uri);
    }

    return "$prefix:$name";
}

sub _visit {
    my ($cx, $path, $node) = @_;

    print("$path\n");

    $path .= '/' if length($path) && $path !~ m{/\z};

    my @children = grep $_->nodeType() == XML_ELEMENT_NODE, $node->childNodes();
    my %idxs;   # per-name sibling counters
    for (0..$#children) {
        my $node  = $children[$_];
        my $qname = _qname($cx, $node);
        my $i     = ++$idxs{$qname};
        _visit($cx, "$path$qname\[$i]", $node);
    }
}

# Paths relative to an arbitrary starting node.
sub visit_node {
    my ($node) = @_;
    my $cx = [ 0, {}, XML::LibXML::XPathContext->new() ];
    my $path = '.';
    _visit($cx, $path, $node);

    print("\n");
    print("Prefix definitions:\n");
    for my $uri (keys(%{ $cx->[1] })) {
        my $prefix = $cx->[1]{$uri};
        print("$prefix: $uri\n");
    }
}

# Absolute paths, starting from the document element.
sub visit_doc {
    my ($doc) = @_;
    my $node = $doc->documentElement();
    my $cx = [ 0, {}, XML::LibXML::XPathContext->new() ];
    my $path = '/' . _qname($cx, $node);
    _visit($cx, $path, $node);

    print("\n");
    print("Prefix definitions:\n");
    for my $uri (keys(%{ $cx->[1] })) {
        my $prefix = $cx->[1]{$uri};
        print("$prefix: $uri\n");
    }
}

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_fh(*STDIN);
visit_doc($doc);
```

The code doesn't use the XML::LibXML::XPathContext object it creates, but I figured you'd need it if you're building XPaths.
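To illustrate why the XPathContext matters, here is a minimal sketch of evaluating one of the generated paths. The sample document is hypothetical; the prefix `p1` and the namespace URI are the ones from the example output in the post above. An unprefixed name in XPath never matches a namespaced element, so the prefix has to be registered before the path can be used.

```perl
use strict;
use warnings;

use XML::LibXML qw( );
use XML::LibXML::XPathContext qw( );

# Hypothetical sample document in the namespace used in the post above.
my $xml = '<root xmlns="http://www.example.org/tooty">'
        . '<foo>a</foo><foo>b</foo><bar>c</bar>'
        . '</root>';
my $doc = XML::LibXML->load_xml(string => $xml);

# Bind the generated prefix to the namespace URI, then evaluate the path.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('p1', 'http://www.example.org/tooty');

my ($second_foo) = $xpc->findnodes('/p1:root/p1:foo[2]');
my $text = $second_foo->textContent();
print("$text\n");
```

The prefix only has to be consistent between `registerNs` and the path; it does not need to match anything declared in the document itself.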
Re^2: XPATH DOM traverse html/xml (fancy) by gsiglet (Acolyte) on Dec 07, 2009 at 20:43 UTC
Thank you very much! I will try your solution. Have a nice evening.
Re: XPATH DOM traverse html/xml by mirod (Canon) on Dec 08, 2009 at 16:11 UTC
I don't know which module you use, but with XML::Twig you can use the `xpath` method like this, for example:

`perl -MXML::Twig -E'my $t=XML::Twig->parse( "my.xml"); foreach my $e ($t->descendants( "#ELT")) { say $e->xpath; }'`
Re^2: XPATH DOM traverse html/xml by benizi (Hermit) on Dec 09, 2009 at 06:18 UTC
XML::Twig is one of the many things that keep me using Perl. (It's one of the modules my personal 'M.pm' module loads -- 'M.pm' so I can do `perl -MM -lwe ...`.) So, thanks again, mirod.

The first thing that came to mind was:

`perl -MXML::Twig -E 'XML::Twig->new( twig_handlers => { _all_ => sub { say $_->xpath } })->parsefile(shift)'`

...which isn't so different from what you wrote. But the following has the benefit of not requiring the entire file to be loaded (correct?):

`perl -MXML::Twig -E 'XML::Twig->new( start_tag_handlers => { _all_ => sub { say $_->xpath } })->parsefile(shift)'`
Re^3: XPATH DOM traverse html/xml by mirod (Canon) on Dec 09, 2009 at 13:18 UTC
Absolutely, the way you wrote it is more efficient. I just went for the easiest way I could think of, not knowing enough about the OP's exact constraints.
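For anyone who prefers a script to a one-liner, the `start_tag_handlers` variant might look roughly like this. The sample document and the `@paths` array are made up for illustration, and the input is parsed from a string rather than a file to keep the sketch self-contained. The handler fires as each start tag is seen, so paths become available before the rest of the input has been read; this sketch does not try to release already-processed parts of the tree.

```perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical sample document, used only for illustration.
my $xml = '<root><foo/><bar><baz/></bar></root>';

my @paths;
my $twig = XML::Twig->new(
    start_tag_handlers => {
        # Called for every element as soon as its start tag has been parsed.
        _all_ => sub {
            my ($t, $elt) = @_;
            push(@paths, $elt->xpath());
        },
    },
);
$twig->parse($xml);

print("$_\n") for @paths;
```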

