http://www.perlmonks.org?node_id=936828


in reply to XML:: Twig - can you check for text following the element being handled?


If you use a handler on the la elements, it is called as soon as the end tag of each la element is parsed. At that point, the text after that element is not yet in the document tree, so the handler has no way to access it.

E.g., take this XML:
use warnings;
use 5.014;
use XML::Twig;    # needed by the snippets that follow

our $doc = q{<?xml version="1.0" encoding="iso-8859-2" ?>
<!-- Arany János: Toldi. negyedik ének, részlet. -->
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la> arra tarka köntösében,</line>
<line>De nem <la>mert</la> szemére szállni még sokáig,</line>
<line>Szinte a pirosló hajnal hasadtáig.</line>
<line>Mert <la>félt</la> a szunyogtól, <la>félt</la> a szúrós nádtól,</line>
<line>Jobban a nádasnak csörtető vadától,</line>
<line><la>Félt</la> az üldözőknek távoli zajától,</line>
<line>De legis-legjobban Toldi nagy bajától.</line>
</verse>
};
(Sorry for the wavy ő. You can't use perlmonks code tags with non-iso-8859-1 text currently.)

and see what happens if you parse it with handlers for la elements installed:

our %xmlopt = (
    keep_spaces => 1,
    comments    => "drop",
);
binmode STDOUT, ":encoding(iso-8859-2)";

if (1) {
    my $n;
    my $tw;
    my $la_handler = sub {
        my ($tw1, $la) = @_;
        # Dump the partially built tree for the first two la elements only.
        if ($n++ < 2) {
            print "In the handler for la elements. So far, the document tree contains this: (((\n"
                . $tw->sprint . "\n)))\n";
        }
        1;
    };
    $tw = XML::Twig->new(
        twig_handlers => {"la" => $la_handler},
        %xmlopt,
    );
    $tw->parse($doc);
}

=begin output

In the handler for la elements. So far, the document tree contains this: (((
<?xml version="1.0" encoding="utf-8"?>
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la></line></verse>
)))
In the handler for la elements. So far, the document tree contains this: (((
<?xml version="1.0" encoding="utf-8"?>
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la> arra tarka köntösében,</line>
<line>De nem <la>mert</la></line></verse>
)))

=end output

=cut
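
The same timing can be checked directly rather than by reading the dumped tree. The following is a minimal sketch, not part of the original script: the two-line ASCII document and the variable names are made up for illustration, and it assumes XML::Twig is installed. It records whether each la element has a next sibling while its handler runs, then looks again once parsing has finished.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document, ASCII only to sidestep encoding issues.
my $xml = '<verse>'
        . '<line><la>one</la> after one</line>'
        . '<line><la>two</la> after two</line>'
        . '</verse>';

my @seen_in_handler;
my $tw = XML::Twig->new(
    twig_handlers => {
        la => sub {
            my ($t, $la) = @_;
            # The handler runs at the end tag of <la>, before the parser
            # has read the text that follows it.
            push @seen_in_handler,
                defined $la->next_sibling ? "present" : "missing";
            1;
        },
    },
);
$tw->parse($xml);

# After the whole document is parsed, the following text is in the tree.
my @seen_after_parse = map { $_->next_sibling_text } $tw->findnodes('//la');

print "in handler:  @seen_in_handler\n";
print "after parse: ", join(" | ", @seen_after_parse), "\n";
```

If the handler reports the sibling as missing for every la element while the post-parse pass finds the text, that confirms the la handler is simply called too early to see it.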

If your XML document isn't too large, the easiest way to extract data from it is to parse it with no handlers, then search the finished tree for the elements you want. This way you have access to the whole document, including the text after each element.

E.g.:

if (1) {
    my $tw = XML::Twig->new(%xmlopt);
    $tw->parse($doc);
    # The whole tree is available here, so the following text can be read.
    for my $la ($tw->findnodes("//la")) {
        my $t  = $la->text;
        my $ta = $la->next_sibling_text;
        print "Found an la element. Its text is ((($t))). The text immediately after is (($ta))).\n";
    }
}

=begin output

Found an la element. Its text is (((Elvetődött))). The text immediately after is (( arra tarka köntösében,))).
Found an la element. Its text is (((mert))). The text immediately after is (( szemére szállni még sokáig,))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szunyogtól, ))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szúrós nádtól,))).
Found an la element. Its text is (((Félt))). The text immediately after is (( az üldözőknek távoli zajától,))).

=end output

=cut

If your document is too large for this, it's still worth using as few handlers as possible. For example, use a single handler for whatever element represents a whole headword entry, and in this handler, iterate over the la elements inside that element. When that handler is executed, the whole entry has already been parsed, including the text after each la element, so you can access it.

if (1) {
    my $line_handler = sub {
        my ($tw1, $li) = @_;
        print "In the line handler. Full line is (((" . $li->sprint . ")))\n";
        # descendants() restricts the search to this line element.
        for my $la ($li->descendants("la")) {
            my $t  = $la->text;
            my $ta = $la->next_sibling_text;
            print "Found an la element. Its text is ((($t))). The text immediately after is (($ta))).\n";
        }
        # Free everything parsed so far to keep memory use bounded.
        $tw1->purge;
    };
    my $tw = XML::Twig->new(
        twig_handlers => {"line" => $line_handler},
        %xmlopt
    );
    $tw->parse($doc);
}

=begin output

In the line handler. Full line is (((<line>Majd az édes álom pillangó képében</line>)))
In the line handler. Full line is (((<line><la>Elvetődött</la> arra tarka köntösében,</line>)))
Found an la element. Its text is (((Elvetődött))). The text immediately after is (( arra tarka köntösében,))).
In the line handler. Full line is (((<line>De nem <la>mert</la> szemére szállni még sokáig,</line>)))
Found an la element. Its text is (((mert))). The text immediately after is (( szemére szállni még sokáig,))).
In the line handler. Full line is (((<line>Szinte a pirosló hajnal hasadtáig.</line>)))
In the line handler. Full line is (((<line>Mert <la>félt</la> a szunyogtól, <la>félt</la> a szúrós nádtól,</line>)))
Found an la element. Its text is (((félt))). The text immediately after is (( a szunyogtól, ))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szúrós nádtól,))).
In the line handler. Full line is (((<line>Jobban a nádasnak csörtető vadától,</line>)))
In the line handler. Full line is (((<line><la>Félt</la> az üldözőknek távoli zajától,</line>)))
Found an la element. Its text is (((Félt))). The text immediately after is (( az üldözőknek távoli zajától,))).
In the line handler. Full line is (((<line>De legis-legjobban Toldi nagy bajától.</line>)))

=end output

=cut
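
If the pairs are needed for later checks rather than printed on the spot, the same per-entry handler can collect them. This is only a sketch under assumptions: collect_la_pairs is a hypothetical helper name, the sample input is made up, and the element names line and la follow the sample document above rather than any real data.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns a list of [la text, text right after it].
sub collect_la_pairs {
    my ($xml) = @_;
    my @pairs;
    my $tw = XML::Twig->new(
        keep_spaces   => 1,
        twig_handlers => {
            line => sub {
                my ($t, $line) = @_;
                # The whole line is parsed by now, so the text after
                # each la element is already in the tree.
                for my $la ($line->descendants('la')) {
                    push @pairs, [ $la->text, $la->next_sibling_text ];
                }
                $t->purge;    # drop what has been handled so far
            },
        },
    );
    $tw->parse($xml);
    return @pairs;
}

my @pairs = collect_la_pairs(
    '<verse><line>Mert <la>felt</la> a, <la>felt</la> b,</line>'
  . '<line>no la here</line></verse>'
);
printf "%s => '%s'\n", $_->[0], $_->[1] for @pairs;
```

Only the collected strings survive the purge, so memory use depends on the number of la elements rather than on the size of the document.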

Re^2: XML:: Twig - can you check for text following the element being handled?
by mertserger (Curate) on Nov 09, 2011 at 10:10 UTC

    Thanks Ambrus, I thought that was the case but I wanted to make sure I hadn't overlooked anything.

    As I said, this is a legacy script I have inherited. Luckily this particular form of data does not occur often, so the user has agreed it is not worth a major rewrite of the code to accommodate it. As it stands, the validation script will occasionally raise a few warnings where the data is actually OK, but we can live with that.