
If you install a handler on the la elements, it is called as soon as the end tag of each la element is parsed. At that point the text following the element has not been parsed yet, so it is not in the document tree and the handler has no way to access it.

For example, take this XML:

use warnings;
use 5.014;
use XML::Twig;

our $doc = q{<?xml version="1.0" encoding="iso-8859-2" ?>
<!-- Arany János: Toldi. negyedik ének, részlet. -->
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la> arra tarka köntösében,</line>
<line>De nem <la>mert</la> szemére szállni még sokáig,</line>
<line>Szinte a pirosló hajnal hasadtáig.</line>
<line>Mert <la>félt</la> a szunyogtól, <la>félt</la> a szúrós nádtól,</line>
<line>Jobban a nádasnak csörtető vadától,</line>
<line><la>Félt</la> az üldözőknek távoli zajától,</line>
<line>De legis-legjobban Toldi nagy bajától.</line>
</verse>
};
(Sorry for the wavy ő. You can't use perlmonks code tags with non-iso-8859-1 text currently.)

and see what happens if you parse it with handlers for la elements installed:

our %xmlopt = (
    keep_spaces => 1,
    comments => "drop",
);
binmode STDOUT, ":encoding(iso-8859-2)";

if (1) {
    my $n;
    my $tw;
    my $la_handler = sub {
        my($tw1, $la) = @_;
        # Only report the first two calls to keep the output short.
        if ($n++ < 2) {
            print "In the handler for la elements. So far, the document tree contains this: (((\n" .
                $tw->sprint . "\n)))\n";
        }
        1;
    };
    $tw = XML::Twig->new(
        twig_handlers => {"la" => $la_handler},
        %xmlopt,
    );
    $tw->parse($doc);
}

=begin output

In the handler for la elements. So far, the document tree contains this: (((
<?xml version="1.0" encoding="utf-8"?>
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la></line></verse>
)))
In the handler for la elements. So far, the document tree contains this: (((
<?xml version="1.0" encoding="utf-8"?>
<verse>
<line>Majd az édes álom pillangó képében</line>
<line><la>Elvetődött</la> arra tarka köntösében,</line>
<line>De nem <la>mert</la></line></verse>
)))

=end output

=cut

If your XML document isn't too large, the easiest way to extract data from it is to parse it with no handlers at all and then search the finished tree for the elements you want. This way you have access to the whole document, including the text after each la element.

For example:

if (1) {
    my $tw = XML::Twig->new(%xmlopt);
    $tw->parse($doc);
    for my $la ($tw->findnodes("//la")) {
        my $t = $la->text;
        my $ta = $la->next_sibling_text;
        print "Found an la element. Its text is ((($t))). The text immediately after is (($ta))).\n";
    }
}

=begin output

Found an la element. Its text is (((Elvetődött))). The text immediately after is (( arra tarka köntösében,))).
Found an la element. Its text is (((mert))). The text immediately after is (( szemére szállni még sokáig,))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szunyogtól, ))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szúrós nádtól,))).
Found an la element. Its text is (((Félt))). The text immediately after is (( az üldözőknek távoli zajától,))).

=end output

=cut

If your document is too large for this, it is still worth using as few handlers as possible. For example, use a single handler for whatever element represents a whole headword entry, and in this handler iterate over the la elements inside that element. By the time the handler runs, the whole entry has already been parsed, including the text after each la element, so you can access it. Calling purge at the end of the handler discards what has been parsed so far, which keeps memory use down on large documents.

if (1) {
    my $line_handler = sub {
        my($tw1, $li) = @_;
        print "In the line handler. Full line is (((" . $li->sprint . ")))\n";
        for my $la ($li->findnodes("//la")) {
            my $t = $la->text;
            my $ta = $la->next_sibling_text;
            print "Found an la element. Its text is ((($t))). The text immediately after is (($ta))).\n";
        }
        # Discard everything parsed so far to free memory.
        $tw1->purge;
    };
    my $tw = XML::Twig->new(
        twig_handlers => {"line" => $line_handler},
        %xmlopt
    );
    $tw->parse($doc);
}

=begin output

In the line handler. Full line is (((<line>Majd az édes álom pillangó képében</line>)))
In the line handler. Full line is (((<line><la>Elvetődött</la> arra tarka köntösében,</line>)))
Found an la element. Its text is (((Elvetődött))). The text immediately after is (( arra tarka köntösében,))).
In the line handler. Full line is (((<line>De nem <la>mert</la> szemére szállni még sokáig,</line>)))
Found an la element. Its text is (((mert))). The text immediately after is (( szemére szállni még sokáig,))).
In the line handler. Full line is (((<line>Szinte a pirosló hajnal hasadtáig.</line>)))
In the line handler. Full line is (((<line>Mert <la>félt</la> a szunyogtól, <la>félt</la> a szúrós nádtól,</line>)))
Found an la element. Its text is (((félt))). The text immediately after is (( a szunyogtól, ))).
Found an la element. Its text is (((félt))). The text immediately after is (( a szúrós nádtól,))).
In the line handler. Full line is (((<line>Jobban a nádasnak csörtető vadától,</line>)))
In the line handler. Full line is (((<line><la>Félt</la> az üldözőknek távoli zajától,</line>)))
Found an la element. Its text is (((Félt))). The text immediately after is (( az üldözőknek távoli zajától,))).
In the line handler. Full line is (((<line>De legis-legjobban Toldi nagy bajától.</line>)))

=end output

=cut

In reply to Re: XML:: Twig - can you check for text following the element being handled? by ambrus
in thread XML:: Twig - can you check for text following the element being handled? by mertserger
